solgenomics / sgn

The code behind the Sol Genomics Network, Cassavabase and other Breedbase websites
https://solgenomics.net
MIT License

solGS: add option to use genotype data from multiple genotyping protocols #3854

Open isaak opened 2 years ago

isaak commented 2 years ago

Genotyping protocols often have overlapping markers. Filtering for the markers shared among protocols would allow accessions genotyped with different protocols to be included in the same analysis pipelines. (Request from Marnin.)
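A minimal sketch of what such a shared-marker filter could look like. It assumes genotypes are already loaded as nested dictionaries (protocol name -> accession -> marker -> allele dosage) and that an identical marker name denotes the same site in every protocol. The function names and the data layout are hypothetical and are not part of the solGS code base.

```python
from typing import Dict, List

# Hypothetical layout: accession -> marker name -> allele dosage (0, 1 or 2).
Genotypes = Dict[str, Dict[str, int]]


def shared_markers(protocols: Dict[str, Genotypes]) -> List[str]:
    """Return the sorted markers scored for every accession in every protocol."""
    marker_sets = [
        set(markers)
        for genotypes in protocols.values()
        for markers in genotypes.values()
    ]
    if not marker_sets:
        return []
    return sorted(set.intersection(*marker_sets))


def combine_on_shared_markers(protocols: Dict[str, Genotypes]) -> Genotypes:
    """Pool accessions from all protocols, keeping only the shared markers.

    Assumption: an accession present in several protocols is taken from
    the protocol listed last, which overwrites earlier entries.
    """
    keep = shared_markers(protocols)
    combined: Genotypes = {}
    for genotypes in protocols.values():
        for accession, markers in genotypes.items():
            combined[accession] = {marker: markers[marker] for marker in keep}
    return combined
```

Called with two protocols that share only part of their marker sets, `combine_on_shared_markers` returns one genotype table covering the accessions of both protocols, restricted to the markers they have in common.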

lukasmueller commented 2 years ago

This can be dangerous, because the same coordinate may refer to a different genomic region in a different protocol.

wolfemd commented 2 years ago

Dangerous, I agree. I do this only with care and knowledge. However, here's an example of why:

There is a genotyping protocol "West Africa 2020" that has the NRCRI training population. There is a separate one "NRCRI DarT-GBS 2021" that has the latest NRCRI offspring. In order to make the prediction, the training population phenos+genos and the offspring genos need to be joined. If I guess correctly, that workflow is not currently possible without uploading a "genotyping protocol" that merges the two?

There may be alternative solutions for this, but the standard workflow in the future will generate these disjoint VCF files.

ch728 commented 2 years ago

Yeah I am struggling with the same issue. My solution is to impute everything to the same genotyping protocol. It's a bit of a pain, but you would need to impute everything to the same marker set before running predictions anyway, right?

isaak commented 2 years ago

> Dangerous, I agree. I do this only with care and knowledge.

Would you elaborate on the care and knowledge you apply?

isaak commented 2 years ago

> There is a genotyping protocol "West Africa 2020" that has the NRCRI training population. There is a separate one "NRCRI DarT-GBS 2021" that has the latest NRCRI offspring. In order to make the prediction, the training population phenos+genos and the offspring genos need to be joined. If I guess correctly, that workflow is not currently possible without uploading a "genotyping protocol" that merges the two?

I created a ticket for this: #3858

isaak commented 2 years ago

> There is a genotyping protocol "West Africa 2020" that has the NRCRI training population. There is a separate one "NRCRI DarT-GBS 2021" that has the latest NRCRI offspring.

How much do the clones overlap between successive genotyping protocols? I thought the clones in "West Africa 2020" were also in "NRCRI DarT-GBS 2021".

Would Chris's approach solve this issue in the future?

wolfemd commented 2 years ago

> Yeah I am struggling with the same issue. My solution is to impute everything to the same genotyping protocol. It's a bit of a pain, but you would need to impute everything to the same marker set before running predictions anyway, right?

Alternatively, you can merge the VCFs you've imputed separately and then upload the merged file. But that will just produce VCFs that grow with every selection cycle, since each new cycle has to be genotyped, imputed, merged and uploaded. Perhaps they could be periodically overwritten with newer, more inclusive files?
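For illustration only, a rough sketch of such a merge in plain Python. It assumes two separately imputed VCFs with the same FORMAT column and no sample names in common, and it keeps only the sites present in both files with identical CHROM, POS, REF and ALT. The function names are hypothetical; in practice a dedicated tool such as `bcftools merge` would normally do this job.

```python
from typing import Dict, List, Tuple

SiteKey = Tuple[str, int, str, str]  # (CHROM, POS, REF, ALT)
Record = Tuple[List[str], List[str]]  # (nine fixed columns, per-sample columns)


def parse_vcf(text: str) -> Tuple[List[str], Dict[SiteKey, Record]]:
    """Parse VCF text into sample names and records keyed by site."""
    samples: List[str] = []
    records: Dict[SiteKey, Record] = {}
    for line in text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and blank lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            continue
        key = (fields[0], int(fields[1]), fields[3], fields[4])
        records[key] = (fields[:9], fields[9:])
    return samples, records


def merge_shared_sites(vcf_a: str, vcf_b: str) -> str:
    """Join the sample columns of two VCFs on the sites they share."""
    samples_a, records_a = parse_vcf(vcf_a)
    samples_b, records_b = parse_vcf(vcf_b)
    header = ["#CHROM", "POS", "ID", "REF", "ALT",
              "QUAL", "FILTER", "INFO", "FORMAT"] + samples_a + samples_b
    lines = ["##fileformat=VCFv4.2", "\t".join(header)]
    # Chromosome names sort as strings here, which is enough for a sketch.
    for key in sorted(records_a.keys() & records_b.keys()):
        fixed, calls_a = records_a[key]
        _, calls_b = records_b[key]
        lines.append("\t".join(fixed + calls_a + calls_b))
    return "\n".join(lines) + "\n"
```

The fixed columns (ID, QUAL, FILTER, INFO) are taken from the first file, which is a simplification. Sites found in only one of the two files are dropped rather than filled with missing calls.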

wolfemd commented 2 years ago

> > Dangerous, I agree. I do this only with care and knowledge.
>
> Would you elaborate on the care and knowledge you apply?

Well, my whole imputation pipeline is set up to generate compatible files. In the NRCRI example that I gave, the imputation reference panel was imputed with Beagle and contains a combination of DArT and GBS-genotyped sites. The offspring were genotyped later using DArT, and the reference panel was used to impute them. The imputed progeny tend to have a subset of the sites in the reference panel, because I do some post-imputation QC.

Equally important is that the reference genome, the chromosome and position, and the Ref/Alt alleles all match between datasets.
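As a hedged sketch, the chromosome, position and Ref/Alt part of that check could look like the following. It assumes each dataset's sites have already been read into a dictionary keyed by (chromosome, position). The `Site` type and the `check_sites` function are hypothetical names. Whether both datasets were called against the same reference genome cannot be confirmed from coordinates alone, so that still has to be verified separately.

```python
from typing import Dict, List, NamedTuple, Tuple


class Site(NamedTuple):
    ref: str  # reference allele
    alt: str  # alternate allele


Position = Tuple[str, int]  # (chromosome, position)


def check_sites(
    a: Dict[Position, Site], b: Dict[Position, Site]
) -> Tuple[List[Position], List[Position]]:
    """Split the positions shared by two datasets into matching and conflicting.

    A position is compatible only when both Ref and Alt alleles are identical;
    swapped or differing alleles are reported as conflicts.
    """
    compatible: List[Position] = []
    conflicting: List[Position] = []
    for position in sorted(a.keys() & b.keys()):
        if a[position] == b[position]:
            compatible.append(position)
        else:
            conflicting.append(position)
    return compatible, conflicting
```

A non-empty list of conflicts would be a reason to stop and inspect the two datasets before combining them, rather than silently dropping those sites.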

Hopefully that clarifies a bit?