kr-colab / locator

deep learning prediction of geographic location from individual genome sequences

Feature request: Ability to use Genotype likelihoods #12

Open alexpiper opened 4 years ago

alexpiper commented 4 years ago

While most human population genomics datasets now routinely achieve >30x sequencing coverage, for a lot of non-model organism studies it's becoming more popular to instead use low-coverage sequencing (1-5x coverage) and sample more individuals from the population. The most popular methods for analysing these datasets (ANGSD and related software) use the genotype likelihoods directly rather than the called variants, in order to better take genotype uncertainty into account.

From a brief Twitter discussion with CJ, I understand it may be possible to extend Locator to work with genotype likelihoods. I think this feature would be quite valuable to those of us working with low-coverage data.

Cheers, Alex

andrewkern commented 4 years ago

Oh yeah, we can definitely do this. @cjbattey do you have your hands on a decent training set of low(er) coverage data that we can get geno_liks out of?

cjbattey commented 4 years ago

Yeah, I think we can do this pretty easily, but TBD. In theory we can just flatten the GL matrix for each individual and pass that to the network instead of the allele count vector we've been using. I don't have a good test dataset for this though. Any ideas?
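To make the flattening idea concrete, here is a minimal sketch, not Locator's actual code. The function name `flatten_gl` and the array layout `(n_sites, n_samples, 3)` are assumptions for illustration; it also assumes biallelic sites, diploid samples, and no missing values. It converts log10 likelihoods to normalised genotype probabilities and returns one flat feature vector per individual.

```python
import numpy as np


def flatten_gl(gl):
    """Turn log10 genotype likelihoods into one flat feature row per sample.

    gl: array of shape (n_sites, n_samples, 3), holding log10 likelihoods
        for the 0/0, 0/1 and 1/1 genotypes (assumed layout, no missing data).
    Returns an array of shape (n_samples, n_sites * 3).
    """
    gl = np.asarray(gl, dtype=float)
    # Subtract the per-genotype-triple maximum before exponentiating so that
    # very negative log10 values do not underflow to all zeros.
    lik = 10.0 ** (gl - gl.max(axis=2, keepdims=True))
    # Normalise each triple so the three genotype values sum to 1.
    prob = lik / lik.sum(axis=2, keepdims=True)
    # Put samples first, then flatten (site, genotype) into a single axis.
    return prob.transpose(1, 0, 2).reshape(prob.shape[1], -1)
```

The resulting matrix has three columns per site instead of one allele count per site, so the network's input layer would need to be three times wider than with called genotypes.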

alexpiper commented 4 years ago

I've looked into this a bit more over the week. While I'm used to using the Beagle genotype likelihood format with ANGSD, the VCF spec already includes fields for genotype likelihoods: https://github.com/samtools/hts-specs/blob/master/VCFv4.4.pdf

Maybe Locator could have an option when inputting a VCF to choose between the called genotypes (GT, the default behaviour), the genotype likelihoods (GL) if available, or the Phred-scaled genotype likelihoods (PL). The PL is more commonly output by variant callers like GATK and should be back-transformable to GLs, but the conversion is lossy due to integer rounding.
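As a rough sketch of how that option could work on a single VCF sample entry: the helper names `pl_to_gl` and `sample_likelihoods` below are hypothetical and not part of Locator or any VCF library. The sketch relies only on the VCF definitions that GL holds log10-scaled likelihoods and PL holds the same quantity Phred-scaled, so GL = -PL / 10.

```python
def pl_to_gl(pl):
    """Convert Phred-scaled likelihoods (PL) to log10 likelihoods (GL)."""
    return [-p / 10.0 for p in pl]


def sample_likelihoods(format_str, sample_str, prefer=("GL", "PL")):
    """Return (source_field, log10_likelihoods) for one sample, or None.

    format_str: the VCF FORMAT column, e.g. "GT:AD:DP:GQ:PL"
    sample_str: the matching sample column, e.g. "0/1:3,2:5:45:45,0,60"
    prefer:     fields to try, in order of preference
    """
    fields = dict(zip(format_str.split(":"), sample_str.split(":")))
    for key in prefer:
        value = fields.get(key)
        # Skip fields that are absent or contain the VCF missing marker ".".
        if value is None or any(v == "." for v in value.split(",")):
            continue
        numbers = [float(v) for v in value.split(",")]
        return key, (numbers if key == "GL" else pl_to_gl(numbers))
    return None  # caller can fall back to the called genotype (GT)


# Example: a GATK-style entry that carries PL but no GL.
source, gl = sample_likelihoods("GT:AD:DP:GQ:PL", "0/1:3,2:5:45:45,0,60")
```

Because PL values are rounded to integers, the recovered GLs are only accurate to about 0.1 log10 units, which is the rounding loss mentioned above.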

I noticed that the Anopheles VCFs analysed in the Locator MS have the PL field present, so you could potentially use this to develop the functionality on a familiar dataset. If you want to test on some actual low-coverage data, I've had a poke around the literature looking for datasets that may be appropriate:

Note I'm not super familiar with the VCF spec and genotype likelihoods, so forgive me if I'm misunderstanding something.