Feature request: Ability to use Genotype likelihoods

alexpiper commented 4 years ago

While most human population genomics datasets can now routinely achieve >30x sequencing coverage, for a lot of non-model organism studies it's becoming more popular to instead use low-coverage sequencing (1-5x coverage) and sample more individuals from the population. The most popular methods for analysing these datasets (ANGSD and related software) use the genotype likelihoods directly rather than the called variants in order to better take uncertainty into account.

From a brief Twitter discussion with CJ, I understand it may be possible to extend Locator to work with genotype likelihoods. I think this feature would be quite valuable to those of us working with low-coverage data.

Cheers, Alex

andrewkern commented 4 years ago

Oh yeah we can definitely do this. @cjbattey do you have your hands on a decent training set of low(er) coverage data that we can get geno_liks out of?

cjbattey commented 4 years ago

Yeah I think we can do this pretty easily but TBD. In theory we can just flatten the GL matrix for each individual and pass that to the network instead of the allele count vector we've been using. I don't have a good test dataset for this though. Any ideas?
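
A rough sketch of the flattening idea, in plain Python so it runs without dependencies. It assumes biallelic sites, so each site contributes three values per individual (hom-ref, het, hom-alt). The helper name `flatten_gl` and the choice to fill missing sites with a flat, uninformative triple are illustrative assumptions, not anything Locator currently does.

```python
from typing import List, Optional, Sequence

# Hypothetical helper, not part of Locator.
# One individual's data: a list of sites, each a (0/0, 0/1, 1/1) triple of
# normalised genotype likelihoods, or None where the site is missing.
Site = Optional[Sequence[float]]

FLAT = (1 / 3, 1 / 3, 1 / 3)  # assumption: treat missing sites as uninformative


def flatten_gl(sites: Sequence[Site]) -> List[float]:
    """Flatten an (n_sites x 3) GL matrix into one vector of length 3 * n_sites."""
    flat: List[float] = []
    for site in sites:
        triple = FLAT if site is None else site
        if len(triple) != 3:
            raise ValueError("expected three likelihoods per biallelic site")
        flat.extend(triple)
    return flat


individual = [(0.9, 0.1, 0.0), None, (0.2, 0.5, 0.3)]
vector = flatten_gl(individual)
print(len(vector))  # 9
```

With NumPy the same step would be a reshape from (n_individuals, n_sites, 3) to (n_individuals, 3 * n_sites), so the input layer ends up three times wider than with one allele count per site.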

alexpiper commented 4 years ago

I've looked into this a bit more over the week. While I'm used to using the Beagle genotype likelihood format with ANGSD, the VCF spec already includes per-sample FORMAT fields for genotype likelihoods: https://github.com/samtools/hts-specs/blob/master/VCFv4.4.pdf
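
For illustration, here is a minimal sketch of pulling a per-sample likelihood field out of a single VCF data line using only the standard library. It relies on the layout in the VCF spec (FORMAT keys in the ninth column, samples from the tenth onwards); the function name `sample_field` and the toy record are made up for the example, and a real implementation would more likely lean on an existing VCF parser.

```python
from typing import List, Optional


def sample_field(record_line: str, key: str) -> List[Optional[List[float]]]:
    """Return the values of one FORMAT key (e.g. "GL" or "PL") for every sample.

    Hypothetical helper for illustration. Missing values become None.
    """
    cols = record_line.rstrip("\n").split("\t")
    fmt_keys = cols[8].split(":")  # column 9 is FORMAT
    if key not in fmt_keys:
        raise KeyError(f"{key} not present in FORMAT column")
    idx = fmt_keys.index(key)

    values: List[Optional[List[float]]] = []
    for sample in cols[9:]:  # sample columns start at column 10
        parts = sample.split(":")
        # Trailing fields may be dropped, and "." marks a missing value.
        if idx >= len(parts) or "." in parts[idx].split(",") or parts[idx] == "":
            values.append(None)
        else:
            values.append([float(v) for v in parts[idx].split(",")])
    return values


# Toy record with three samples (made-up values, not from any real dataset).
toy = "chr1\t1000\t.\tA\tG\t50\tPASS\t.\tGT:PL\t0/0:0,30,255\t0/1:40,0,60\t./.:."
print(sample_field(toy, "PL"))
```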

Maybe Locator could have an option when inputting a VCF to choose between the called genotypes (GT, the default behaviour), the genotype likelihoods (GL) if available, or the Phred-scaled genotype likelihoods (PL). PL is more commonly output by variant callers like GATK and should be back-transformable to GLs (GL = -PL/10), but the conversion is lossy due to integer rounding.
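
To make the back-transformation concrete, here is a small sketch under the VCF spec's definitions: GL is a log10-scaled likelihood and PL is the Phred-scaled version rounded to an integer, so GL = -PL/10. The function names are hypothetical, and normalising so the most likely genotype has PL 0 is an assumption about how the caller wrote the field (it is what GATK does).

```python
from typing import List, Sequence


def pl_to_gl(pl: Sequence[int]) -> List[float]:
    """Phred-scaled likelihoods -> log10 likelihoods (GL = -PL / 10)."""
    return [-p / 10.0 for p in pl]


def gl_to_pl(gl: Sequence[float]) -> List[int]:
    """log10 likelihoods -> integer PLs, with the best genotype set to 0.

    The rounding here is where information is lost.
    """
    best = max(gl)
    return [round(-10 * (g - best)) for g in gl]


def gl_to_probs(gl: Sequence[float]) -> List[float]:
    """Normalise log10 likelihoods so the three genotypes sum to 1."""
    best = max(gl)  # subtract the max first to avoid underflow
    lik = [10 ** (g - best) for g in gl]
    total = sum(lik)
    return [x / total for x in lik]


original_gl = [-0.04, -1.27, -6.33]
pl = gl_to_pl(original_gl)
recovered_gl = pl_to_gl(pl)
print(pl)            # [0, 12, 63]
print(recovered_gl)  # [0.0, -1.2, -6.3]
```

The recovered GLs differ from the originals by up to 0.05 log10 units per genotype even after rescaling, which is the rounding loss mentioned above. For a network input, the normalised values from `gl_to_probs` are probably more convenient than raw log-likelihoods, though that is a design choice rather than a requirement.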

I noticed that the Anopheles VCFs analysed in the Locator MS have the PL field present, so you could potentially use these to develop the functionality on a familiar dataset. If you want to test on some actual low-coverage data, I've had a poke around the literature looking for datasets that may be appropriate:

Human 1000 Genomes Phase 1 dataset, analysed in http://www.genome.org/cgi/doi/10.1101/gr.146084.112 : ~1000 individuals from ~14 populations with an average coverage of 5x. Subsets of it have been used in a number of studies as a benchmark for performance on low-coverage data. There is a nice data portal at https://www.internationalgenome.org/data-portal/sample where you can pick and choose appropriate populations.
Waterbuck dataset, analysed in https://doi.org/10.1534/genetics.118.301336 : 73 samples from five different sites in Africa, with sequencing depth varying from 2.23x to 4.73x. The BAM files are available at https://www.ebi.ac.uk/ena/data/view/PRJEB28089
Atlantic cod dataset, analysed in https://doi.org/10.1111/eva.12861 : 306 individuals with an average coverage of 0.67x. Only raw reads are available, here: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA560242/
Hawaiian planthoppers dataset, analysed in https://doi.org/10.1111/mec.15231 : 184 individuals at 5-15x coverage using exon capture. Again, only raw reads are available, from https://www.ncbi.nlm.nih.gov/bioproject/PRJNA341388

Note I'm not super familiar with the VCF spec and genotype likelihoods, so forgive me if I'm misunderstanding something.

kr-colab / locator

Feature request: Ability to use Genotype likelihoods #12