Open teepean opened 2 years ago
Hi!
That's an interesting question. I'm not familiar with genotyping ancient DNA, but I think one challenge is that KAGE relies on relatively good prior probabilities of the genotypes for each variant in the population (and also on how likely a genotype at one variant is given the genotype at another variant). I am not sure whether this information from a "modern" database like 1000 Genomes would be accurate or representative enough for an ancient DNA sample. If the ancient individual is too different from the individuals in the databases we have available (e.g. 1000 Genomes), one option might be to tune or change these priors based on what is known about the sample.
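To illustrate why the priors matter so much at low coverage, here is a toy sketch. It is not KAGE's actual model: it assumes a simple binomial read model, and the function name `genotype_posterior`, the `error` rate, and the prior values are all made up for illustration.

```python
from math import comb


def genotype_posterior(ref_reads, alt_reads, prior, error=0.01):
    """Toy posterior over genotypes 0/0, 0/1, 1/1 for one biallelic variant.

    Assumption: each read independently shows the alt allele with a
    probability that depends only on the genotype (simple binomial model).
    """
    n = ref_reads + alt_reads
    # Probability that a single read supports the alt allele, per genotype.
    p_alt = {"0/0": error, "0/1": 0.5, "1/1": 1.0 - error}
    unnormalised = {
        g: prior[g] * comb(n, alt_reads) * p ** alt_reads * (1.0 - p) ** ref_reads
        for g, p in p_alt.items()
    }
    total = sum(unnormalised.values())
    return {g: v / total for g, v in unnormalised.items()}


# One alt read and nothing else, as might happen with a low-coverage sample.
flat_prior = {"0/0": 1 / 3, "0/1": 1 / 3, "1/1": 1 / 3}
panel_prior = {"0/0": 0.90, "0/1": 0.09, "1/1": 0.01}  # hypothetical values

flat_call = genotype_posterior(0, 1, flat_prior)
panel_call = genotype_posterior(0, 1, panel_prior)
print(max(flat_call, key=flat_call.get))
print(max(panel_call, key=panel_call.get))
```

With a single read the most probable genotype changes depending on the prior, so a prior that does not fit the ancient individual could bias the calls.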
It should also be noted that KAGE is a lot about speed, so if you only have a few samples you want to genotype, maybe another graph-based approach with full read alignment would be better suited.
I guess maybe pileupCaller and KAGE could be compared if there exists some benchmarking dataset (short reads + "truth" vcf calls) for an ancient DNA sample (or alternatively simulated data). Not sure if that exists?
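If such a dataset exists, the comparison itself could be as simple as measuring genotype concordance against the truth calls. Below is a minimal sketch, assuming the calls from each tool have already been converted to single-sample VCF text. The helper names `parse_genotypes` and `concordance` are made up for illustration and are not part of KAGE or pileupCaller.

```python
def parse_genotypes(vcf_text):
    """Map (chrom, pos, ref, alt) to the sorted alleles of the first sample."""
    calls = {}
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        format_keys = fields[8].split(":")
        sample_values = fields[9].split(":")
        gt = sample_values[format_keys.index("GT")].replace("|", "/")
        if "." in gt:
            continue  # missing genotype, treat as not called
        # Sorting ignores phase, so 0/1 and 1|0 count as the same genotype.
        calls[(chrom, pos, ref, alt)] = tuple(sorted(gt.split("/")))
    return calls


def concordance(truth, calls):
    """Compare two genotype dictionaries on the variants they share."""
    shared = truth.keys() & calls.keys()
    matching = sum(truth[key] == calls[key] for key in shared)
    return {
        "shared": len(shared),
        "matching": matching,
        "concordance": matching / len(shared) if shared else float("nan"),
        "missing_from_calls": len(truth.keys() - calls.keys()),
    }
```

Usage would be something like `concordance(parse_genotypes(truth_vcf), parse_genotypes(kage_vcf))`, repeated for the other tool. Note that this compares diploid genotypes directly, so if one call set is pseudo-haploid (one allele per site) the comparison would need to be adapted, e.g. by checking whether the called allele is present in the truth genotype.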
Hi!
Ancient DNA studies mainly use the so-called 1240k dataset as a reference ("10379 unique individuals (6442 ancient, 3937 present-day)"), but the dataset is in Eigenstrat format. Some studies seem to use the 1000 Genomes dataset as well.
As far as I know there are no graph-based programs that have been tested with aDNA. The most common method for genotyping seems to be bwa aln + pileupCaller.
vg's performance on aDNA was tested in the paper "Removing reference bias and improving indel calling in ancient DNA data analysis by mapping to a sequence variation graph".
I just saw the preprint and KAGE looks like a promising alternative to GATK.
Would it be possible to use KAGE to genotype ancient DNA samples? The gold standard at the moment seems to be pileupCaller, so what would be the best way to compare the two programs?
https://github.com/stschiff/sequenceTools/