Closed bredelings closed 3 years ago
Q1: The AD field now looks like .,228. So there is no read depth for the reference. Is this expected? How is this different than 0,228?
The missing value .
is used since the REF
allele isn't called in the GT
. The way Octopus counts read assignments means that only called alleles get counts. A value of 0
would mean the calling model actually called the allele - and it therefore appears in GT
- but no reads are actually assigned to the allele during realignment.
Q2: The DP field has a lower count (225) than an individual allele (228). I'm guessing that the total number of reads is calculated different than the reads mapped to individual haplotypes or alleles, so that the total number of reads in AD can be greater than the number of reads in DP? I could ignore this if AD had a count for the reference.
The reported value for DP
is the raw filtered read depth used for calling (i.e. not realigned). On the other hand, AD
is calculated using all valid reads that are realigned to called haplotypes. So yes, the sum of AD
(ADP
), can be greater than DP
. On the other hand, ADP
can be less than DP
if there are reads that cannot be assigned to a unique allele. This can occur often in repetitive regions when a read doesn't fully span a repeat. ARF
is the fraction of assigned reads, so ADP
/ ARF
should be the total count of unfiltered realigned reads overlapping the site.
The missing value . is used since the REF allele isn't called in the GT. The way Octopus counts read assignments means that only called alleles get counts. A value of 0 would mean the calling model actually called the allele - and it therefore appears in GT - but no reads are actually assigned to the allele during realignment.
OK, with the polyclone model (--max-clones 3
) this seems to work. It seems to be 40.3 times slower than the default model.
LT635612 263319 . T C 6855.89 PASS AC=1;AN=1;DP=203;MQ=60;NS=1 GT:GQ:DP:MQ:PS:PQ:AD:FT 1:4:203:60:263319:100:.,204:PASS
LT635612 264285 . C T 7699.71 PASS AC=1;AN=1;DP=229;MQ=60;NS=1 GT:GQ:DP:MQ:PS:PQ:AD:FT 1:42:229:60:264285:100:.,231:PASS
LT635612 264561 . G A 7267.13 PASS AC=1;AN=1;DP=219;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:AD:FT 1:42:219:59:264285:100:.,224:PASS
LT635612 265023 . A T 9385.37 PASS AC=1;AN=1;DP=273;MQ=60;NS=1 GT:GQ:DP:MQ:PS:PQ:AD:FT 1:42:273:60:264285:100:.,274:PASS
LT635612 265125 . A AT,ATT 55.92 PASS AC=1,1;AN=3;DP=228;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT 0|1|2:56:228:59:265125:100:150.516,747.406,94:0.15,0.75,0.095:39,127,23:PASS
LT635612 265711 . A AT 46.05 PASS AC=1;AN=3;DP=265;MQ=58;NS=1 GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT 0|0|1:46:265:58:265711:100:62,207.663,61:0.19,0.63,0.18:165,47:PASS
LT635612 265711 . AT A 46.05 PASS AC=1;AN=3;DP=265;MQ=58;NS=1 GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT 1|0|0:46:265:58:265711:100:62,207.663,61:0.19,0.63,0.18:156,58:PASS
LT635612 266369 . A C 61.6 SB;LBQ AC=1;AN=3;DP=210;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT 0|0|1:62:210:59:266369:100:22,203.891,28:0.087,0.8,0.11:173,34:SB,LBQ
LT635612 266371 . T C 61.6 SB AC=1;AN=3;DP=214;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT 0|0|1:62:214:59:266369:100:22,203.891,28:0.087,0.8,0.11:177,34:SB
LT635612 266398 . C CA 61.6 PASS AC=1;AN=3;DP=206;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT 0|1|0:62:206:59:266369:100:22,203.891,28:0.087,0.8,0.11:163,21:PASS
LT635612 266756 . T C 8050.92 PASS AC=2;AN=2;DP=246;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT 1|1:38:246:59:266756:100:28,375.133:0.069,0.93:.,249:PASS
LT635612 266832 . C CA 37.5 PASS AC=1;AN=2;DP=252;MQ=60;NS=1 GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT 0|1:38:252:60:266756:100:28,375.133:0.069,0.93:213,18:PASS
LT635612 267701 . G A 6140.81 PASS AC=1;AN=1;DP=204;MQ=60;NS=1 GT:GQ:DP:MQ:PS:PQ:AD:FT 1:42:204:60:267701:100:.,201:PASS
LT635612 268404 . G GA 25.29 PASS AC=1;AN=2;DP=284;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT 0|1:25:284:59:268404:100:14,262.537:0.052,0.95:240,18:PASS
As you can see, we still get e.g. .,274
in some instances.
One question is, would octopus ever create a second haplotype supported by a single SNP? Or does it require at least 2 SNPs to create a haplotype? In the organism I'm looking at (vivax malaria) the nucleotide diversity is quite low (theta < 0.001, pi < 0.001) so there aren't going to be a ton of linked sites supported each real difference between the strains.
As background, malaria strains can be mixed together in a single blood sample because of multiple infections (sometimes because people are bit by multiple mosquitos, and a single mosquito can bite multiple people, etc.) and we are trying to determine whether samples have multiple strains and what the proportions are.
As you can see, we still get e.g. .,274 in some instances.
The record you're referring to:
LT635612 265023 . A T 9385.37 PASS AC=1;AN=1;DP=273;MQ=60;NS=1 GT:GQ:DP:MQ:PS:PQ:AD:FT 1:42:273:60:264285:100:.,274:PASS
does not call the reference allele (the GT
is 1
), hence the first entry in AD
is missing (.
).
One question is, would octopus ever create a second haplotype supported by a single SNP? Or does it require at least 2 SNPs to create a haplotype?
A haplotype can just have a single SNV. The simplest case would just be one variant loci with reads supporting both the reference allele and the alternative allele. That's essentially what's happening in the loci
LT635612 266756 . T C 8050.92 PASS AC=2;AN=2;DP=246;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT 1|1:38:246:59:266756:100:28,375.133:0.069,0.93:.,249:PASS
LT635612 266832 . C CA 37.5 PASS AC=1;AN=2;DP=252;MQ=60;NS=1 GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT 0|1:38:252:60:266756:100:28,375.133:0.069,0.93:213,18:PASS
The insertion alone is forcing the addition of a second haplotype since the linked SNV is homozygous. It is of course possible that the insertion is a false positive!
I actually spotted a bug in the above example: the field order of the HPC
and MAP_HF
annotations are incorrect; they should be reversed so that the insertion haplotype has MAP_HF
of 0.069
not the reference. I just fixed this in de417f9025be1e7cf0d674acf0f5b5809c663150.
The record you're referring to [...] does not call the reference allele (the GT is 1), hence the first entry in AD is missing (.).
Yeah, I get that. I'm not saying this is a bug. However, I am thinking about what to do in case downstream software does not handle "." as an allele count.
Hmm... octopus maps reads->haplotypes->alleles when computing AD counts. Other software maps reads->alleles directly, and so I think would never generate a "." for an AD field. In theory, octopus could map reads->haplotypes, and then take the realigned reads and do reads->alleles, thus letting each read independently support one of REF or ALT alleles. On the plus side, this would allow constructing meaningful AD counts for haploid data without polyclone, and would still benefit from realigning reads. One the minus side, this would add back all the noise that is removed by reducing tons of reads to a few haplotypes. I guess I can see why you do reads->haplotypes->alleles then (if I am understanding this correctly).
A haplotype can just have a single SNV. The simplest case would just be one variant loci with reads supporting both the reference allele and the alternative allele.
OK. good. In that case one approach would then be to replace .,249
with 0,249
as a post-processing step, before submitting the vcfs to downstream software that (maybe) relies on the AD fields all being actual numbers. The justification would be that octopus doesn't see sufficient evidence to create another haplotype supporting the reference allele there, so the best guess is 0
.
I am still wondering if rewriting ".,249" to "0,249" loses information. Is it ever possible for an AD field to be 0
currently?
I actually spotted a bug in the above example: the field order of the HPC and MAP_HF annotations are incorrect; they should be reversed so that the insertion haplotype has MAP_HF of 0.069 not the reference. I just fixed this in de417f9.
I see! Thank you. (It took me a few minutes to realize that in 1|1 the number 1
refers to local allele 1, and not haplotype 1, and that you can read each haplotype vertically, yielding the two haplotypes 10 and 11. This might be a common newbie mistake.)
Hmm.... One possible issue here is that the MAP_HF fields do not sum to 1.0. This is presumably because each field is truncated to 2 significant digits. Simply renormalizing seems incorrect, since the smaller field would have greater absolute precision. Maybe increase to 3 significant digits or an absolute precision of 0.001, or something?
(I'm looking at https://github.com/DEploid-dev/DEploid as downstream software, BTW. The source suggests that it assumes that AD field looks like INTEGER,INTEGER
, which is dumb. I think it needs a list of curated SNPs with population frequencies, so the VCF is probably going to need to be post-processed in any case. Hmmm...)
In theory, octopus could map reads->haplotypes, and then take the realigned reads and do reads->alleles, thus letting each read independently support one of REF or ALT alleles.
It's not as simple as it sounds. Which haplotype containing the REF
allele do you use to compare the called haplotypes against? You might suggest the reference haplotype, but this will likely be a very bad match if other variants are phased with the variant of interest so you'll get inconsistent results depending on what other variants are nearby. The same thing applies if you try using any of the called haplotypes but switching the ALT
allele to REF
- which is not trivial either in overlapping cases; there's still no guarantee that the rest of the haplotype will be a good match to the 'bad' reads.
More to the point, what is the purpose of AD
and AF
? For me, it's to quantify allele imbalance, not contamination, which is what you're presumably trying to get at by havingAD
values on uncalled alleles. Identifying contamination is a much harder problem that should be dealt with in other ways. One approach Octopus offers for some calling models is the model posterior (MP
), which is the posterior of the calling model compared with the same model +1 haplotype. Unfortunately this is turned off for most calls by default since it's quite expensive to compute. NC
and DC
for SOMATIC
and DENOVO
calls try to identify contamination in the normal/parents but only from called somatic/de novo haplotypes . Other statistics of interest just look directly at the alignments to the called haplotypes as a kind of model fit (e.g. ER
). I'm also adding a new annotation that just reports the mean log likelihood of reads aligned to each haplotype that will also be useful.
I am still wondering if rewriting ".,249" to "0,249" loses information. Is it ever possible for an AD field to be 0 currently?
I think it's fine even if bit misleading for the reasons outlined above. The AD
field can be 0
, if the prior was sufficient to call the allele even with no read evidence. This might happen in the trio model if, for example, you has good evidence for 0/1
in the child and 0/0
in one parent, but very low coverage all supporting 0
in the other parent. Then the model may still call 0/1
in the second parent depending on the de novo prior; the model considered it more likely that the allele was just missed in the parent than there being a de novo mutation.
One possible issue here is that the MAP_HF fields do not sum to 1.0. This is presumably because each field is truncated to 2 significant digits.
Thanks, I'll take a look into this.
More to the point, what is the purpose of AD and AF? For me, it's to quantify allele imbalance, not contamination, which is what you're presumably trying to get at by having AD values on uncalled alleles. Identifying contamination is a much harder problem that should be dealt with in other ways. One approach Octopus offers for some calling models is the model posterior (MP), which is the posterior of the calling model compared with the same model +1 haplotype. Unfortunately this is turned off for most calls by default since it's quite expensive to compute. NC and DC for SOMATIC and DENOVO calls try to identify contamination in the normal/parents but only from called somatic/de novo haplotypes . Other statistics of interest just look directly at the alignments to the called haplotypes as a kind of model fit (e.g. ER). I'm also adding a new annotation that just reports the mean log likelihood of reads aligned to each haplotype that will also be useful.
I would say that I'm really trying to to handle a polyploid situation where the chromosomes are not equally frequent. So if there are three strains, they are likely not (1/3,1/3,1/3). Additionally, I do not know how many strains there are -- that is what we are trying to find out! So, (i) unknown ploidy with (ii) non-equal frequencies.
The situation you describe is exactly the intended use-case for the polyclone calling model. If you're just interested in strains then AD
/AF
are not very useful since they are per-allele not per-haplotype - HPC
and MAP_HF
are per-haplotype Bayesian equivalents.
I will investigate further. I need to check that the called haplotypes do in fact correspond to the different strains (or perhaps mismapped bits), and also see about persuading the downstream software to handle the HPC/MAP_HF info.
Describe the bug
This could be a dumb question, but I'm still not quite sure how to get read depths for different alleles.
The new octopus version does output multiples field for AD now, but the reference also seems to have
.
instead of a number:Q1: The AD field now looks like
.,228
. So there is no read depth for the reference. Is this expected? How is this different than0,228
? Q2: The DP field has a lower count (225) than an individual allele (228). I'm guessing that the total number of reads is calculated different than the reads mapped to individual haplotypes or alleles, so that the total number of reads in AD can be greater than the number of reads in DP? I could ignore this if AD had a count for the reference.Thanks! -BenRI
Version
Command Command line to install octopus:
Command line to run octopus: