luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License
301 stars 37 forks source link

AD allele depth info missing for the reference? #138

Closed bredelings closed 3 years ago

bredelings commented 3 years ago

Describe the bug

This could be a dumb question, but I'm still not quite sure how to get read depths for different alleles.

The new octopus version does output multiples field for AD now, but the reference also seems to have . instead of a number:

LT635612        205974  .       A       G       6990.45 PASS    AC=1;AN=1;DP=225;MQ=60;NS=1     GT:GQ:DP:MQ:PS:PQ:AD:FT 1:6990:225:60:205974:100:.,228:PASS

Q1: The AD field now looks like .,228. So there is no read depth for the reference. Is this expected? How is this different than 0,228? Q2: The DP field has a lower count (225) than an individual allele (228). I'm guessing that the total number of reads is calculated different than the reads mapped to individual haplotypes or alleles, so that the total number of reads in AD can be greater than the number of reads in DP? I could ignore this if AD had a count for the reference.

Thanks! -BenRI

Version

$ octopus --version
octopus version 0.7.0 (develop 0271b0fd)
Target: x86_64 Linux 2.6.32-642.13.1.el6.x86_64
SIMD extension: SSE2
Compiler: GNU 10.1.0
Boost: 1_74

Command Command line to install octopus:

$ /scripts/install.py -c gcc-10 -cxx g++-10  --boost /data/wraycompute/malaria/Applications/boost_1_74_0  --gmp /data/wraycompute/malaria/Applications/gmp/local/ --htslib /data/wraycompute/malaria/Applications/htslib/local.new/ --threads 16

Command line to run octopus:

$ octopus -I region.bam -R ~/malaria/reference/plasmo-combined.fasta -o region.g.vcf.gz -P 1 --bamout region.octo.bam --bamout-type FULL  --filter-expression "QUAL < 10 | MQ < 10 | MP < 10 | AF < 0.05 | SB > 0.98 | BQ < 15 | DP < 1 | AD < 1" --annotations AD
dancooke commented 3 years ago

Q1: The AD field now looks like .,228. So there is no read depth for the reference. Is this expected? How is this different than 0,228?

The missing value . is used since the REF allele isn't called in the GT. The way Octopus counts read assignments means that only called alleles get counts. A value of 0 would mean the calling model actually called the allele - and it therefore appears in GT - but no reads are actually assigned to the allele during realignment.

Q2: The DP field has a lower count (225) than an individual allele (228). I'm guessing that the total number of reads is calculated different than the reads mapped to individual haplotypes or alleles, so that the total number of reads in AD can be greater than the number of reads in DP? I could ignore this if AD had a count for the reference.

The reported value for DP is the raw filtered read depth used for calling (i.e. not realigned). On the other hand, AD is calculated using all valid reads that are realigned to called haplotypes. So yes, the sum of AD (ADP), can be greater than DP. On the other hand, ADP can be less than DP if there are reads that cannot be assigned to a unique allele. This can occur often in repetitive regions when a read doesn't fully span a repeat. ARF is the fraction of assigned reads, so ADP / ARF should be the total count of unfiltered realigned reads overlapping the site.

bredelings commented 3 years ago

The missing value . is used since the REF allele isn't called in the GT. The way Octopus counts read assignments means that only called alleles get counts. A value of 0 would mean the calling model actually called the allele - and it therefore appears in GT - but no reads are actually assigned to the allele during realignment.

OK, with the polyclone model (--max-clones 3) this seems to work. It seems to be 40.3 times slower than the default model.

LT635612        263319  .       T       C       6855.89 PASS    AC=1;AN=1;DP=203;MQ=60;NS=1     GT:GQ:DP:MQ:PS:PQ:AD:FT 1:4:203:60:263319:100:.,204:PASS
LT635612        264285  .       C       T       7699.71 PASS    AC=1;AN=1;DP=229;MQ=60;NS=1     GT:GQ:DP:MQ:PS:PQ:AD:FT 1:42:229:60:264285:100:.,231:PASS
LT635612        264561  .       G       A       7267.13 PASS    AC=1;AN=1;DP=219;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:AD:FT 1:42:219:59:264285:100:.,224:PASS
LT635612        265023  .       A       T       9385.37 PASS    AC=1;AN=1;DP=273;MQ=60;NS=1     GT:GQ:DP:MQ:PS:PQ:AD:FT 1:42:273:60:264285:100:.,274:PASS
LT635612        265125  .       A       AT,ATT  55.92   PASS    AC=1,1;AN=3;DP=228;MQ=59;NS=1   GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      0|1|2:56:228:59:265125:100:150.516,747.406,94:0.15,0.75,0.095:39,127,23:PASS
LT635612        265711  .       A       AT      46.05   PASS    AC=1;AN=3;DP=265;MQ=58;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      0|0|1:46:265:58:265711:100:62,207.663,61:0.19,0.63,0.18:165,47:PASS
LT635612        265711  .       AT      A       46.05   PASS    AC=1;AN=3;DP=265;MQ=58;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|0|0:46:265:58:265711:100:62,207.663,61:0.19,0.63,0.18:156,58:PASS
LT635612        266369  .       A       C       61.6    SB;LBQ  AC=1;AN=3;DP=210;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      0|0|1:62:210:59:266369:100:22,203.891,28:0.087,0.8,0.11:173,34:SB,LBQ
LT635612        266371  .       T       C       61.6    SB      AC=1;AN=3;DP=214;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      0|0|1:62:214:59:266369:100:22,203.891,28:0.087,0.8,0.11:177,34:SB
LT635612        266398  .       C       CA      61.6    PASS    AC=1;AN=3;DP=206;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      0|1|0:62:206:59:266369:100:22,203.891,28:0.087,0.8,0.11:163,21:PASS
LT635612        266756  .       T       C       8050.92 PASS    AC=2;AN=2;DP=246;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|1:38:246:59:266756:100:28,375.133:0.069,0.93:.,249:PASS
LT635612        266832  .       C       CA      37.5    PASS    AC=1;AN=2;DP=252;MQ=60;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      0|1:38:252:60:266756:100:28,375.133:0.069,0.93:213,18:PASS
LT635612        267701  .       G       A       6140.81 PASS    AC=1;AN=1;DP=204;MQ=60;NS=1     GT:GQ:DP:MQ:PS:PQ:AD:FT 1:42:204:60:267701:100:.,201:PASS
LT635612        268404  .       G       GA      25.29   PASS    AC=1;AN=2;DP=284;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      0|1:25:284:59:268404:100:14,262.537:0.052,0.95:240,18:PASS

As you can see, we still get e.g. .,274 in some instances.

One question is, would octopus ever create a second haplotype supported by a single SNP? Or does it require at least 2 SNPs to create a haplotype? In the organism I'm looking at (vivax malaria) the nucleotide diversity is quite low (theta < 0.001, pi < 0.001) so there aren't going to be a ton of linked sites supported each real difference between the strains.

As background, malaria strains can be mixed together in a single blood sample because of multiple infections (sometimes because people are bit by multiple mosquitos, and a single mosquito can bite multiple people, etc.) and we are trying to determine whether samples have multiple strains and what the proportions are.

dancooke commented 3 years ago

As you can see, we still get e.g. .,274 in some instances.

The record you're referring to:

LT635612        265023  .       A       T       9385.37 PASS    AC=1;AN=1;DP=273;MQ=60;NS=1     GT:GQ:DP:MQ:PS:PQ:AD:FT 1:42:273:60:264285:100:.,274:PASS

does not call the reference allele (the GT is 1), hence the first entry in AD is missing (.).

One question is, would octopus ever create a second haplotype supported by a single SNP? Or does it require at least 2 SNPs to create a haplotype?

A haplotype can just have a single SNV. The simplest case would just be one variant loci with reads supporting both the reference allele and the alternative allele. That's essentially what's happening in the loci

LT635612        266756  .       T       C       8050.92 PASS    AC=2;AN=2;DP=246;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|1:38:246:59:266756:100:28,375.133:0.069,0.93:.,249:PASS
LT635612        266832  .       C       CA      37.5    PASS    AC=1;AN=2;DP=252;MQ=60;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      0|1:38:252:60:266756:100:28,375.133:0.069,0.93:213,18:PASS

The insertion alone is forcing the addition of a second haplotype since the linked SNV is homozygous. It is of course possible that the insertion is a false positive!

I actually spotted a bug in the above example: the field order of the HPC and MAP_HF annotations are incorrect; they should be reversed so that the insertion haplotype has MAP_HF of 0.069 not the reference. I just fixed this in de417f9025be1e7cf0d674acf0f5b5809c663150.

bredelings commented 3 years ago

The record you're referring to [...] does not call the reference allele (the GT is 1), hence the first entry in AD is missing (.).

Yeah, I get that. I'm not saying this is a bug. However, I am thinking about what to do in case downstream software does not handle "." as an allele count.

Hmm... octopus maps reads->haplotypes->alleles when computing AD counts. Other software maps reads->alleles directly, and so I think would never generate a "." for an AD field. In theory, octopus could map reads->haplotypes, and then take the realigned reads and do reads->alleles, thus letting each read independently support one of REF or ALT alleles. On the plus side, this would allow constructing meaningful AD counts for haploid data without polyclone, and would still benefit from realigning reads. One the minus side, this would add back all the noise that is removed by reducing tons of reads to a few haplotypes. I guess I can see why you do reads->haplotypes->alleles then (if I am understanding this correctly).

A haplotype can just have a single SNV. The simplest case would just be one variant loci with reads supporting both the reference allele and the alternative allele.

OK. good. In that case one approach would then be to replace .,249 with 0,249 as a post-processing step, before submitting the vcfs to downstream software that (maybe) relies on the AD fields all being actual numbers. The justification would be that octopus doesn't see sufficient evidence to create another haplotype supporting the reference allele there, so the best guess is 0.

I am still wondering if rewriting ".,249" to "0,249" loses information. Is it ever possible for an AD field to be 0 currently?

I actually spotted a bug in the above example: the field order of the HPC and MAP_HF annotations are incorrect; they should be reversed so that the insertion haplotype has MAP_HF of 0.069 not the reference. I just fixed this in de417f9.

I see! Thank you. (It took me a few minutes to realize that in 1|1 the number 1 refers to local allele 1, and not haplotype 1, and that you can read each haplotype vertically, yielding the two haplotypes 10 and 11. This might be a common newbie mistake.)

Hmm.... One possible issue here is that the MAP_HF fields do not sum to 1.0. This is presumably because each field is truncated to 2 significant digits. Simply renormalizing seems incorrect, since the smaller field would have greater absolute precision. Maybe increase to 3 significant digits or an absolute precision of 0.001, or something?

(I'm looking at https://github.com/DEploid-dev/DEploid as downstream software, BTW. The source suggests that it assumes that AD field looks like INTEGER,INTEGER, which is dumb. I think it needs a list of curated SNPs with population frequencies, so the VCF is probably going to need to be post-processed in any case. Hmmm...)

dancooke commented 3 years ago

In theory, octopus could map reads->haplotypes, and then take the realigned reads and do reads->alleles, thus letting each read independently support one of REF or ALT alleles.

It's not as simple as it sounds. Which haplotype containing the REF allele do you use to compare the called haplotypes against? You might suggest the reference haplotype, but this will likely be a very bad match if other variants are phased with the variant of interest so you'll get inconsistent results depending on what other variants are nearby. The same thing applies if you try using any of the called haplotypes but switching the ALT allele to REF - which is not trivial either in overlapping cases; there's still no guarantee that the rest of the haplotype will be a good match to the 'bad' reads.

More to the point, what is the purpose of AD and AF? For me, it's to quantify allele imbalance, not contamination, which is what you're presumably trying to get at by havingAD values on uncalled alleles. Identifying contamination is a much harder problem that should be dealt with in other ways. One approach Octopus offers for some calling models is the model posterior (MP), which is the posterior of the calling model compared with the same model +1 haplotype. Unfortunately this is turned off for most calls by default since it's quite expensive to compute. NC and DC for SOMATIC and DENOVO calls try to identify contamination in the normal/parents but only from called somatic/de novo haplotypes . Other statistics of interest just look directly at the alignments to the called haplotypes as a kind of model fit (e.g. ER). I'm also adding a new annotation that just reports the mean log likelihood of reads aligned to each haplotype that will also be useful.

I am still wondering if rewriting ".,249" to "0,249" loses information. Is it ever possible for an AD field to be 0 currently?

I think it's fine even if bit misleading for the reasons outlined above. The AD field can be 0, if the prior was sufficient to call the allele even with no read evidence. This might happen in the trio model if, for example, you has good evidence for 0/1 in the child and 0/0 in one parent, but very low coverage all supporting 0 in the other parent. Then the model may still call 0/1 in the second parent depending on the de novo prior; the model considered it more likely that the allele was just missed in the parent than there being a de novo mutation.

One possible issue here is that the MAP_HF fields do not sum to 1.0. This is presumably because each field is truncated to 2 significant digits.

Thanks, I'll take a look into this.

bredelings commented 3 years ago

More to the point, what is the purpose of AD and AF? For me, it's to quantify allele imbalance, not contamination, which is what you're presumably trying to get at by having AD values on uncalled alleles. Identifying contamination is a much harder problem that should be dealt with in other ways. One approach Octopus offers for some calling models is the model posterior (MP), which is the posterior of the calling model compared with the same model +1 haplotype. Unfortunately this is turned off for most calls by default since it's quite expensive to compute. NC and DC for SOMATIC and DENOVO calls try to identify contamination in the normal/parents but only from called somatic/de novo haplotypes . Other statistics of interest just look directly at the alignments to the called haplotypes as a kind of model fit (e.g. ER). I'm also adding a new annotation that just reports the mean log likelihood of reads aligned to each haplotype that will also be useful.

I would say that I'm really trying to to handle a polyploid situation where the chromosomes are not equally frequent. So if there are three strains, they are likely not (1/3,1/3,1/3). Additionally, I do not know how many strains there are -- that is what we are trying to find out! So, (i) unknown ploidy with (ii) non-equal frequencies.

dancooke commented 3 years ago

The situation you describe is exactly the intended use-case for the polyclone calling model. If you're just interested in strains then AD/AF are not very useful since they are per-allele not per-haplotype - HPC and MAP_HF are per-haplotype Bayesian equivalents.

bredelings commented 3 years ago

I will investigate further. I need to check that the called haplotypes do in fact correspond to the different strains (or perhaps mismapped bits), and also see about persuading the downstream software to handle the HPC/MAP_HF info.