luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License
301 stars 37 forks source link

Incorrect haplotype frequencies with polyclone? #142

Closed bredelings closed 3 years ago

bredelings commented 3 years ago

Describe the bug

Haplotype frequencies, as given by e.g. MAP_HF seem to be incorrect. They disagree both with the evidence BAM and also the AD field. For example, at the first three lines in the block below, the first haplotype (1-0-0) has frequency 0.85 and the second haplotype (1-1-1) has frequency 0.15. However, the AD counts for the alleles in the second haplotype are much less than 15%. For the second locus we have 42/(42+1214) = 3.3% and for the third locus we have 18/(18+1683) = 1.1%

(I wondered if this was somehow a result of the downsampling commands, but they don't seem to make a difference).

LT635612        193308  .       C       G       4502.89 PASS    AC=2;AN=2;DP=162;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|1:360:162:59:193308:100:285.801,52:0.85,0.15:.,1282:PASS
LT635612        193321  .       T       TC      355.4   AF      AC=1;AN=2;DP=172;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      0|1:355:172:59:193308:100:285.801,52:0.85,0.15:1214,42:AF
LT635612        193558  .       C       T       357.86  AF      AC=1;AN=2;DP=139;MQ=60;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      0|1:358:139:60:193308:100:285.801,52:0.85,0.15:1683,18:AF
LT635612        193795  .       G       A       3644.72 PASS    AC=1;AN=1;DP=147;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:AD:FT 1:7:147:59:193795:100:.,1635:PASS
LT635612        194653  .       TC      T       216.34  PASS    AC=1;AN=2;DP=281;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|0:216:281:59:194653:100:143.245,367.755:0.28,0.72:1051,142:PASS
LT635612        195122  .       T       A       815.72  AF      AC=1;AN=2;DP=142;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|0:816:142:59:195122:100:88,528.363:0.14,0.86:1802,28:AF
LT635612        195264  .       A       T       554.15  AF      AC=1;AN=2;DP=130;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|0:554:130:59:195122:100:88,528.363:0.14,0.86:1757,25:AF
LT635612        195681  .       C       G       3531.62 PASS    AC=2;AN=2;DP=119;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|1:201:119:59:195681:100:253.591,41:0.86,0.14:.,1291:PASS
LT635612        195834  .       T       TA      201.15  AF      AC=1;AN=2;DP=176;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      0|1:201:176:59:195681:100:253.591,41:0.86,0.14:1419,39:AF
LT635612        196538  .       A       G       3612.96 PASS    AC=2;AN=2;DP=118;MQ=60;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|1:552:118:60:196538:100:468.126,116.874:0.8,0.2:.,1299:PASS
LT635612        196622  .       G       A       2634.9  AF      AC=1;AN=2;DP=121;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|0:552:121:59:196538:100:468.126,116.874:0.8,0.2:25,1476:AF
LT635612        197542  .       A       G       2456.93 AF      AC=1;AN=2;DP=119;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      0|1:287:119:59:197542:100:44,229.239:0.16,0.84:15,1491:AF
LT635612        197704  .       G       GA      98.69   AF      AC=1;AN=2;DP=180;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      0|1:99:180:59:197704:100:207.423,22:0.91,0.094:2001,21:AF
LT635612        198863  .       CT      C       72.96   AF      AC=1;AN=2;DP=167;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|0:73:167:59:198863:100:79,807.651:0.089,0.91:1899,18:AF
LT635612        198931  .       T       G       3824.19 PASS    AC=2;AN=2;DP=155;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|1:73:155:59:198863:100:79,807.651:0.089,0.91:.,1883:PASS

Version

$ octopus --version
octopus version 0.7.0 (develop b83ce113)
Target: x86_64 Linux 2.6.32-642.11.1.el6.x86_64
SIMD extension: SSE2
Compiler: GNU 10.1.0
Boost: 1_74

Command line to run octopus:

$ octopus -I ERR2679006.bam -R ~/malaria/reference/plasmo-combined.fasta -T LT635612 -o chr1.poly2.vcf.gz --bamout chr1.poly2.bam --bamout-type FULL  --filter-expression "QUAL < 10 | MQ < 10 | MP < 10 | AF < 0.05 | SB > 0.98 | BQ < 15 | DP < 1 | AD < 1" --annotations AD -C polyclone  --max-clones 2 --threads 15 --downsample-above=300 --downsample-target=200
bredelings commented 3 years ago

I have a fasta reference and a subset of the bam file, but combined it is 50Mb which is too big to post here.

dancooke commented 3 years ago

What are the results without the downsampling?

bredelings commented 3 years ago

I ran this:

$ octopus -I ERR2679006.bam -R ~/malaria/reference/plasmo-combined.fasta -T LT635612:190000-210000 -o chr1.poly2b.vcf.gz --filter-expression "QUAL < 10 | MQ < 10 | MP < 10 | AF < 0.05 | SB > 0.98 | BQ < 15 | DP < 1 | AD < 1" --annotations AD -C polyclone  --max-clones 2

I get this output

...
LT635612        193308  .       C       G       4502.89 PASS    AC=2;AN=2;DP=162;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|1:360:162:59:193308:100:285.801,52:0.85,0.15:.,1282:PASS
LT635612        193321  .       T       TC      355.4   AF      AC=1;AN=2;DP=172;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      0|1:355:172:59:193308:100:285.801,52:0.85,0.15:1214,42:AF
LT635612        193558  .       C       T       357.86  AF      AC=1;AN=2;DP=139;MQ=60;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      0|1:358:139:60:193308:100:285.801,52:0.85,0.15:1683,18:AF
LT635612        193795  .       G       A       3644.72 PASS    AC=1;AN=1;DP=147;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:AD:FT 1:7:147:59:193795:100:.,1635:PASS
LT635612        194653  .       TC      T       216.34  PASS    AC=1;AN=2;DP=281;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|0:216:281:59:194653:100:143.245,367.755:0.28,0.72:1051,142:PASS
LT635612        195122  .       T       A       815.72  AF      AC=1;AN=2;DP=142;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|0:816:142:59:195122:100:88,528.363:0.14,0.86:1802,28:AF
LT635612        195264  .       A       T       554.15  AF      AC=1;AN=2;DP=130;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|0:554:130:59:195122:100:88,528.363:0.14,0.86:1757,25:AF
LT635612        195681  .       C       G       3531.62 PASS    AC=2;AN=2;DP=119;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|1:201:119:59:195681:100:253.591,41:0.86,0.14:.,1291:PASS
LT635612        195834  .       T       TA      201.15  AF      AC=1;AN=2;DP=176;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      0|1:201:176:59:195681:100:253.591,41:0.86,0.14:1419,39:AF
LT635612        196538  .       A       G       3612.96 PASS    AC=2;AN=2;DP=118;MQ=60;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|1:552:118:60:196538:100:468.126,116.874:0.8,0.2:.,1299:PASS
LT635612        196622  .       G       A       2634.9  AF      AC=1;AN=2;DP=121;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|0:552:121:59:196538:100:468.126,116.874:0.8,0.2:25,1476:AF
LT635612        197542  .       A       G       2456.93 AF      AC=1;AN=2;DP=119;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      0|1:287:119:59:197542:100:44,229.239:0.16,0.84:15,1491:AF
LT635612        197704  .       G       GA      98.69   AF      AC=1;AN=2;DP=180;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      0|1:99:180:59:197704:100:207.423,22:0.91,0.094:2001,21:AF
LT635612        198863  .       CT      C       72.96   AF      AC=1;AN=2;DP=167;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|0:73:167:59:198863:100:79,807.651:0.089,0.91:1899,18:AF
LT635612        198931  .       T       G       3824.19 PASS    AC=2;AN=2;DP=155;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|1:73:155:59:198863:100:79,807.651:0.089,0.91:.,1883:PASS
...

It looks (by eye) like there is no difference in the output, making me think that the downsampling has no effect.

However,

It seems that --downsample-above 10000 --downsample-target 10000 also produces the same output though.

dancooke commented 3 years ago

It looks like the input reads are getting filtered for some other reason, and that there is a bias towards filtering reads supporting the first haplotype. Are there lots of duplicates? If you run with the --debug option, you can check the number of reads that were filtered by each read filter.

bredelings commented 3 years ago

Preliminary indications suggest that there are indeed lots of duplicates, exactly as you suggest.

At site 197,542, IGV seems to indicate that we have 13 As and 65 Gs with a depth of 78, after duplicates are removed. The reference-allele-frequency is then 16.7%, compared with 16% in the octopus output.

It seems that picard might remove more reads as duplicates than octopus does. Octopus has a depth of 119 at that site, instead of something like 78. But it doesn't seem to affect the allele frequencies much in this case.

Presumably octopus is constructing the AD counts using reads marked as duplicates? I was trying to think about a scenario where one would want to include the duplicates in the AD (or HPC) counts, and I'm not really coming up with anything. Certainly if octopus is being advertised as "you don't have to run MarkDuplicates first", counting duplicate reads in AD seems to undercut that.

dancooke commented 3 years ago

By default, Octopus uses read pre-processing for calling but not for filtering (which is where most annotations are computed). The rational for read filtering for calling is that the genotype model does not account for all possible sources of error that might bias genotyping, so it's better to just ignore reads that look suspicious. On the other hand, filtering should ideally consider all of the data since we're essentially doing a model fit test on the called genotypes. Remember that the main purpose of the annotations such as AD - from Octopus's point of view - is filtering. If you don't want this behaviour then you can use the option --use-preprocessed-reads-for-filtering to use pre-processed reads for filtering (and therefore annotations like AD), but be warned that you might not get as good filtering performance from Octopus, especially if using the random forest models.

dancooke commented 3 years ago

Closing due to inactivity. Please re-open if needed.