Closed bredelings closed 3 years ago
I have a fasta reference and a subset of the bam file, but combined it is 50Mb which is too big to post here.
What are the results without the downsampling?
I ran this:
$ octopus -I ERR2679006.bam -R ~/malaria/reference/plasmo-combined.fasta -T LT635612:190000-210000 -o chr1.poly2b.vcf.gz --filter-expression "QUAL < 10 | MQ < 10 | MP < 10 | AF < 0.05 | SB > 0.98 | BQ < 15 | DP < 1 | AD < 1" --annotations AD -C polyclone --max-clones 2
I get this output
...
LT635612 193308 . C G 4502.89 PASS AC=2;AN=2;DP=162;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT 1|1:360:162:59:193308:100:285.801,52:0.85,0.15:.,1282:PASS
LT635612 193321 . T TC 355.4 AF AC=1;AN=2;DP=172;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT 0|1:355:172:59:193308:100:285.801,52:0.85,0.15:1214,42:AF
LT635612 193558 . C T 357.86 AF AC=1;AN=2;DP=139;MQ=60;NS=1 GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT 0|1:358:139:60:193308:100:285.801,52:0.85,0.15:1683,18:AF
LT635612 193795 . G A 3644.72 PASS AC=1;AN=1;DP=147;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:AD:FT 1:7:147:59:193795:100:.,1635:PASS
LT635612 194653 . TC T 216.34 PASS AC=1;AN=2;DP=281;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT 1|0:216:281:59:194653:100:143.245,367.755:0.28,0.72:1051,142:PASS
LT635612 195122 . T A 815.72 AF AC=1;AN=2;DP=142;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT 1|0:816:142:59:195122:100:88,528.363:0.14,0.86:1802,28:AF
LT635612 195264 . A T 554.15 AF AC=1;AN=2;DP=130;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT 1|0:554:130:59:195122:100:88,528.363:0.14,0.86:1757,25:AF
LT635612 195681 . C G 3531.62 PASS AC=2;AN=2;DP=119;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT 1|1:201:119:59:195681:100:253.591,41:0.86,0.14:.,1291:PASS
LT635612 195834 . T TA 201.15 AF AC=1;AN=2;DP=176;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT 0|1:201:176:59:195681:100:253.591,41:0.86,0.14:1419,39:AF
LT635612 196538 . A G 3612.96 PASS AC=2;AN=2;DP=118;MQ=60;NS=1 GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT 1|1:552:118:60:196538:100:468.126,116.874:0.8,0.2:.,1299:PASS
LT635612 196622 . G A 2634.9 AF AC=1;AN=2;DP=121;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT 1|0:552:121:59:196538:100:468.126,116.874:0.8,0.2:25,1476:AF
LT635612 197542 . A G 2456.93 AF AC=1;AN=2;DP=119;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT 0|1:287:119:59:197542:100:44,229.239:0.16,0.84:15,1491:AF
LT635612 197704 . G GA 98.69 AF AC=1;AN=2;DP=180;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT 0|1:99:180:59:197704:100:207.423,22:0.91,0.094:2001,21:AF
LT635612 198863 . CT C 72.96 AF AC=1;AN=2;DP=167;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT 1|0:73:167:59:198863:100:79,807.651:0.089,0.91:1899,18:AF
LT635612 198931 . T G 3824.19 PASS AC=2;AN=2;DP=155;MQ=59;NS=1 GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT 1|1:73:155:59:198863:100:79,807.651:0.089,0.91:.,1883:PASS
...
It looks (by eye) like there is no difference in the output, making me think that the downsampling has no effect.
However,
--downsample-above 1000 --downsample-target 500
. It seems that --downsample-above 10000 --downsample-target 10000
also produces the same output though.
It looks like the input reads are getting filtered for some other reason, and that there is a bias towards filtering reads supporting the first haplotype. Are there lots of duplicates? If you run with the --debug
option, you can check the number of reads that were filtered by each read filter.
Preliminary indications suggest that there are indeed lots of duplicates, exactly as you suggest.
At site 197,542, IGV seems to indicate that we have 13 As and 65 Gs with a depth of 78, after duplicates are removed. The reference-allele-frequency is then 16.7%, compared with 16% in the octopus output.
It seems that picard might remove more reads as duplicates than octopus does. Octopus has a depth of 119 at that site, instead of something like 78. But it doesn't seem to affect the allele frequencies much in this case.
Presumably octopus is constructing the AD counts using reads marked as duplicates? I was trying to think about a scenario where one would want to include the duplicates in the AD (or HPC) counts, and I'm not really coming up with anything. Certainly if octopus is being advertised as "you don't have to run MarkDuplicates first", counting duplicate reads in AD seems to undercut that.
By default, Octopus uses read pre-processing for calling but not for filtering (which is where most annotations are computed). The rational for read filtering for calling is that the genotype model does not account for all possible sources of error that might bias genotyping, so it's better to just ignore reads that look suspicious. On the other hand, filtering should ideally consider all of the data since we're essentially doing a model fit test on the called genotypes. Remember that the main purpose of the annotations such as AD
- from Octopus's point of view - is filtering. If you don't want this behaviour then you can use the option --use-preprocessed-reads-for-filtering
to use pre-processed reads for filtering (and therefore annotations like AD
), but be warned that you might not get as good filtering performance from Octopus, especially if using the random forest models.
Closing due to inactivity. Please re-open if needed.
Describe the bug
Haplotype frequencies, as given by e.g. MAP_HF seem to be incorrect. They disagree both with the evidence BAM and also the AD field. For example, at the first three lines in the block below, the first haplotype (1-0-0) has frequency 0.85 and the second haplotype (1-1-1) has frequency 0.15. However, the AD counts for the alleles in the second haplotype are much less than 15%. For the second locus we have 42/(42+1214) = 3.3% and for the third locus we have 18/(18+1683) = 1.1%
(I wondered if this was somehow a result of the downsampling commands, but they don't seem to make a difference).
Version
Command line to run octopus: