luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License

somatic benchmark variants are missing for cancer calling model with tumor only bam file #200

Closed ctom442021 closed 2 years ago

ctom442021 commented 2 years ago

Dear Developers of octopus,

I am using the octopus (6.2) cancer calling model with a tumor-only BAM file (benchmark data downloaded with: wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/tutorials/datasets/tutorial_11136.tar.gz). The command was:

octopus -I tumor.bam -R hg38/chr17.fasta -C cancer -o tumor.octopus.cancer.vcf

The groundtruth variants at low VAF ranges (<40%) are mostly called correctly:

chr17 | 2394409 | . | G | T | 92.98 | PASS | AC=1;AN=3;DP=68;MQ=60;MQ0=0;NS=1;PP=13.55;SOMATIC | GT:GQ:DP:MQ:PS:PQ:MAP_VAF:VAF_CR:FT | 0|0|1:999:68:60:2394409:99:0.22:0.14,0.3:PASS
chr17 | 5541887 | . | C | T | 1714.29 | PASS | AC=1;AN=3;DP=68;MQ=60;MQ0=0;NS=1;PP=2.91;SOMATIC | GT:GQ:DP:MQ:PS:PQ:MAP_VAF:VAF_CR:FT | 0|0|1:999:68:60:5541887:99:0.91:0.85,0.96:PASS
chr17 | 6707176 | . | G | C | 109.9 | PASS | AC=1;AN=3;DP=53;MQ=60;MQ0=0;NS=1;PP=7.64;SOMATIC | GT:GQ:DP:MQ:PS:PQ:MAP_VAF:VAF_CR:FT | 0|0|1:999:53:60:6707176:99:0.28:0.18,0.38:PASS
chr17 | 6779878 | . | G | A | 108.87 | PASS | AC=1;AN=3;DP=76;MQ=60;MQ0=0;NS=1;PP=15.12;SOMATIC | GT:GQ:DP:MQ:PS:PQ:MAP_VAF:VAF_CR:FT | 0|0|1:79:76:60:6779878:99:0.21:0.14,0.28:PASS
chr17 | 10410710 | . | C | G | 62.2 | PASS | AC=1;AN=3;DP=54;MQ=60;MQ0=0;NS=1;PP=14.48;SOMATIC | GT:GQ:DP:MQ:PS:PQ:MAP_VAF:VAF_CR:FT | 0|0|1:999:54:60:10410710:99:0.19:0.11,0.28:PASS
chr17 | 11990510 | . | C | T | 43.51 | PASS | AC=1;AN=3;DP=31;MQ=60;MQ0=0;NS=1;PP=9.27;SOMATIC | GT:GQ:DP:MQ:PS:PQ:MAP_VAF:VAF_CR:FT | 0|0|1:865:31:60:11990510:99:0.2:0.1,0.32:PASS
chr17 | 29609163 | . | G | C | 70.63 | PASS | AC=1;AN=3;DP=51;MQ=60;MQ0=0;NS=1;PP=12.67;SOMATIC | GT:GQ:DP:MQ:PS:PQ:MAP_VAF:VAF_CR:FT | 0|0|1:999:51:60:29609163:99:0.21:0.13,0.31:PASS
chr17 | 34983021 | . | A | G | 61.78 | PASS | AC=1;AN=3;DP=48;MQ=59;MQ0=0;NS=1;PP=5.33;SOMATIC | GT:GQ:DP:MQ:PS:PQ:MAP_VAF:VAF_CR:FT | 0|0|1:189:48:59:34983021:99:0.18:0.11,0.27:PASS
chr17 | 69016308 | . | A | G | 89.76 | PASS | AC=1;AN=3;DP=192;MQ=60;MQ0=0;NS=1;PP=13.19;SOMATIC | GT:GQ:DP:MQ:PS:PQ:MAP_VAF:VAF_CR:FT | 0|0|1:999:192:60:69016308:99:0.13:0.091,0.17:PASS
chr17 | 75225975 | . | C | T | 63.03 | PASS | AC=1;AN=3;DP=42;MQ=60;MQ0=0;NS=1;PP=12.81;SOMATIC | GT:GQ:DP:MQ:PS:PQ:MAP_VAF:VAF_CR:FT | 0|0|1:999:42:60:75225975:99:0.22:0.13,0.33:PASS
chr17 | 78532472 | . | C | T | 61.54 | PASS | AC=1;AN=3;DP=42;MQ=60;MQ0=0;NS=1;PP=14.74;SOMATIC | GT:GQ:DP:MQ:PS:PQ:MAP_VAF:VAF_CR:FT | 0|0|1:332:42:60:78532472:99:0.18:0.11,0.26:PASS
chr17 | 81511656 | . | G | C | 159.8 | PASS | AC=1;AN=3;DP=100;MQ=60;MQ0=0;NS=1;PP=12.49;SOMATIC | GT:GQ:DP:MQ:PS:PQ:MAP_VAF:VAF_CR:FT | 0|0|1:999:100:60:81511656:99:0.25:0.18,0.32:PASS
chr17 | 81683794 | . | C | T | 56.79 | PASS | AC=1;AN=3;DP=44;MQ=60;MQ0=0;NS=1;PP=12.81;SOMATIC | GT:GQ:DP:MQ:PS:PQ:MAP_VAF:VAF_CR:FT | 0|0|1:999:44:60:81683794:99:0.2:0.11,0.3:PASS

Could you suggest which parameters we could adjust to troubleshoot this, or to retrieve the ground-truth variants at medium and high VAF ranges? The following calls are not labelled as SOMATIC (so how does octopus perform germline/somatic classification in the tumor-only model?):

chr17 | 4632718 | . | G | A | 95.46 | PASS | AC=1;AN=2;DP=30;MQ=60;MQ0=0;NS=1;PP=6.46 | GT:GQ:DP:MQ:PS:PQ:FT | 1|0:6:30:60:4632718:99:PASS
chr17 | 7674220 | . | C | T | 2238.2 | PASS | AC=2;AN=2;DP=76;MQ=60;MQ0=0;NS=1;PP=49.97 | GT:GQ:DP:MQ:PS:PQ:FT | 1|1:50:76:60:7674220:99:PASS
chr17 | 50461355 | . | G | A | 893.6 | PASS | AC=2;AN=2;DP=37;MQ=60;MQ0=0;NS=1;PP=20.51 | GT:GQ:DP:MQ:PS:PQ:FT | 1|1:18:37:60:50461355:99:PASS
chr17 | 75239434 | . | G | A | 1506.2 | PASS | AC=2;AN=2;DP=54;MQ=60;MQ0=0;NS=1;PP=26.89 | GT:GQ:DP:MQ:PS:PQ:FT | 1|1:23:54:60:75239434:99:PASS
chr17 | 76626447 | . | G | T | 1720.3 | PASS | AC=2;AN=2;DP=57;MQ=60;MQ0=0;NS=1;PP=47.05 | GT:GQ:DP:MQ:PS:PQ:FT | 1|1:47:57:60:76626447:99:PASS
chr17 | 82374762 | . | C | G | 909.52 | PASS | AC=2;AN=2;DP=38;MQ=60;MQ0=0;NS=1;PP=26.14 | GT:GQ:DP:MQ:PS:PQ:FT | 1|1:26:38:60:82374762:99:PASS
chr17 | 19556155 | . | G | A | 64.41 | PASS | AC=1;AN=2;DP=12;MQ=60;MQ0=0;NS=1;PP=9.05 | GT:GQ:DP:MQ:PS:PQ:FT | 1|0:9:12:60:19556155:99:PASS
chr17 | 39988669 | . | G | A | 103.64 | PASS | AC=1;AN=2;DP=22;MQ=60;MQ0=0;NS=1;PP=10.81 | GT:GQ:DP:MQ:PS:PQ:FT | 1|0:11:22:60:39988669:99:PASS
chr17 | 41349384 | . | G | A | 386 | PASS | AC=1;AN=2;DP=50;MQ=60;MQ0=0;NS=1;PP=13.46 | GT:GQ:DP:MQ:PS:PQ:FT | 1|0:13:50:60:41349384:99:PASS
chr17 | 66965103 | . | T | C | 31.36 | PASS | AC=1;AN=2;DP=8;MQ=60;MQ0=0;NS=1;PP=7.31 | GT:GQ:DP:MQ:PS:PQ:FT | 1|0:7:8:60:66965103:99:PASS
chr17 | 81671752 | . | A | T | 659.18 | PASS | AC=1;AN=2;DP=76;MQ=60;MQ0=0;NS=1;PP=15.76 | GT:GQ:DP:MQ:PS:PQ:FT | 0|1:16:76:60:81671752:99:PASS
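To compare the two groups of calls side by side, the records can be split into named fields. The following is only a minimal sketch: it assumes the " | "-delimited layout pasted in this thread (a real VCF file is tab-delimited), and `parse_record` is a hypothetical helper name, not part of octopus. It splits on " | " with surrounding spaces so that the phased genotype separator (as in 0|0|1) is left intact.

```python
def parse_record(line):
    """Parse one ' | '-delimited record as pasted in this thread.

    Assumption: columns follow the usual VCF order
    (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, one sample).
    """
    chrom, pos, _id, ref, alt, qual, _filter, info, fmt, sample = line.strip().split(" | ")

    # INFO is a ';'-separated list of KEY=VALUE pairs and bare flags (e.g. SOMATIC).
    info_map = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_map[key] = value  # bare flags map to an empty string

    # FORMAT names and sample values are parallel ':'-separated lists.
    sample_map = dict(zip(fmt.split(":"), sample.split(":")))

    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt,
        "qual": float(qual),
        "pp": float(info_map["PP"]),
        "somatic": "SOMATIC" in info_map,
        "genotype": sample_map["GT"],
        # MAP_VAF is only present on the records flagged SOMATIC in this thread.
        "map_vaf": float(sample_map["MAP_VAF"]) if "MAP_VAF" in sample_map else None,
    }


# Two records copied from the lists above: one flagged SOMATIC, one not.
records = [
    "chr17 | 2394409 | . | G | T | 92.98 | PASS | AC=1;AN=3;DP=68;MQ=60;MQ0=0;NS=1;PP=13.55;SOMATIC | GT:GQ:DP:MQ:PS:PQ:MAP_VAF:VAF_CR:FT | 0|0|1:999:68:60:2394409:99:0.22:0.14,0.3:PASS",
    "chr17 | 7674220 | . | C | T | 2238.2 | PASS | AC=2;AN=2;DP=76;MQ=60;MQ0=0;NS=1;PP=49.97 | GT:GQ:DP:MQ:PS:PQ:FT | 1|1:50:76:60:7674220:99:PASS",
]
parsed = [parse_record(r) for r in records]

for rec in parsed:
    print(rec["chrom"], rec["pos"], rec["genotype"], rec["somatic"], rec["pp"], rec["map_vaf"])
```

For files straight from the caller, an established VCF parser would be the safer choice; this sketch only handles the text as it appears here.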

Many thanks!

dancooke commented 2 years ago

Somatic classification becomes harder as the VAF increases because heterozygous germline variants have an expected VAF of 50% in diploid regions. The behaviour you observe is expected: a germline variant is more likely than a somatic mutation a priori, so variant calls towards 50% VAF are more likely to be classified as germline (see Figure 5 of the Octopus paper). You can try setting --somatic-snv-prior appropriately for the cancer type you're analysing, which may improve classification accuracy, but in general classification for tumour-only calling is hard. Note that all of these calls have low PP (posterior probability) even though most have high QUAL. This tells you that although the variant allele is likely correct, the germline/somatic classification is uncertain.
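The effect of the prior can be illustrated with a deliberately simplified two-hypothesis model. This is a toy sketch, not Octopus's haplotype-based model: it assumes a heterozygous germline variant produces alt reads as Binomial(depth, 0.5), that a somatic mutation has an unknown VAF uniform on [0, 1], and that the prior values are arbitrary placeholders chosen for illustration (they are not Octopus defaults). The function names are hypothetical. The Phred conversion assumes PP is reported on the Phred scale.

```python
from math import comb


def phred_to_probability(phred):
    """Convert a Phred-scaled score to a probability of being correct.

    Assumption: the score is -10 * log10(error probability).
    """
    return 1.0 - 10.0 ** (-phred / 10.0)


def toy_somatic_posterior(alt_reads, depth, somatic_prior, germline_prior=1e-3):
    """Posterior probability of 'somatic' versus 'germline het' in a toy model.

    Both priors are illustrative assumptions, not values taken from Octopus.
    """
    # Germline heterozygous: alt reads ~ Binomial(depth, 0.5).
    germline_likelihood = comb(depth, alt_reads) * 0.5 ** depth
    # Somatic: integrating the binomial over a uniform VAF gives 1 / (depth + 1).
    somatic_likelihood = 1.0 / (depth + 1)

    somatic = somatic_prior * somatic_likelihood
    germline = germline_prior * germline_likelihood
    return somatic / (somatic + germline)


depth = 68
for alt_reads in (15, 34):  # roughly 22% and 50% VAF at this depth
    for somatic_prior in (1e-4, 1e-3):
        posterior = toy_somatic_posterior(alt_reads, depth, somatic_prior)
        print(f"alt={alt_reads}/{depth} prior={somatic_prior:g} P(somatic)={posterior:.3f}")

# A PP of 13.55 read as a probability, under the Phred-scale assumption.
print(f"PP=13.55 -> {phred_to_probability(13.55):.3f}")
```

In this toy model a call near 22% VAF is assigned to the somatic hypothesis almost regardless of the prior, while a call near 50% VAF stays mostly germline, and raising the somatic prior shifts it only partway. That matches the qualitative behaviour described above, although the real model differs in its details.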

ctom442021 commented 2 years ago

dancooke, I sincerely appreciate your kind consideration and prompt reply! So PP is the measure for germline/somatic classification? How does octopus use it to decide between somatic and germline? Thanks again!