luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License
301 stars 37 forks source link

AF filter when there are multiple ALT alleles? #140

Closed bredelings closed 3 years ago

bredelings commented 3 years ago

Describe the bug

VCF lines are being flagged as AF instead of PASS because one allele has a low frequency, when there is a second ALT allele that has a high frequency.

Version

octopus version 0.7.0 (develop 7d195dee)
Target: x86_64 Linux 2.6.32-642.13.1.el6.x86_64
SIMD extension: AVX2
Compiler: GNU 10.1.0
Boost: 1_74

Example

LT635612        210201  .       G       GT,GTTT 22.37   PASS    AC=1,1;AN=4;DP=109;MQ=58;NS=1   GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      0|0|1|2:18:109:58:210201:100:25,45,52,38:0.15,0.28,0.33,0.24:587,284,276:PASS
LT635612        210201  .       GT      G       67.36   PASS    AC=1;AN=4;DP=110;MQ=58;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|0|0|0:18:110:58:210201:100:25,45,52,38:0.15,0.28,0.33,0.24:1033,175:PASS
LT635612        210315  .       G       C       175.68  AF      AC=1;AN=4;DP=80;MQ=59;NS=1      GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|0|0|0:176:80:59:210201:100:25,45,52,38:0.15,0.28,0.33,0.24:1514,12:AF
LT635612        210873  .       T       A       1740.53 AF      AC=1;AN=2;DP=87;MQ=59;NS=1      GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|0:214:87:59:210873:100:214.251,44:0.83,0.17:24,1546:AF
LT635612        211212  .       AG      A       65.97   PASS    AC=1;AN=5;DP=101;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|0|0|0|0:66:101:59:211212:95:24,37,29,30,38:0.15,0.23,0.18,0.19,0.24:1625,117:PASS
LT635612        211218  .       G       GA,GAA  93.55   PASS    AC=1,2;AN=5;DP=101;MQ=59;NS=1   GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      2|0|2|1|0:35:101:59:211212:95:24,37,29,30,38:0.15,0.23,0.18,0.19,0.24:1043,301,267:PASS
LT635612        211218  .       GA      G       220.65  PASS    AC=1;AN=5;DP=101;MQ=59;NS=1     GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      0|1|0|0|0:35:101:59:211212:95:24,37,29,30,38:0.15,0.23,0.18,0.19,0.24:1404,233:PASS
LT635612        211283  .       A       AT,ATT  276.97  AF      AC=1,1;AN=5;DP=84;MQ=59;NS=1    GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      0|0|1|2|0:277:84:59:211283:100:41,85,42,22,67:0.16,0.33,0.17,0.087,0.26:1163,135,30:AF
LT635612        211283  .       AT      A       437.98  PASS    AC=1;AN=5;DP=84;MQ=59;NS=1      GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|0|0|0|0:277:84:59:211283:100:41,85,42,22,67:0.16,0.33,0.17,0.087,0.26:1267,139:PASS
LT635612        211416  .       A       G       437.98  AF      AC=1;AN=5;DP=88;MQ=59;NS=1      GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      0|0|0|0|1:438:88:59:211283:100:41,85,42,22,67:0.16,0.33,0.17,0.087,0.26:1728,62:AF
LT635612        211591  .       C       *,CA,CAAA       46.18   PASS    AC=1,1,1;AN=5;DP=113;MQ=58;NS=1 GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      3|0|2|0|1:46:113:58:211591:100:77,75,49,108.18,67:0.2,0.2,0.13,0.29,0.18:1201,273,238,183:PASS
LT635612        211591  .       CA      C,*     152.02  PASS    AC=1,1;AN=5;DP=114;MQ=58;NS=1   GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      0|0|0|1|2:46:114:58:211591:100:77,75,49,108.18,67:0.2,0.2,0.13,0.29,0.18:1047,308,273:PASS
LT635612        211591  .       CAA     C,*     152.02  PASS    AC=1,1;AN=5;DP=114;MQ=58;NS=1   GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      0|0|0|2|1:46:114:58:211591:100:77,75,49,108.18,67:0.2,0.2,0.13,0.29,0.18:1357,273,273:PASS
LT635612        211698  .       A       AC      65.81   AF      AC=1;AN=5;DP=84;MQ=59;NS=1      GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      0|0|1|0|0:66:84:59:211591:100:77,75,49,108.18,67:0.2,0.2,0.13,0.29,0.18:1888,20:AF

So here is a record from running polyclone with --max-clones 5. This line:

LT635612        211283  .       A       AT,ATT  276.97  AF      AC=1,1;AN=5;DP=84;MQ=59;NS=1    GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      0|0|1|2|0:277:84:59:211283:100:41,85,42,22,67:0.16,0.33,0.17,0.087,0.26:1163,135,30:AF

gets flagged by the AF < 0.05 filter. However, only one of the two ALT alleles has AF < 0.05.

If there is only one alt allele, then censoring the allele and censoring the line are then same, but here that seems not to be the case. Thoughts?

bredelings commented 3 years ago

To be explicit, the AD counts are 1163,135, and 30. So the allele frequencies are approximate 0.88, 0.1, and 0.02. Clearly 0.02 is small (<0.05), but 0.1 should be fine.

dancooke commented 3 years ago

This is as much to do with the VCF spec as anything else - there's no specified way to filter by alleles or haplotypes. You can use some kind of custom filter naming that refers to each allele (e.g. AF_0, AF_1, ... etc), but no tools will know how to use this. Actually, the spec if pretty vague on what a FILTER'rd record actually means, in particular when there are multiple samples. Octopus also uses FORMAT/FT, and the interpretation is "the called genotype cannot be trusted", so if any of the called alleles look dodgy then the genotype will be filtered.

bredelings commented 3 years ago

This is as much to do with the VCF spec as anything else

Well, I am not shocked.

Octopus also uses FORMAT/FT, and the interpretation is "the called genotype cannot be trusted", so if any of the called alleles look dodgy then the genotype will be filtered.

Good to know!

Thanks for all the help.