luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License

Malformed VCF file / strange allele format #119

Closed DaGaMs closed 4 years ago

DaGaMs commented 4 years ago

Describe the bug: I ran the VCF output file of Octopus (0.6.3) through GATK Funcotator, which crashed with the message:

The provided VCF file is malformed at approximately line number 821: unparsable vcf record with allele *ACAC, for input source: test.annotated.vcf.gz

The line in question, with neighbouring lines, is here:

chr1    16599261        .       CTTCTT  C       64.57   RF      AC=1;AN=5;DP=58;MP=0.01;MQ=45;MQ0=0;NS=2;PP=64.57;RFQUAL_ALL=0;SOMATIC  GT:GQ:DP:MQ:PS:PQ:MAP_VAF:VAF_CR:RFQUAL:FT  0|0:169:24:44:16599151:99:.:.,.:0:RF     0|0|1:169:34:46:16599151:99:0.21:0.14,0.28:0.06:RF
chr1    16599508        .       T       TACACACACACACACACAC,*   181.84  RF      AC=1,2;AN=6;DP=75;MP=0;MQ=51;MQ0=0;NS=2;PP=181.84;RFQUAL_ALL=0;SOMATIC  GT:GQ:DP:MQ:PS:PQ:MAP_VAF:VAF_CR:RFQUAL:FT   2|0:14:28:51:16599497:77:.:.,.:0.04:RF  2|0|1|0:14:47:50:16599497:77:0.29:0.2,0.38:0.01:RF
chr1    16599508        .       TACAC   T,*ACAC 179.34  RF      AC=3;AN=6;DP=83;MP=0;MQ=50;MQ0=0;NS=2;PP=179.34;RFQUAL_ALL=0;SOMATIC    GT:GQ:DP:MQ:PS:PQ:MAP_VAF:VAF_CR:RFQUAL:FT  2|0:14:30:51:16599497:77:.:.,.:0.11:RF   2|0|0|1:14:53:49:16599497:77:0.29:0.17,0.35:0.06:RF
chr1    16600147        .       AC      A       1.1     RF      AC=1;AN=5;DP=29;MP=0.91;MQ=31;MQ0=0;NS=2;PP=1.1;RFQUAL_ALL=0;SOMATIC    GT:GQ:DP:MQ:PS:PQ:MAP_VAF:VAF_CR:RFQUAL:FT  0|0:407:10:26:16600147:99:.:.,.:0.01:RF  0|0|1:407:19:33:16600147:99:0.22:0.14,0.31:0.02:RF
chr1    16601940        .       C       T       5.83    RF      AC=1;AN=5;DP=50;MP=31.29;MQ=27;MQ0=0;NS=2;PP=2.59;RFQUAL_ALL=0;SOMATIC  GT:GQ:DP:MQ:PS:PQ:MAP_VAF:VAF_CR:RFQUAL:FT  0|0:998:20:25:16601940:99:.:.,.:0:RF     0|0|1:998:30:29:16601940:99:0.13:0.048,0.23:0.02:RF
chr1    16602561        .       C       T       124.19  RF      AC=1;AN=5;DP=39;MP=13.19;MQ=33;MQ0=0;NS=2;PP=20.89;RFQUAL_ALL=0.01;SOMATIC      GT:GQ:DP:MQ:PS:PQ:MAP_VAF:VAF_CR:RFQUAL:FT   0|0:172:11:34:16602561:99:.:.,.:0.09:RF 0|0|1:172:28:32:16602561:99:0.36:0.25,0.49:0.24:RF
chr1    16604578        .       G       A       19.45   RF      AC=1;AN=5;DP=73;MP=0;MQ=40;MQ0=0;NS=2;PP=19.45;RFQUAL_ALL=0;SOMATIC     GT:GQ:DP:MQ:PS:PQ:MAP_VAF:VAF_CR:RFQUAL:FT  0|0:91:39:40:16604511:99:.:.,.:0:RF      0|0|1:91:34:40:16604511:99:0.086:0.042,0.14:0.09:RF
chr1    16607103        .       AAGAG   A       10.24   RF      AC=1;AN=5;DP=99;MP=24.44;MQ=59;MQ0=0;NS=2;PP=10.24;RFQUAL_ALL=0.03;SOMATIC      GT:GQ:DP:MQ:PS:PQ:MAP_VAF:VAF_CR:RFQUAL:FT   0|0:999:41:59:16607103:99:.:.,.:0.36:RF 0|0|1:999:58:60:16607103:99:0.066:0.028,0.12:0.44:RF
chr1    16611485        .       G       GA,GAA  108.2   RF      AC=2,1;AN=5;DP=135;MP=1.92;MQ=59;MQ0=0;NS=2;PP=439.71;RFQUAL_ALL=0;SOMATIC      GT:GQ:DP:MQ:PS:PQ:MAP_VAF:VAF_CR:RFQUAL:FT   1|0:240:58:58:16611485:99:.:.,.:0.09:RF 1|0|2:240:77:59:16611485:99:0.28:0.22,0.34:0.13:RF
chr1    16619243        .       A       G       1.93    RF      AC=1;AN=5;DP=47;MP=4.77;MQ=38;MQ0=0;NS=2;PP=1.93;RFQUAL_ALL=0;SOMATIC   GT:GQ:DP:MQ:PS:PQ:MAP_VAF:VAF_CR:RFQUAL:FT  0|0:56:15:36:16619178:99:.:.,.:0:RF      0|0|1:56:32:39:16619178:99:0.094:0.042,0.16:0.01:RF

What does *ACAC mean? It does seem a bit of an odd notation?

dancooke commented 4 years ago

This is due to the way that I interpreted * characters in the ALT VCF field. In summary: the spec is ambiguous; Octopus uses * as a base, whereas GATK uses it as a symbolic allele. You can read a detailed discussion here. It looks like the powers that be will decide in favour of the GATK style, in which case I will be forced to change Octopus' representation.

DaGaMs commented 4 years ago

Holy moly, I see, you had a fun exchange with Heng about this. Anyway, it's a pretty frustrating situation for me right now. I went the brute-force way of filtering out all alt alleles that match \*\S :-/
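For anyone needing the same stopgap, here is a minimal Python sketch of that brute-force filter. It is an illustration, not the script actually used: the function names are hypothetical, the input is assumed to be tab-delimited VCF text, and it drops the whole record when any ALT allele matches \*\S, since removing a single allele would also require re-indexing genotypes and per-allele INFO fields such as AC.

```python
import re

# Pattern from the comment above: '*' immediately followed by a non-space character.
STAR_PREFIXED = re.compile(r"\*\S")


def has_star_prefixed_alt(record_line: str) -> bool:
    """Return True if any ALT allele of a VCF data line matches \\*\\S.

    Assumes a tab-delimited VCF data line (ALT is the fifth column).
    """
    fields = record_line.rstrip("\n").split("\t")
    if len(fields) < 5:
        return False  # not a complete data line; leave it alone
    alts = fields[4].split(",")
    return any(STAR_PREFIXED.search(allele) for allele in alts)


def filter_vcf_lines(lines):
    """Yield header lines and records without '*'-prefixed ALT alleles.

    Assumption: the whole record is dropped instead of rewriting it.
    """
    for line in lines:
        if line.startswith("#") or not has_star_prefixed_alt(line):
            yield line
```

A record whose ALT is T,*ACAC is dropped, while one whose ALT is TACACACACACACACACAC,* is kept, because a bare * has no trailing character for the pattern to match.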

dancooke commented 4 years ago

As of f99fec430d1ed694f59e05dc3c20a9f55998b4c1, Octopus conforms to the recently modified VCF spec: * is now 'symbolic' and will only appear on its own as an ALT.