mskcc / cmo

Command-line tools for data analysts at the CMO
GNU General Public License v2.0

cmo.util.normalize_vcf crashes on duplicates #80

Closed: kpjonsson closed this issue 6 years ago

kpjonsson commented 6 years ago

Example error message:

Duplicate alleles at 14:30047569; run with -cw to turn the error into warning or with -cs to fix.

Running with the -cs flag fixes it.

ckandoth commented 6 years ago

This appears to happen at non-variant positions, where REF and ALT are identical. That is a problem with the VarDict output, which we will investigate separately:

$ grep -P "^14\t30047569" /ifs/res/pi/Proj_06208_C.09ea1e76-1802-11e8-af40-645106efb11c/vcf/s_C_000224_T001_d.Group3.rg.md.abra.printreads.s_C_000224_N001_d.Group3.rg.md.abra.printreads.vardict.vcf
14      30047569        .       G       G       49      PASS    STATUS=StrongSomatic;SAMPLE=s_C_000224_T001_d;TYPE=Complex;SHIFT3=0;MSI=9.000;MSILEN=1;SSF=0.09573;SOR=Inf;LSEQ=TGTTGATAAGATCAATGGCT;RSEQ=AAAAAAATTACCAGTAAAAA     GT:DP:VD:ALD:RD:AD:AF:BIAS:PMEAN:PSTD:QUAL:QSTD:SBF:ODDRATIO:MQ:SN:HIAF:ADJAF:NM        0/1:28:3:0,3:1,24:25,3:0.1071:1,0:34.7:1:31.2:1:1:0:60:6:0.1111:0:1     0/0:32:0:0,0:3,28:31,0:0:2,0:30.5:1:27.8:1:1:0:60:9.333:1:0:1.2

When run through bcftools norm, the event above errors out as you describe:

$ bcftools norm -f /ifs/depot/pi/resources/genomes/GRCh37/fasta/b37.fasta -m +any -o test.vcf /ifs/res/pi/Proj_06208_C.09ea1e76-1802-11e8-af40-645106efb11c/vcf/s_C_000224_T001_d.Group3.rg.md.abra.printreads.s_C_000224_N001_d.Group3.rg.md.abra.printreads.vardict.vcf
Duplicate alleles at 14:30047569; run with -cw to turn the error into warning or with -cs to fix.
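To spot such records before normalization, a small pre-check along these lines could help. This is only a sketch: the function name find_duplicate_allele_records is hypothetical and not part of cmo, and it assumes a plain-text, tab-delimited VCF in which "duplicate alleles" means that REF or an ALT allele is repeated on the same line.

```python
def find_duplicate_allele_records(vcf_lines):
    """Yield (chrom, pos, ref, alt) for records that repeat an allele.

    Hypothetical helper, not part of cmo. Expects an iterable of
    tab-delimited VCF lines; header lines (starting with '#') are skipped.
    """
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            # Not a complete VCF record; ignore it.
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        # ALT may hold several comma-separated alleles; '.' means no ALT.
        alleles = [ref] + [a for a in alt.split(",") if a != "."]
        # A repeated allele (for example REF == ALT) shrinks the set.
        if len({a.upper() for a in alleles}) < len(alleles):
            yield chrom, int(pos), ref, alt
```

Passing an open file handle for the VarDict VCF to this function would list every position that is likely to trigger the error above.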

For now, I have worked around this in cmo.util.normalize_vcf by running bcftools norm with the option --check-ref s (the long form of -cs), which rewrites that line as follows:

14      30047569        .       G       .       49      PASS    STATUS=StrongSomatic;SAMPLE=s_C_000224_T001_d;TYPE=Complex;SHIFT3=0;MSI=9;MSILEN=1;SSF=0.09573;SOR=inf;LSEQ=TGTTGATAAGATCAATGGCT;RSEQ=AAAAAAATTACCAGTAAAAA GT:DP:VD:ALD:RD:AD:AF:BIAS:PMEAN:PSTD:QUAL:QSTD:SBF:ODDRATIO:MQ:SN:HIAF:ADJAF:NM        0/0:28:3:0,3:1,24:25,3:0.1071:1,0:34.7:1:31.2:1:1:0:60:6:0.1111:0:1     0/0:32:0:0,0:3,28:31,0:0:2,0:30.5:1:27.8:1:1:0:60:9.333:1:0:1.2
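For anyone wrapping the same call in their own script, the command can be assembled roughly as below. This is a sketch under assumptions, not the actual code in cmo.util.normalize_vcf: the function names build_norm_command and run_norm are hypothetical, and only the bcftools norm options already shown in this thread (-f, -m +any, -o, --check-ref) are used.

```python
import subprocess

# Values bcftools norm accepts for --check-ref:
# e = exit, w = warn, x = exclude, s = set/fix bad sites.
CHECK_REF_MODES = ("e", "w", "x", "s")


def build_norm_command(input_vcf, output_vcf, ref_fasta, check_ref="s"):
    """Return the argument list for a bcftools norm call (hypothetical helper)."""
    if check_ref not in CHECK_REF_MODES:
        raise ValueError("check_ref must be one of %s" % ", ".join(CHECK_REF_MODES))
    return [
        "bcftools", "norm",
        "-f", ref_fasta,           # reference FASTA
        "-m", "+any",              # same multiallelic handling as in the thread
        "--check-ref", check_ref,  # 's' fixes sites such as REF == ALT
        "-o", output_vcf,
        input_vcf,
    ]


def run_norm(input_vcf, output_vcf, ref_fasta, check_ref="s"):
    """Run bcftools norm and raise CalledProcessError on a non-zero exit."""
    cmd = build_norm_command(input_vcf, output_vcf, ref_fasta, check_ref)
    subprocess.run(cmd, check=True)
```

Keeping the command builder separate from the subprocess call makes it easy to inspect or log the exact arguments without needing bcftools installed.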

Let me know if that works. VarDict reports this event as StrongSomatic, but there is nothing we can do if we don't know what the variant is. We'll check whether we are running VarDict correctly before we report this issue to the authors.

kpjonsson commented 6 years ago

Sounds good to me.