mskcc / vcf2maf

Convert a VCF into a MAF, where each variant is annotated to only one of all possible gene isoforms

vcf2maf fails when converting InDels @ GRCm38 #30

Closed sebastianlange closed 8 years ago

sebastianlange commented 8 years ago

This minimal VCF file cannot be converted by vcf2maf 1.6.3 (the indel site fails to be annotated), while VEP (both the online and the stand-alone version) annotates it without problems:
11:96283450-96283451 deletion intron_variant, feature_truncation MODIFIER Hoxb8 ENSMUSG00000056648 T

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=REJECT,Description="Rejected as a confident somatic mutation">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=BQ,Number=A,Type=Float,Description="Average base quality for reads supporting alleles">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=FA,Number=A,Type=Float,Description="Allele fraction of the alternate allele with regard to reference">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SS,Number=1,Type=Integer,Description="Variant status relative to non-adjacent Normal,0=wildtype,1=germline,2=somatic,3=LOH,4=post-transcriptional modification,5=unknown">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Somatic event">
##INFO=<ID=VT,Number=1,Type=String,Description="Variant type, can be SNP, INS or DEL">
##contig=<ID=10,length=130694993>
##contig=<ID=11,length=122082543>
##contig=<ID=12,length=120129022>
##contig=<ID=13,length=120421639>
##contig=<ID=14,length=124902244>
##contig=<ID=15,length=104043685>
##contig=<ID=16,length=98207768>
##contig=<ID=17,length=94987271>
##contig=<ID=18,length=90702639>
##contig=<ID=19,length=61431566>
##contig=<ID=1,length=195471971>
##contig=<ID=2,length=182113224>
##contig=<ID=3,length=160039680>
##contig=<ID=4,length=156508116>
##contig=<ID=5,length=151834684>
##contig=<ID=6,length=149736546>
##contig=<ID=7,length=145441459>
##contig=<ID=8,length=129401213>
##contig=<ID=9,length=124595110>
##contig=<ID=X,length=171031299>
##contig=<ID=Y,length=91744698>
##reference=file:///data1/misc/Genomes/GRCm38/GRCm38.fa
##INFO=<ID=SF,Number=.,Type=String,Description="Source File (index to sourceFiles, f when filtered)">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  64_2B
11  74388661    .   C   A   .   PASS    AC=1;AN=2;SF=38;SOMATIC;VT=SNP  GT:BQ:DP:FA:SS:AD   0/1:34:30:0.1:2:27,3
11  96283450    .   TG  T   .   PASS    AC=1;AN=2;END=96283451;HOMLEN=8;HOMSEQ=GGGGGGGG;SF=0,2;SVLEN=-1;SVTYPE=DEL  GT:AD   0/1:18,5 

Running vcf2maf on it leads to this error:

~/Packages/vcf2maf-1.6.3$ perl vcf2maf.pl --vep-path ~/Packages/vep-v82/ --vep-data ~/Packages/.vep-v82/ --ref-fasta ~/Packages/.vep-v82/mus_musculus/82_GRCm38/Mus_musculus.GRCm38.dna.primary_assembly.fa  --ncbi-build GRCm38 --species mus_musculus --input-vcf ~/Packages/vcf2maf-1.6.2/1.vcf --output-maf ~/Packages/vcf2maf-1.6.3/1.maf --tumor-id 64_2B --normal-id 64_2B_Normal
STATUS: Running VEP and writing to: /home/engleitner/Packages/vcf2maf-1.6.3/1.vep.vcf
2015-11-05 10:23:54 - Read existing cache info
2015-11-05 10:23:54 - Starting...
2015-11-05 10:23:54 - Detected format of input file as vcf
2015-11-05 10:23:54 - Read 2 variants into buffer
2015-11-05 10:23:54 - Calculating consequences

Use of uninitialized value in pattern match (m//) at /home/engleitner/Packages/vep-v82/Bio/EnsEMBL/Variation/Utils/VEP.pm line 1642.
2015-11-05 10:23:55 - Writing output
2015-11-05 10:23:55 - Processed 2 total variants (2 vars/sec, 2 vars/sec total)
2015-11-05 10:23:55 - Finished!

Argument "" isn't numeric in numeric eq (==) at vcf2maf.pl line 604, <GEN1> line 45.
Argument "" isn't numeric in numeric eq (==) at vcf2maf.pl line 604, <GEN1> line 45.
Argument "" isn't numeric in numeric eq (==) at vcf2maf.pl line 604, <GEN1> line 45.
Argument "" isn't numeric in numeric eq (==) at vcf2maf.pl line 604, <GEN1> line 45.
Use of uninitialized value $effect in pattern match (m//) at vcf2maf.pl line 697, <GEN1> line 45.
Use of uninitialized value $effect in string eq at vcf2maf.pl line 698, <GEN1> line 45.
Use of uninitialized value $effect in string eq at vcf2maf.pl line 699, <GEN1> line 45.
Use of uninitialized value $effect in string eq at vcf2maf.pl line 700, <GEN1> line 45.
Use of uninitialized value $effect in string eq at vcf2maf.pl line 701, <GEN1> line 45.
Use of uninitialized value $effect in pattern match (m//) at vcf2maf.pl line 702, <GEN1> line 45.
Use of uninitialized value $effect in string eq at vcf2maf.pl line 703, <GEN1> line 45.
Use of uninitialized value $effect in string eq at vcf2maf.pl line 704, <GEN1> line 45.
Use of uninitialized value $effect in pattern match (m//) at vcf2maf.pl line 705, <GEN1> line 45.
Use of uninitialized value $effect in pattern match (m//) at vcf2maf.pl line 706, <GEN1> line 45.
Use of uninitialized value $effect in pattern match (m//) at vcf2maf.pl line 707, <GEN1> line 45.
Use of uninitialized value $effect in pattern match (m//) at vcf2maf.pl line 708, <GEN1> line 45.
Use of uninitialized value $effect in pattern match (m//) at vcf2maf.pl line 709, <GEN1> line 45.
Use of uninitialized value $effect in string eq at vcf2maf.pl line 710, <GEN1> line 45.
Use of uninitialized value $effect in pattern match (m//) at vcf2maf.pl line 711, <GEN1> line 45.
Use of uninitialized value $effect in string eq at vcf2maf.pl line 712, <GEN1> line 45.
Use of uninitialized value $effect in string eq at vcf2maf.pl line 713, <GEN1> line 45.
Use of uninitialized value in join or string at vcf2maf.pl line 689, <GEN1> line 45.

ckandoth commented 8 years ago

Thanks for reporting this. Let me take a look.

ckandoth commented 8 years ago

Sorry for putting this on the back burner. I tracked this down to a bug in VEP where it fails to report the Allele in the CSQ output when SVTYPE=DEL is specified in the INFO field. If you remove SVTYPE=DEL, it seems to work fine. Can you test this out and let me know? If you can confirm, we'll report this to dev@ensembl.

ckandoth commented 8 years ago

Actually, it looks more like a feature of VEP, which annotates larger SVs (structural variants) differently. The Allele is reported in CSQ, but it simply says "deletion". VEP also skips reporting the HGVS notation of the variant and the exon/intron numbers. I have pushed a fixed vcf2maf.pl to the master branch that should handle this condition gracefully, but I'd recommend not defining SVTYPE in the INFO field for small indels, so that VEP provides more granular information on them.

Thanks again for reporting this!
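
For anyone who wants to follow the recommendation above without editing the VCF by hand, here is a minimal Python sketch that drops SVTYPE from the INFO column of records whose alleles are spelled out explicitly. It is not part of vcf2maf or VEP. The function name `strip_info_keys` and the rule "leave records with symbolic or breakend ALT alleles untouched" are assumptions made for illustration, and the sketch assumes a tab-delimited VCF.

```python
def strip_info_keys(line, keys=("SVTYPE",)):
    """Return a VCF line with the given INFO keys removed.

    Hypothetical helper (not from vcf2maf or VEP). Header lines, blank lines,
    and records with symbolic (<DEL>) or breakend ALT alleles are returned
    unchanged, on the assumption that those are genuine structural variants.
    """
    if line.startswith("#") or not line.strip():
        return line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return line  # not a complete VCF record; leave it alone
    alts = fields[4].split(",")
    if any(a.startswith("<") or "[" in a or "]" in a for a in alts):
        return line
    # Keep every INFO entry whose key (text before '=') is not in `keys`.
    kept = [kv for kv in fields[7].split(";") if kv.split("=", 1)[0] not in keys]
    fields[7] = ";".join(kept) or "."  # '.' is the VCF placeholder for empty INFO
    return "\t".join(fields) + newline


def strip_info_keys_in_file(src_path, dst_path, keys=("SVTYPE",)):
    """Copy src_path to dst_path, applying strip_info_keys to every line."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(strip_info_keys(line, keys))
```

Calling `strip_info_keys_in_file("1.vcf", "1.nosvtype.vcf")` and passing the new file to `--input-vcf` would apply this to the example above (the output file name is made up). Only SVTYPE is removed by default because that is the only key this thread identifies as changing VEP's behaviour; the `keys` parameter is there in case other SV-style keys need the same treatment, which has not been tested here.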