mskcc / vcf2maf

Convert a VCF into a MAF, where each variant is annotated to only one of all possible gene isoforms
Other
375 stars 218 forks source link

maf2vcf.pl is off by one for insertions #118

Closed raylim closed 7 years ago

raylim commented 7 years ago

For example, the following maf input gives the wrong vcf output

Hugo_Symbol     Entrez_Gene_Id  Center  NCBI_Build      Chromosome      Start_Position  End_Position    Strand  Variant_Classification  Variant_Type    Reference_Allele        Tumor_Seq_Allele1       Tumor_Seq_Allele2       dbSNP_RS     dbSNP_Val_Status Tumor_Sample_Barcode    Matched_Norm_Sample_Barcode     Match_Norm_Seq_Allele1  Match_Norm_Seq_Allele2  Tumor_Validation_Allele1        Tumor_Validation_Allele2        Match_Norm_Validation_Allele1   Match_Norm_Validation_Allele2 Verification_Status     Validation_Status       Mutation_Status Sequencing_Phase        Sequence_Source Validation_Method       Score   BAM_File        Sequencer       t_ref_count     t_alt_count     n_ref_count     n_alt_count  COMMENTS Consequence     HGVSc   HGVSp   HGVSp_Short     Transcript_ID   RefSeq  Protein_position        Codons  Hotspot ALLELE_NUM      PICK    UNIPARC n_depth MA:link.PDB     Feature CLIN_SIG        Gene    HGNC_ID ExAC_AF_AMR     t_depth       MA:FIS  cDNA_Change     DISTANCE        SYMBOL_SOURCE   amino_acid_change       Existing_variation      SYMBOL  ExAC_AF_SAS     VARIANT_CLASS   AA_MAF  HIGH_INF_POS    GENE_PHENO      ExAC_AF_AFR     ASN_MAF PHENO   BIOTYPE transcript    MA:link.var     AFR_MAF cdna_change     DOMAINS MOTIF_SCORE_CHANGE      Transcript      Amino_acids     EA_MAF  Allele  MA:FImpact      cDNA_position   ExAC_AF_NFE     MA:link.MSA     SIFT    INTRON  TREMBL  AMR_MAF EAS_MAF CANONICAL     IS_NEW  all_effects     ExAC_AF_EAS     MOTIF_NAME      GMAF    TSL     SOMATIC IMPACT  MOTIF_POS       CDS_position    Amino_Acid_Change       MA:protein.change       ExAC_AF_FIN     SWISSPROT       EUR_MAF Feature_type    HGVS_OFFSET   PolyPhen        FILTER  comments        ENSP    Comments        ExAC_AF CCDS    ExAC_AF_OTH     EXON    SAS_MAF Exon_Number     MINIMISED       PUBMED
INHA    0       MSKCC   GRCh37  2       220439701       220439702       +       Frame_Shift_Ins INS     -       -       CT                      MSK-VP-0041-IM-T        MSK-VP-0041-IM-N                                                     Unknown  SOMATIC                         MSK-IMPACT                      1021    721     551.0   0.0             frameshift_variant      ENST00000243786.2:c.554_555insCT        p.Leu186PhefsTer5       p.L186Ffs*5     ENST00000243786 NM_002191.3           gct/gcCTt       0.0     NEWRECORD

Resulting vcf entry:

##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=G,Type=Integer,Description="Allelic Depths of REF and ALT(s) in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  MSK-VP-0041-IM-T        MSK-VP-0041-IM-N
2       220439701       .       C       CCT     .       .       .       GT:AD:DP        0/1:1021,721:.  0/0:551,0:.

The vcf entry should be 2:220439700 G/GCT

ckandoth commented 7 years ago

Hi Ray. Position 2:220439701 in GRCh37 contains a C, and the MAF specifies that an additional CT has been inserted immediately after it. In VCF format, this can be represented as a C->CCT at position 220439701. Why do you think the VCF should say 2:220439700 G/GCT?

raylim commented 7 years ago

the maf says that positions 2:220439701-220439702 are CT. 2:220439700 is a G, hence 2:220439700 G/GCT

ckandoth commented 7 years ago

No, the MAF says that CT was inserted between positions 2:220439701 and 2:220439702.

ckandoth commented 7 years ago

Maybe the confusion is that 2:220439701-220439702 is also CT in the GRCh37 reference, but that's just a coincidence.

raylim commented 7 years ago

I've confirmed that the variant is G/GCT in a bam file for which it was called. I've attached the mpileup. The variant is actually at 2:220439700. mpileup.txt

raylim commented 7 years ago

igv_snapshot

ckandoth commented 7 years ago

Sounds good. Then the input MAF format is incorrect. Where did you get it from?

raylim commented 7 years ago

the maf is straight from msk-impact

ckandoth commented 7 years ago

The DMP pipeline for MSK-IMPACT doesn't generate MAFs - they create a flattened VCF-like file, which maf2maf and maf2vcf are able to operate on. Can you point me to the raw pipeline results? Somewhere on Luna?

raylim commented 7 years ago

on saba2: /home/limr/share/data/grail_cfdna_tidy/validation_prostate/rawdata/msk-impact/msk-impact/data_mutations_extended.txt

rhshah commented 7 years ago

my 2 cents, this file is generated by portal itself, so probably the conversion for the portal from dmp based vcf like format to maf format has issues.

ckandoth commented 7 years ago

Yup makes sense. I worked with @zheins in March to fix this - https://github.com/cBioPortal/genome-nexus-annotation-pipeline/pull/16 - you likely need to rerun the conversion.

zheins commented 7 years ago

@raylim @ckandoth The mskimpact maf has been corrected. The data was never refreshed after fixing the issue. I re-ran fetching/annotation on all samples with insertions. INHA 0 MSKCC GRCh37 2 220439700 220439701 + Frame_Shift_Ins INS - - CT

raylim commented 7 years ago

Thanks @zheins and @ckandoth for helping resolve this.