mskcc / vcf2maf

Convert a VCF into a MAF, where each variant is annotated to only one of all possible gene isoforms

Could you un-pad suffix bps for complex indels? #100

Closed kk988 closed 7 years ago

kk988 commented 7 years ago

I am looking into this complex indel: 5 112174757 . GAAGA G,GGA 3843.12 PASS AC=1,1;AF=0.031,0.031;AN=32;BaseQRankSum=0.216;Clippi...

In sample 1 the GT is 0/1 (AAGA deletion), and in sample 6 the GT is 0/2 (AA deletion). When run through vcf2maf.pl, sample 1 comes out correct, and sample 6 comes out mostly right but does not follow the MAF specification:

APC 324 . GRCh37 5 112174758 112174761 + Frame_Shift_Del DEL AAGA AAGA GA sample6

Technically this should be 112174758 to 112174759 (I think), with alleles AA AA -.

Not an urgent matter, but whenever you can get around to it!

Thanks, Krista

ckandoth commented 7 years ago

Here's a test VCF that should cover your example in line 2, and a few more related situations:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  TUMOR   NORMAL
5   112174757   .   GAAGA   G,GGA   .   .   .   GT:AD:DP    0/1:12,8,0:20   0/0:30,2,0:32
5   112174757   .   GAAGA   G,GGA   .   .   .   GT:AD:DP    0/2:12,0,8:20   0/0:30,0,2:32
5   112174757   .   GAAGA   G,GGA   .   .   .   GT:AD:DP    1/2:0,8,12:20   1/1:0,2,30:32
5   112174757   .   GAAGA   G,GGA   .   .   .   GT:AD:DP    0/1:12,8,0:20   0/2:30,0,2:32
5   112174757   .   GAAGA   G,GGA   .   .   .   GT:AD:DP    0/2:12,0,8:20   0/1:30,2,0:32

The genotypes of the last two lines are more likely to occur across patients than in a tumor vs normal situation, but we should still handle them. The current vcf2maf v1.6.11 will produce these values:

Chromosome  Start_Position  End_Position    Reference_Allele    Tumor_Seq_Allele1   Tumor_Seq_Allele2   Match_Norm_Seq_Allele1  Match_Norm_Seq_Allele2  t_depth t_ref_count t_alt_count n_depth n_ref_count n_alt_count
5   112174758   112174761   AAGA    AAGA    -   AAGA    AAGA    20  12  8   32  30  2
5   112174758   112174761   AAGA    AAGA    GA  AAGA    AAGA    20  12  8   32  30  2
5   112174758   112174761   AAGA    GA  -   -   -   20  0   8   32  0   2
5   112174758   112174761   AAGA    AAGA    -   AAGA    GA  20  12  8   32  30  0
5   112174758   112174761   AAGA    AAGA    GA  AAGA    -   20  12  8   32  30  0

All of these are acceptable, and not serious deviations from MAF specs. I'm pleasantly surprised that even the allele counts are working correctly, though they are really hard to cross-check against the VCF. MAF specs were not designed for anything beyond tumor vs normal reporting!

But as you explained, line 2 can be further normalized by un-padding the GA suffix, and decrementing the End_Position appropriately. vt normalize calls this "right trimming".

Left-trimming of all alleles in REF and ALT is already done by vcf2maf. I can add right-trimming of all alleles, but your example calls for a "conditional" right-trimming of only the alleles that will make it into the final MAF. And the reasons for this may depend not just on what's in GT.
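
To make the "conditional" idea concrete, here is a minimal Python sketch of trimming a single REF/ALT pair, applied only to the allele that ends up in the MAF. It is not the vcf2maf implementation: the function name `trim_alleles` is hypothetical, and the coordinate convention (an insertion reported by its two flanking bases) is an assumption about the MAF output.

```python
def trim_alleles(pos, ref, alt):
    """Right-trim, then left-trim, one REF/ALT pair (hypothetical helper).

    Returns (start, end, ref, alt), using "-" for an empty allele.
    """
    # Right-trim: drop matching trailing bases, keeping at least one base
    # in each allele so the left anchor base is not consumed here.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]

    # Left-trim: drop matching leading bases and shift the position forward.
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1

    if ref:
        # Deletion or substitution: span the remaining reference bases.
        start, end = pos, pos + len(ref) - 1
    else:
        # Pure insertion (assumed convention): report the flanking bases.
        start, end = pos - 1, pos
    return start, end, ref or "-", alt or "-"


# The GGA allele from the example: the GA suffix is un-padded.
print(trim_alleles(112174757, "GAAGA", "GGA"))  # (112174758, 112174759, 'AA', '-')
# The G allele from the example: no shared suffix beyond the anchor.
print(trim_alleles(112174757, "GAAGA", "G"))    # (112174758, 112174761, 'AAGA', '-')
```

Because each ALT is trimmed against REF on its own, the two alleles of one VCF row can end up with different End_Position values, which is part of why choosing the allele has to come first.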

I'll look at the code again with fresh eyes in a few weeks, before trying a solution, or deciding to give up. 😁 It's more complicated than I thought.

ckandoth commented 7 years ago

@kk988 Sorry for leaving this hanging for so long. I decided that this falls in the category of "partial support for multi-tumor VCFs". Ideally, a user must split multi-tumor VCFs into per-sample VCFs before running vcf2maf, as detailed here - https://www.biostars.org/p/108112/#108816
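
For illustration only, splitting can be sketched in a few lines of Python that keep the nine fixed VCF columns plus the chosen sample columns. This is not necessarily the approach described at the link above, the function name `keep_samples` and the sample names `S1` and `S6` are hypothetical, and an established VCF tool is usually the safer choice for real data.

```python
def keep_samples(vcf_lines, samples):
    """Yield VCF lines restricted to the named sample columns (hypothetical helper)."""
    keep = None
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # Meta-information lines pass through unchanged.
            yield line
            continue
        cols = line.split("\t")
        if line.startswith("#CHROM"):
            # Columns 0-8 are CHROM..FORMAT; the rest are sample columns.
            keep = list(range(9)) + [cols.index(s) for s in samples]
        if keep is None:
            raise ValueError("data line seen before the #CHROM header")
        yield "\t".join(cols[i] for i in keep)


lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS6",
    "5\t112174757\t.\tGAAGA\tG,GGA\t.\t.\t.\tGT\t0/1\t0/2",
]
for out in keep_samples(lines, ["S6"]):
    print(out)
```

Note that this sketch leaves the ALT column untouched, so alleles that the kept sample does not carry would still need to be dropped before conversion.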

In somatic VCFs, rows with multiple ALT alleles are more likely to come from heterozygosity or microsatellite instability within a tumor, and the code in vcf2maf that chooses which allele to report in the MAF operates under that assumption. So supporting multi-tumor VCFs will require an additional input file listing TN-pairs and a stricter reliance on GT:AD:DP from the genotype columns.
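
As a rough sketch of what relying on GT:AD:DP could look like, the Python below parses one genotype column and reports the read depth of each allele named in GT. The helper names `parse_sample` and `allele_depths` are hypothetical, and this is not how vcf2maf currently selects alleles.

```python
import re


def parse_sample(fmt, sample):
    """Parse one genotype column using its FORMAT string (hypothetical helper)."""
    fields = dict(zip(fmt.split(":"), sample.split(":")))
    # GT holds allele indexes separated by "/" (unphased) or "|" (phased).
    gt = [int(a) if a not in (".", "") else None
          for a in re.split(r"[/|]", fields.get("GT", "."))]
    # AD holds one read depth per allele, REF first, then each ALT in order.
    ad = [int(d) if d != "." else None
          for d in fields.get("AD", "").split(",") if d != ""]
    dp_raw = fields.get("DP", ".")
    dp = int(dp_raw) if dp_raw not in (".", "") else None
    return {"gt": gt, "ad": ad, "dp": dp}


def allele_depths(alleles, parsed):
    """Map each allele named in GT to its AD value, where available."""
    depths = {}
    for idx in parsed["gt"]:
        if idx is not None and idx < len(parsed["ad"]):
            depths[alleles[idx]] = parsed["ad"][idx]
    return depths


# TUMOR column from line 2 of the test VCF: REF is index 0, G is 1, GGA is 2.
alleles = ["GAAGA", "G", "GGA"]
tumor = parse_sample("GT:AD:DP", "0/2:12,0,8:20")
print(tumor)                          # {'gt': [0, 2], 'ad': [12, 0, 8], 'dp': 20}
print(allele_depths(alleles, tumor))  # {'GAAGA': 12, 'GGA': 8}
```

With a TN-pairs file, the same parsing would be applied to each listed tumor and normal column, instead of assuming one tumor and one normal per VCF.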

Such a redesign will also make it harder for basic users to understand and operate vcf2maf. Most users expect a single MAF line reported per line in the input VCF.