Closed kk988 closed 7 years ago
Here's a test VCF that should cover your example in line 2, and a few more related situations:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TUMOR NORMAL
5 112174757 . GAAGA G,GGA . . . GT:AD:DP 0/1:12,8,0:20 0/0:30,2,0:32
5 112174757 . GAAGA G,GGA . . . GT:AD:DP 0/2:12,0,8:20 0/0:30,0,2:32
5 112174757 . GAAGA G,GGA . . . GT:AD:DP 1/2:0,8,12:20 1/1:0,2,30:32
5 112174757 . GAAGA G,GGA . . . GT:AD:DP 0/1:12,8,0:20 0/2:30,0,2:32
5 112174757 . GAAGA G,GGA . . . GT:AD:DP 0/2:12,0,8:20 0/1:30,2,0:32
The genotypes in the last two lines are more likely to occur across patients than in a tumor vs normal situation, but we should still handle them. The current vcf2maf v1.6.11 will produce these values:
Chromosome Start_Position End_Position Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 t_depth t_ref_count t_alt_count n_depth n_ref_count n_alt_count
5 112174758 112174761 AAGA AAGA - AAGA AAGA 20 12 8 32 30 2
5 112174758 112174761 AAGA AAGA GA AAGA AAGA 20 12 8 32 30 2
5 112174758 112174761 AAGA GA - - - 20 0 8 32 0 2
5 112174758 112174761 AAGA AAGA - AAGA GA 20 12 8 32 30 0
5 112174758 112174761 AAGA AAGA GA AAGA - 20 12 8 32 30 0
All of these are acceptable, and not serious deviations from MAF specs. I'm pleasantly surprised that even the allele counts are working correctly, though they are really hard to cross-check against the VCF. MAF specs were not designed for anything beyond tumor vs normal reporting!
But as you explained, line 2 can be further normalized by un-padding the GA suffix, and decrementing the End_Position appropriately. vt normalize calls this "right trimming". Left-trimming of all alleles in REF and ALT is already done by vcf2maf. I can add right-trimming of all alleles, but your example calls for a "conditional" right-trimming of only the alleles that will make it into the final MAF. And the reasons for this may depend not just on what's in GT.
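To make the trimming concrete, here is a minimal Python sketch of left-trimming followed by right-trimming for a single REF/ALT pair. It is not vcf2maf's code (vcf2maf is written in Perl), and the function name `trim_alleles` is hypothetical. It assumes the MAF convention that an empty allele is written as "-" and that an insertion is reported with Start_Position and End_Position on the two flanking bases.

```python
def trim_alleles(pos, ref, alt):
    """Trim bases shared by REF and one ALT allele (hypothetical helper).

    pos is the 1-based VCF POS of the first REF base.
    Returns (start, end, ref, alt), with "-" standing in for an empty allele.
    """
    # Left-trim: drop shared leading bases and shift the position right.
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1

    # Right-trim: drop shared trailing bases; the start does not move,
    # but the end shrinks because it is derived from len(ref) below.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]

    if ref:
        # Substitution or deletion: the span covers the remaining REF bases.
        start, end = pos, pos + len(ref) - 1
    else:
        # Assumed MAF convention for insertions: the two flanking bases.
        start, end = pos - 1, pos

    return start, end, ref or "-", alt or "-"


# The two ALT alleles from the test VCF above.
print(trim_alleles(112174757, "GAAGA", "G"))    # (112174758, 112174761, 'AAGA', '-')
print(trim_alleles(112174757, "GAAGA", "GGA"))  # (112174758, 112174759, 'AA', '-')
```

Because this trims one ALT allele at a time, it is effectively the "conditional" variant: it only makes sense once the allele to be reported has already been chosen.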
I'll look at the code again with fresh eyes in a few weeks, before trying a solution, or deciding to give up. 😁 It's more complicated than I thought.
@kk988 Sorry for leaving this hanging for so long. I decided that this falls in the category of "partial support for multi-tumor VCFs". Ideally, users should split multi-tumor VCFs into per-sample VCFs before running vcf2maf, as detailed here: https://www.biostars.org/p/108112/#108816
In somatic VCFs, rows with multiple ALT alleles are more likely to come from heterozygosity or microsatellite instability within a tumor, and the code in vcf2maf that chooses which allele to report in the MAF operates under that assumption. So supporting multi-tumor VCFs will require an additional input file listing TN-pairs and a stricter reliance on GT:AD:DP from the genotype columns.
Such a redesign will also make it harder for basic users to understand and operate vcf2maf. Most users expect a single MAF line reported per line in the input VCF.
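As a rough illustration of what "stricter reliance on GT:AD:DP" could look like, the Python sketch below reads one genotype column, lists the ALT alleles named in GT, and pulls the matching depths out of AD. It is only a sketch under assumptions, not how vcf2maf currently picks alleles, and the helper names `called_alts` and `allele_counts` are hypothetical.

```python
def parse_sample(fmt, sample):
    """Pair FORMAT keys with one sample's values, e.g. GT:AD:DP with 0/2:12,0,8:20."""
    return dict(zip(fmt.split(":"), sample.split(":")))


def called_alts(fmt, sample):
    """Return the sorted ALT allele indexes (1-based) named in the sample's GT."""
    gt = parse_sample(fmt, sample)["GT"].replace("|", "/").split("/")
    return sorted({int(a) for a in gt if a not in (".", "0")})


def allele_counts(fmt, sample, alt_index):
    """Return (depth, ref_count, alt_count) for one ALT allele of one sample."""
    fields = parse_sample(fmt, sample)
    ad = [int(x) for x in fields["AD"].split(",")]  # AD[0] is REF, AD[i] is ALT i
    return int(fields["DP"]), ad[0], ad[alt_index]


# Second line of the test VCF: TUMOR is 0/2, NORMAL is 0/0.
fmt = "GT:AD:DP"
tumor, normal = "0/2:12,0,8:20", "0/0:30,0,2:32"

alt = called_alts(fmt, tumor)[0]        # 2, i.e. the GGA allele
print(allele_counts(fmt, tumor, alt))   # (20, 12, 8)
print(allele_counts(fmt, normal, alt))  # (32, 30, 2)
```

These counts agree with the t_depth, t_ref_count, t_alt_count and n_depth, n_ref_count, n_alt_count that vcf2maf reported for that line. A sample whose GT names two different ALT alleles (such as 1/2) would still need a rule for which one to report, which this sketch deliberately leaves open.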
I am looking into this on complex indel: 5 112174757 . GAAGA G,GGA 3843.12 PASS AC=1,1;AF=0.031,0.031;AN=32;BaseQRankSum=0.216;Clippi...
In sample 1 the GT is 0/1 (AAGA deletion) and in sample 6 the GT is 0/2 (AA deletion). When I put it through vcf2maf.pl, sample 1 comes out correct, and sample 6 comes out mostly right, but not according to the MAF specification.
APC 324 . GRCh37 5 112174758 112174761 + Frame_Shift_Del DEL AAGA AAGA GA sample6
Technically this should be 112174758 to 112174759 (I think), with alleles AA AA -.
Not an urgent matter, but whenever you can get around to it!
Thanks, Krista