Per discussion between @md and me and feedback from @cband:
Currently, the final somatic MAF contains 279 columns. Not all of these are necessary, and some could be omitted or collapsed into single columns to reduce file size and make it easier to find what's important. These changes can be made inside the pipeline (https://github.com/mskcc/vaporware/blob/develop/containers/vcf2maf/filter-somatic-maf.R and the corresponding germline filter script). Some columns that VEP/vcf2maf output by default are pretty much useless.
Here are some suggested changes:
Evaluate which default columns carry no meaning; see, for example, Sequence_Source, Validation_Method, Score, Tumor_Sample_UUID, Matched_Norm_Sample_UUID.
Prune the gnomAD/ExAC columns added by VEP/vcf2maf since we're doing this annotation ourselves in the pipeline.
Following on from the above, columns 141-173 in the current iteration are all columns from the pipeline annotation with gnomAD allele frequencies and counts. This annotation already has columns for the individual subpopulation maximum (non_cancer_AF_popmax/non_cancer_AC_popmax) as well as the overall non-cancer population (non_cancer_AF/non_cancer_AC).
Similarly, the raw counts could be collapsed into single comma-, colon-, or semicolon-separated columns.
Facets clonality annotation can be collapsed into fewer columns.
Possibly true for the neoantigen prediction annotation too, although I'm not too familiar with it.
Some columns that are added by the hotspot annotation can be removed.
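To make the suggestions above concrete, here is a minimal sketch of dropping and collapsing columns in a tab-separated MAF. It is written in Python for illustration only and is not the pipeline's filter script (which is in R). The function name `slim_maf`, the collapsed column name `non_cancer_AC_by_pop`, and the two subpopulation column names are assumptions chosen for the example; the dropped columns are the ones named in this issue. It also assumes the file starts directly with the header row (no leading `#version` comment line).

```python
import csv
import io

# Default columns named in this issue as carrying no meaning.
DROP_COLUMNS = {
    "Sequence_Source",
    "Validation_Method",
    "Score",
    "Tumor_Sample_UUID",
    "Matched_Norm_Sample_UUID",
}

# Hypothetical grouping: new column name -> source columns to collapse.
# The source column names here are placeholders, not a confirmed list.
COLLAPSE = {
    "non_cancer_AC_by_pop": ["non_cancer_AC_afr", "non_cancer_AC_nfe"],
}


def slim_maf(in_handle, out_handle, drop=DROP_COLUMNS, collapse=COLLAPSE):
    """Copy a MAF, dropping `drop` columns and collapsing `collapse` groups."""
    reader = csv.DictReader(in_handle, delimiter="\t")
    collapsed_sources = {c for cols in collapse.values() for c in cols}
    # Keep the original column order, minus dropped and collapsed columns.
    kept = [
        c for c in reader.fieldnames
        if c not in drop and c not in collapsed_sources
    ]
    writer = csv.DictWriter(
        out_handle,
        fieldnames=kept + list(collapse),
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",  # silently skip the columns we no longer keep
    )
    writer.writeheader()
    for row in reader:
        for new_col, sources in collapse.items():
            # Store "name=value" pairs separated by semicolons in one column.
            row[new_col] = ";".join(
                f"{s}={row[s]}" for s in sources if s in row
            )
        writer.writerow(row)


# Usage with a tiny in-memory example.
example = (
    "Hugo_Symbol\tScore\tnon_cancer_AC_afr\tnon_cancer_AC_nfe\n"
    "TP53\t.\t1\t2\n"
)
result = io.StringIO()
slim_maf(io.StringIO(example), result)
print(result.getvalue())
```

Keeping the `name=value` pairs inside the collapsed column means the individual counts stay recoverable, at the cost of needing a split step when someone wants to filter on them.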
Keep in mind:
The "official" MAF file spec sheet (https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format) changed after the MC3 initiative. It's not necessary, in my opinion, to keep all of the columns it lists. There is no single valid MAF format.
The MAF files look different in the somatic and germline settings (hotspot and OncoKB annotation, plus the gnomAD population annotation has a slightly different meaning in this context). Not all of my suggestions above are equally applicable to both.
Similarly, they look different for exomes vs. genomes (only in the gnomAD annotation).