Closed jjc2718 closed 3 years ago
I'm not really sure what the mutation data looks like, so I don't know how to interpret the results. Is it an array of "is this gene mutated in this cancer type", or a VCF, or a list of common mutations and their rates, or something else?
The features are just a binary matrix of "is this gene mutated in this sample". These come from a MAF file generated by the TCGA project. This script here is pretty close to what I've been using to generate the labels.
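For anyone following along, here is a minimal sketch of how a samples-by-genes binary matrix could be derived from a MAF-style table. This is not the script linked above, just an illustration under a few assumptions: the column names `Hugo_Symbol`, `Tumor_Sample_Barcode`, and `Variant_Classification` follow the standard MAF layout, the function name `maf_to_binary_matrix` and the toy data are hypothetical, and dropping silent variants is one possible filtering choice rather than necessarily the one used here.

```python
import pandas as pd


def maf_to_binary_matrix(maf, samples=None, drop_silent=True):
    """Return a samples x genes 0/1 matrix: 1 if the gene has any mutation in the sample.

    Hypothetical helper for illustration; the filtering rule is an assumption.
    """
    if drop_silent:
        # Assumption: silent (synonymous) variants do not count as "mutated".
        maf = maf[maf["Variant_Classification"] != "Silent"]
    # Count mutations per (sample, gene) pair, then collapse counts to presence/absence.
    counts = pd.crosstab(maf["Tumor_Sample_Barcode"], maf["Hugo_Symbol"])
    matrix = (counts > 0).astype(int)
    if samples is not None:
        # Samples with no remaining mutations would otherwise be missing; add them as all-zero rows.
        matrix = matrix.reindex(samples, fill_value=0)
    return matrix


# Toy MAF-like table (made-up values, not TCGA data).
toy_maf = pd.DataFrame(
    {
        "Hugo_Symbol": ["TP53", "TP53", "IDH1", "TP53"],
        "Tumor_Sample_Barcode": ["S1", "S1", "S2", "S3"],
        "Variant_Classification": [
            "Missense_Mutation",
            "Nonsense_Mutation",
            "Missense_Mutation",
            "Silent",
        ],
    }
)

mutation_matrix = maf_to_binary_matrix(toy_maf, samples=["S1", "S2", "S3"])
print(mutation_matrix)
```

Note that a sample with two hits in the same gene still gets a single 1, and a sample whose only variant is filtered out ends up as an all-zero row instead of disappearing from the matrix.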
Like I'd expect knowing whether e.g. TP53 is mutated to be a very strong signal for survival but maybe gene expression/methylation has all that information?
I'm not actually sure about TP53. Its mutation status is definitely a very strong signal for whether a patient has cancer or not, but conditioned on having cancer (which all of these samples do), I don't know how it affects survival.
There are some genes where the relationship is clearer, e.g. IDH mutation in gliomas, which is generally a pretty strong positive prognostic predictor, or ERBB2 mutation in breast cancer, which is a strong positive prognostic predictor because there are effective targeted therapies for it.
I do think a lot of the information is redundant, though, and that seems to be supported by these results (that the -omics types are generally as good or better than knowing the actual mutation status, at least for pan-cancer survival prediction).
We wanted to try using the true somatic mutation data for each sample as predictive features for survival, in order to compare with the predictive ability of the other -omics types (just expression and methylation for now).
For the pan-cancer dataset, this seems to perform pretty comparably to the baseline (cancer type, age, mutation burden), and worse than expression and methylation:
For some individual cancer types, though, the mutation data seems to work a bit better (e.g. LGG, LAML, ACC):