Add mutation data as features/predictors for survival

I'm not really sure what the mutation data looks like, so I don't know how to interpret the results. Is it an array of "is this gene mutated in this cancer type", a VCF, a list of common mutations and their rates, or something else?

The features are just a binary matrix of "is this gene mutated in this sample". These come from a MAF file generated by the TCGA project. This script here is pretty close to what I've been using to generate the labels.
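To make the shape of the features concrete, here is a minimal sketch of turning a MAF-style table into a binary sample-by-gene matrix. It is not the script linked above. The column names follow the standard MAF format, but the toy data, the `maf_to_binary_matrix` helper, and the choice to drop silent mutations are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for a MAF table (hypothetical data); real MAF files
# have one row per called mutation and many more columns.
maf = pd.DataFrame({
    "Hugo_Symbol": ["TP53", "TP53", "IDH1", "ERBB2", "TP53"],
    "Tumor_Sample_Barcode": ["S1", "S1", "S2", "S3", "S3"],
    "Variant_Classification": [
        "Missense_Mutation", "Nonsense_Mutation",
        "Missense_Mutation", "Silent", "Missense_Mutation",
    ],
})


def maf_to_binary_matrix(maf_df, drop_silent=True):
    """Return a samples x genes matrix of 0/1 mutation indicators.

    Hypothetical helper: dropping silent mutations is an assumption,
    not necessarily the filtering used in this PR.
    """
    if drop_silent:
        maf_df = maf_df[maf_df["Variant_Classification"] != "Silent"]
    # Count mutations per (sample, gene) pair, then collapse to 0/1.
    counts = pd.crosstab(
        maf_df["Tumor_Sample_Barcode"], maf_df["Hugo_Symbol"]
    )
    return (counts > 0).astype(int)


mutation_matrix = maf_to_binary_matrix(maf)
print(mutation_matrix)
```

Note that a sample with no remaining mutations after filtering would not appear in the matrix at all, so in practice the result would likely need to be reindexed against the full sample list and filled with zeros.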

I'd expect knowing whether e.g. TP53 is mutated to be a very strong signal for survival, but maybe gene expression/methylation already captures all of that information?

I'm not sure about TP53. Its mutation status is definitely a very strong signal for whether a patient has cancer or not, but conditioned on having cancer (which all of these samples do), I don't know how it affects survival.

There are some genes where the relationship is clearer, e.g. IDH mutation in gliomas, which is generally a pretty strong positive prognostic predictor, or ERBB2 mutation in breast cancer, which is a strong positive prognostic predictor because there are effective targeted therapies for it.

I do think a lot of the information is redundant, though, and that seems to be supported by these results: the -omics types are generally as good as or better than knowing the actual mutation status, at least for pan-cancer survival prediction.

greenelab / mpmp

Add mutation data as features/predictors for survival #69