Comparing -omics types for survival prediction, part 2

jjc2718 commented 3 years ago

The problem description and previous changes are in #59. Sorry, this is a fairly large PR; there's no rush to get it reviewed, so take your time.

Main code changes:

- I changed how we extract PCA features for the various data types; the main changes are in mpmp/utilities/data_utilities.py and mpmp/utilities/tcga_utilities.py. The code now checks whether PCA features have already been generated for a given data type and, if not, calculates them and saves them to a .tsv file.
- Code to generate survival curves, in mpmp/prediction/cross_validation.py and mpmp/prediction/survival.py.
- Notebooks to visualize survival curves (06_predict_survival/plot_survival_curves.ipynb) and to compare PCA dimensions (06_predict_survival/plot_survival_pc_comparison.ipynb).
- Updated data download scripts in 00_download_data to use the same PCA-compressed features as the classification scripts; these probably don't need to be reviewed extensively.
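
The check-then-compute behavior described in the first item can be sketched roughly as below. This is a minimal illustration, not the code in mpmp/utilities: `load_or_compute_features`, the cache file name, and the `compute_fn` callable are all hypothetical, and the PCA step itself is stubbed out with toy values rather than computed.

```python
import csv
import tempfile
from pathlib import Path


def load_or_compute_features(cache_path, compute_fn):
    """Load features from a .tsv cache, or compute and save them if missing.

    compute_fn is a hypothetical zero-argument callable returning a non-empty
    dict of sample ID -> list of float feature values (e.g. PCA scores).
    Returns (features, was_cached).
    """
    cache_path = Path(cache_path)
    if cache_path.is_file():
        # Cache hit: read the previously saved features back in.
        with cache_path.open(newline="") as f:
            reader = csv.reader(f, delimiter="\t")
            next(reader)  # skip the header row
            features = {row[0]: [float(v) for v in row[1:]] for row in reader}
        return features, True

    # Cache miss: compute the features, then save them for next time.
    features = compute_fn()
    n_features = len(next(iter(features.values())))
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["sample_id"] + [f"PC{i + 1}" for i in range(n_features)])
        for sample_id, values in features.items():
            writer.writerow([sample_id] + list(values))
    return features, False


def toy_features():
    # Stand-in for the real PCA computation (assumption: two PCs per sample).
    return {"sample_a": [0.5, -1.25], "sample_b": [2.0, 0.75]}


with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "expression_pc2.tsv"
    first, first_cached = load_or_compute_features(path, toy_features)
    second, second_cached = load_or_compute_features(path, toy_features)
    print(first_cached, second_cached)
```

The first call computes and writes the file; the second call finds the file and reads it instead. In practice the cache path would presumably also encode the data type and the number of components, so that different settings don't collide.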

PR description:

This should wrap up the changes for survival prediction, which we are planning on including in our paper (manuscript edits to come). Across data types, we see better model stability and more reasonable results when we use the top principal components as features, and we decided to use PCA-derived features for all data types to make per-feature information content comparable.

Results:

We decided to stick with PCA-compressed features for the paper. We show results for several different numbers of principal components, with age and mutation burden included as covariates in all of the models. In general, we see that expression and methylation are fairly comparable for pan-cancer survival prediction (similar to our mutation prediction results). We also see that RPPA data performs fairly well, while miRNA and mutational signatures perform relatively poorly.
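
As a rough illustration of that feature setup (the first few principal components plus age and mutation burden, repeated for several PC counts), here is a small sketch. The function name `build_feature_rows`, the dict-based data layout, and the toy values are assumptions for illustration only, not how the repository represents its data.

```python
def build_feature_rows(pc_scores, covariates, n_pcs):
    """Combine the first n_pcs principal components with fixed covariates.

    pc_scores: dict of sample ID -> list of PC scores, ordered PC1, PC2, ...
    covariates: dict of sample ID -> (age, mutation_burden)
    Only samples present in both inputs are kept.
    """
    rows = {}
    for sample_id in sorted(pc_scores.keys() & covariates.keys()):
        scores = pc_scores[sample_id]
        if len(scores) < n_pcs:
            raise ValueError(f"{sample_id} has only {len(scores)} PCs, need {n_pcs}")
        age, mutation_burden = covariates[sample_id]
        # Covariates are appended after the PCs, so every model sees them.
        rows[sample_id] = list(scores[:n_pcs]) + [age, mutation_burden]
    return rows


# Toy values (hypothetical): three samples with three PCs each.
pc_scores = {"s1": [0.9, -0.2, 0.1], "s2": [-1.1, 0.4, 0.0], "s3": [0.3, 0.3, -0.5]}
covariates = {"s1": (61, 1.8), "s2": (47, 0.6)}  # s3 has no covariates, so it is dropped

for n_pcs in (1, 2, 3):
    rows = build_feature_rows(pc_scores, covariates, n_pcs)
    print(n_pcs, rows)
```

Because the covariates are the same in every run, any change in performance across PC counts comes from the -omics features alone.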

We also looked at some survival curves, as suggested by @nrosed in #59, in the plot_survival_curves.ipynb notebook. For some cancer types (BRCA, LGG) predicted survival trended closely with age and subtype, and for others (LUAD) there wasn't much of a trend. You can see a few examples in the first few slides of this presentation: https://docs.google.com/presentation/d/1Kh6QUTadjk9RSQDXIIvfHF50a9cZb8uYmmqgwfeaUIg/edit?usp=sharing

nrosed commented 3 years ago

Looks good! It's probably not necessary to fix since it's only a sanity check, but I find the QC survival plots a bit difficult to interpret with one line per sample. It may be helpful to plot the survival curves of the learned models with confidence intervals instead of one curve per sample; your survival plots would then have the same number of lines as the number of conditions (age quantile, stage, etc.). It is also helpful to include the calculated hazard ratio and a p-value, and a table like the one shown below helps the reader check that the numbers per subgroup are reasonable and that the p-value reflects a true effect.

[Image: Screen Shot 2021-09-07 at 11 10 01]
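
For reference, a per-group version of that suggestion could look something like the sketch below: one Kaplan-Meier curve per condition with Greenwood-based confidence intervals, a two-group log-rank p-value, and a small per-subgroup count table. It uses only the standard library and toy data; the function names and the groups are made up for illustration, and nothing here is taken from the notebooks in this PR. It does not compute a hazard ratio, which would need a Cox proportional hazards fit (for example with the lifelines library).

```python
import math
from collections import Counter


def kaplan_meier(times, events, z=1.96):
    """Kaplan-Meier estimate with Greenwood-based 95% confidence intervals.

    times: follow-up times; events: 1 if death was observed, 0 if censored.
    Returns a list of (time, survival, lower, upper), one per event time.
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)  # each sample leaves the risk set at its own time
    at_risk = len(times)
    survival, greenwood_sum, curve = 1.0, 0.0, []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            survival *= 1 - d / at_risk
            if at_risk > d:
                greenwood_sum += d / (at_risk * (at_risk - d))
            se = survival * math.sqrt(greenwood_sum)
            curve.append((t, survival, max(0.0, survival - z * se),
                          min(1.0, survival + z * se)))
        at_risk -= exits[t]
    return curve


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    event_times = sorted({t for t, e in zip(all_times, all_events) if e})
    observed_minus_expected, variance = 0.0, 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)
        n_b = sum(1 for x in times_b if x >= t)
        d_a = sum(1 for x, e in zip(times_a, events_a) if e and x == t)
        d_b = sum(1 for x, e in zip(times_b, events_b) if e and x == t)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = observed_minus_expected ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square tail, 1 degree of freedom
    return chi2, p_value


# Toy data (hypothetical): two age groups, times in months.
groups = {
    "younger": ([6, 7, 10, 15, 19, 25], [1, 0, 1, 1, 0, 1]),
    "older": ([1, 2, 3, 5, 8, 9], [1, 1, 1, 0, 1, 1]),
}

# Per-subgroup table: sample count and number of observed events.
for name, (times, events) in groups.items():
    print(f"{name}: n={len(times)}, events={sum(events)}")
    for t, s, lo, hi in kaplan_meier(times, events):
        print(f"  t={t:>2}  S(t)={s:.2f}  95% CI=({lo:.2f}, {hi:.2f})")

chi2, p_value = logrank_test(*groups["younger"], *groups["older"])
print(f"log-rank chi2={chi2:.2f}, p={p_value:.3f}")
```

The plain Greenwood interval is clipped to [0, 1] here for simplicity; survival libraries typically use a log-log transform instead, which behaves better near 0 and 1.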

Also, is LGG survival well predicted because the subtypes are so easily captured in each of the -omics data types?

jjc2718 commented 3 years ago

> Looks good! It's probably not necessary to fix since it's only sanity checking, but I find the QC-survival plots a bit difficult to interpret with one line per sample. It may be helpful to plot the survival curves of the learned models with the confidence intervals instead of each sample. Then for your survival plots, you will have the same number of lines as the number of conditions (age quantile, stage, etc.). It is also helpful to include the calculated hazards ratio and a p-value. It's also nice to have tables like the one shown below to help the reader make sure the numbers per subgroup are reasonable and the p-value is reflective of a true effect.

Thanks! This is super helpful - we're currently not planning to use the survival curves in the paper, so while I think a figure like this would be nice, in the interest of getting the paper out soon I don't think I want to spend too much time on it now.

It is totally possible that a reviewer will ask us to show more detailed survival curves, though, so it's good to have this example in case we need to break things down more quantitatively in an actual paper figure.

> Also, is the reason why LGG survival is well predicted because the subtypes are so easily captured in each of the -omic data types?

That would be my guess. In the mutation prediction experiments we saw that predicting IDH1 status was very easy, particularly in glioma (AUROC values close to 1), so I suspect the subtypes are captured well in the -omics data. As you can see in the slides I linked, the IDH1 status-based subtypes are also very closely associated with survival.

greenelab / mpmp

Comparing -omics types for survival prediction, part 2 #62