Response to @nrosed comments:
- Looks good! Few small comments:
- Have you tried using cancer subtype/stage as a confounder? Survival is highly affected by stage at which the patient was diagnosed.
Not yet - our collaborator suggested including age as a covariate (along with our usual mutation burden and cancer type in the pan-cancer analysis), but including stage at diagnosis would also be a good idea.
- Not sure if you already inspected these, but for some of the models (high and low performing) it's sometimes useful to plot the survival curves, then color by expression of the top markers + other confounders? Just to make sure it's working and to get a sense of the variation across patients. It's a bit tricky when you have so many variables in the survival model, but it may also help to plot the expression of the top markers comparing between the patients in the longest and shortest surviving groups.
This is a good idea! I'll try to plot a few survival curves with relevant covariates/markers in my next PR as a sanity check (a rough sketch of one way to do this is included after these responses). Looking at the coefficients, we definitely see that age has a strong negative correlation with survival in just about every cancer type, so that seems to make sense and would be easy to visualize in a plot.
- If you are having sensitivity to hyperparameters, you could try running your model on a very homogeneous subset of data, luminalA/luminalB BRCA is both large and more or less uniform. My thinking is that in this subset of samples you won't get small subgroups driven by a noisy set of genes or driven by confounders.
Good to know! I'll give this a try if we continue to see convergence/sensitivity issues.
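As a rough illustration of the survival-curve sanity check mentioned above, something like the following could work, using scikit-survival's Kaplan-Meier estimator to stratify patients by expression of a single top marker. The dataframe and column names (`survival_df`, `event`, `time`) are placeholders for illustration, not code from this PR:

```python
# Hypothetical sanity-check plot: Kaplan-Meier curves for samples above vs. below
# the median expression of one marker gene. Assumes `survival_df` has a boolean
# `event` column (True = death observed), a numeric `time` column, and one column
# per marker gene; none of these names come from the actual PR code.
import matplotlib.pyplot as plt
import pandas as pd
from sksurv.nonparametric import kaplan_meier_estimator


def plot_km_by_marker(survival_df: pd.DataFrame, marker: str):
    high_expr = survival_df[marker] > survival_df[marker].median()
    for label, mask in [("high", high_expr), ("low", ~high_expr)]:
        time, surv_prob = kaplan_meier_estimator(
            survival_df.loc[mask, "event"].values,
            survival_df.loc[mask, "time"].values,
        )
        # Step function: estimated probability of surviving past each time point
        plt.step(time, surv_prob, where="post", label=f"{marker} {label} expression")
    plt.xlabel("time")
    plt.ylabel("estimated survival probability")
    plt.legend()
    plt.show()
```

The same pattern would work for covariates like age (e.g. splitting patients above/below the median age), which would visualize the strong age effect mentioned above.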
Response to @ben-heil's general comment:
- Looks good! I assume you don't have standard clinical data for the patients like age/sex that could be used to predict survival to see if 'omics data goes beyond that?
We do have some clinical data, and we're including age as a covariate on the recommendation of a collaborator. We haven't been including sex, but that may also be a good idea to try.
We did consider comparing to a clinical-features-only baseline (which has proven surprisingly hard to beat in other survival prediction work, e.g. Figure 2 in this paper), but the focus of this work isn't so much to build the best possible survival predictor as to compare the predictive content of the different -omics types against one another for this problem. Many other studies have benchmarked -omics integration and clinical feature extraction methods for survival prediction, so we don't feel the need to get into that kind of comparison here.
PR description:
As we've been working toward preprinting/submitting our paper, one idea a collaborator brought up was that it could strengthen our arguments to look at more than just mutation prediction. A common problem of interest in cancer genomics/multi-omics is predicting survival duration (sometimes referred to as prognosis prediction).
This is a bit more complicated than the standard classification and regression problems we've been looking at, due to the presence of "censored" data. In other words, when a patient has a death recorded you know how to label the sample, but when there's no death recorded, there's often no way to know whether the patient is still alive or whether they withdrew from the study or lost touch with their doctors. There's a nice explanation of the challenges and modeling strategies in the documentation for scikit-survival, the Python package we used.
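For concreteness, here's a minimal sketch (with made-up toy values) of how censored labels end up represented for scikit-survival: each sample gets a boolean event indicator plus a time of death or last follow-up, packed into a structured array.

```python
# Toy example of censored survival labels; the values are illustrative only.
import numpy as np
from sksurv.util import Surv

# Three patients: one observed death at 120 days, two censored at their last
# recorded follow-up (still alive as far as we know, or lost to follow-up).
event_observed = np.array([True, False, False])
follow_up_days = np.array([120.0, 800.0, 365.0])

# scikit-survival expects labels as a structured array with these two fields.
y = Surv.from_arrays(event=event_observed, time=follow_up_days)
print(y.dtype.names)  # ('event', 'time')
```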
In general, we used the same basic approach as for mutation prediction (comparing individual -omics types), but with elastic net Cox regression instead of logistic regression.
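Roughly, the modeling step looks like the sketch below, using scikit-survival's `CoxnetSurvivalAnalysis` (its elastic net penalized Cox model). The synthetic data and hyperparameter values are placeholders for illustration, not the settings used in this PR:

```python
# Minimal sketch of elastic net Cox regression with scikit-survival, on synthetic
# data. In the real analysis X would hold -omics features plus covariates (age,
# mutation burden, cancer type) and y the survival labels described above.
import numpy as np
from sksurv.linear_model import CoxnetSurvivalAnalysis
from sksurv.util import Surv

rng = np.random.default_rng(42)
n_samples, n_features = 200, 50
X = rng.standard_normal((n_samples, n_features))

# Survival times loosely driven by the first feature; ~30% of samples censored.
time = np.exp(1.0 - 0.5 * X[:, 0] + 0.2 * rng.standard_normal(n_samples))
event = rng.random(n_samples) > 0.3
y = Surv.from_arrays(event=event, time=time)

# l1_ratio controls the elastic net mix of L1 (sparsity) and L2 penalties.
model = CoxnetSurvivalAnalysis(l1_ratio=0.5)
model.fit(X, y)

# score() reports Harrell's concordance index (ideally on held-out data).
print(model.score(X, y))
```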
Results:
So far, we've only run this with the top 1000 features for the expression and methylation data types. We're planning to run this for all the data types using both raw and PCA compressed features; those results will come in the next PR, hopefully early next week.
In general, gene expression seems to outperform the methylation data types slightly, but predictive ability is fairly comparable between data types. (Red square = significantly outperforms permuted-labels baseline + statistically equivalent to best-performing data type)
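For reference, one way to set up the permuted-labels baseline mentioned in the legend is sketched below: shuffle the survival labels so the feature/label pairing is broken, refit the same kind of model, and check that the held-out concordance index drops to roughly 0.5 (chance). The cross-validation setup and statistical tests in the actual pipeline may differ; the names here are placeholders:

```python
# Hypothetical permuted-labels baseline: break the feature/label pairing by
# shuffling the structured (event, time) labels, refit, and score held-out data.
# Not the actual pipeline code; names and settings are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sksurv.linear_model import CoxnetSurvivalAnalysis


def permuted_label_score(X, y, seed=0):
    rng = np.random.default_rng(seed)
    y_perm = y[rng.permutation(len(y))]  # shuffled survival labels
    X_train, X_test, y_train, y_test = train_test_split(
        X, y_perm, test_size=0.3, random_state=seed
    )
    model = CoxnetSurvivalAnalysis(l1_ratio=0.5)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)  # concordance index, ~0.5 expected
```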
Main code changes:
- `06_predict_survival/run_survival_prediction.py`: script to run experiments
- `06_predict_survival/plot_survival_results.ipynb`: script to plot results
- `mpmp/prediction/survival.py`: main modeling/prediction code
- `load_survival_labels` in `mpmp/utilities/data_utilities.py`: main data loading/preprocessing code