Closed jjc2718 closed 3 years ago
Looks good! It's probably not necessary to fix since it's only sanity checking, but I find the QC-survival plots a bit difficult to interpret with one line per sample. It may be helpful to plot the survival curves of the learned models with confidence intervals instead of one curve per sample. Then your survival plots will have the same number of lines as the number of conditions (age quantile, stage, etc.). It is also helpful to include the calculated hazard ratio and a p-value. It's also nice to have tables like the one shown below to help the reader make sure the numbers per subgroup are reasonable and the p-value is reflective of a true effect.
Also, is the reason why LGG survival is well predicted because the subtypes are so easily captured in each of the -omic data types?
Thanks! This is super helpful - we're currently not planning to use the survival curves in the paper, so while I think a figure like this would be nice, in the interest of getting the paper out soon I don't think I want to spend too much time on it now.
It is totally possible that a reviewer will ask us to show more detailed survival curves, though, so it's good to have this example in case we need to break things down more quantitatively in an actual paper figure.
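For reference, here is a minimal sketch of the kind of per-group summary suggested above: one curve per condition, a number-at-risk column, and a p-value. It uses a plain Kaplan-Meier estimate and a two-group log-rank test rather than the learned models' predicted curves, and it does not compute confidence intervals or a hazard ratio (those would typically come from a fitted Cox model). The function names, group labels, and data values are illustrative assumptions, not code from this repo.

```python
import math
from collections import Counter


def kaplan_meier(times, events):
    """Return [(time, n_at_risk, n_events, survival)] at each distinct event time.

    times: follow-up durations; events: 1 if the event was observed, 0 if censored.
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    curve, surv = [], 1.0
    for t in sorted(deaths):
        at_risk = sum(1 for u in times if u >= t)
        surv *= 1.0 - deaths[t] / at_risk
        curve.append((t, at_risk, deaths[t], surv))
    return curve


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi2, p_value) with 1 degree of freedom."""
    pooled = list(zip(times_a, events_a)) + list(zip(times_b, events_b))
    event_times = sorted({t for t, e in pooled if e})
    obs_minus_exp, var = 0.0, 0.0
    for t in event_times:
        n_a = sum(1 for u in times_a if u >= t)
        n_b = sum(1 for u in times_b if u >= t)
        d_a = sum(1 for u, e in zip(times_a, events_a) if e and u == t)
        d_b = sum(1 for u, e in zip(times_b, events_b) if e and u == t)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n  # observed minus expected events in group A
        if n > 1:
            var += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if var == 0.0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / var
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


# Toy follow-up times (months) and event flags; illustrative values only.
groups = {
    "age_low": ([6, 7, 10, 15, 19, 25], [1, 0, 1, 1, 0, 1]),
    "age_high": ([1, 2, 3, 4, 5, 8], [1, 1, 1, 0, 1, 1]),
}
for name, (t, e) in groups.items():
    print(name)
    for time, at_risk, n_events, surv in kaplan_meier(t, e):
        print(f"  t={time:>2}  at_risk={at_risk}  events={n_events}  S(t)={surv:.3f}")

chi2, p = logrank_test(*groups["age_low"], *groups["age_high"])
print(f"log-rank chi2={chi2:.2f}, p={p:.4f}")
```

The at-risk column plays the role of the subgroup table: it makes it easy to spot groups that are too small for the p-value to mean much.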
Also, is the reason why LGG survival is well predicted because the subtypes are so easily captured in each of the -omic data types?
That would be my guess. In the mutation prediction experiments we saw that predicting IDH1 status was very easy, particularly in glioma (AUROC values close to 1), so the subtypes are probably captured well in the -omics data. As you can see in the slides I linked, the IDH1 status-based subtypes are also very closely associated with survival.
Problem description and previous changes in #59. Sorry this is a fairly large PR, no rush to get it reviewed so take your time.
Main code changes:
- mpmp/utilities/data_utilities.py and mpmp/utilities/tcga_utilities.py. Basically, this checks if PCA features have already been generated for a given data type, and if not we calculate and save them to a .tsv file.
- mpmp/prediction/cross_validation.py and mpmp/prediction/survival.py.
- Notebooks to plot survival curves (06_predict_survival/plot_survival_curves.ipynb) and to compare PCA dimension (06_predict_survival/plot_survival_pc_comparison.ipynb)
- Changes in 00_download_data to use the same PCA compressed features as the classification scripts; these probably don't need to be reviewed extensively

PR description:
This should wrap up the changes for survival prediction, which we are planning on including in our paper (manuscript edits to come). Across data types, we see better model stability and more reasonable results when we use the top principal components as features, and we decided to use PCA-derived features for all data types to make per-feature information content comparable.
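The caching step listed under the main code changes (check whether PCA features already exist for a data type, otherwise compute and save them to a .tsv file) might look roughly like the sketch below. This is not the code in mpmp/utilities: the function name, the cache file naming scheme, the column names, and the `compute_fn` callable standing in for the actual PCA step are all assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path


def load_or_compute_features(data_type, compute_fn, cache_dir):
    """Return (rows, was_cached); each row is [sample_id, pc0, pc1, ...].

    compute_fn(data_type) is a hypothetical callable that runs the PCA and
    returns the rows; it is only called when no cached .tsv file exists.
    """
    cache_file = Path(cache_dir) / f"{data_type}_pca_features.tsv"

    if cache_file.is_file():
        # Cache hit: read the saved components back instead of recomputing.
        with open(cache_file, newline="") as f:
            reader = csv.reader(f, delimiter="\t")
            next(reader)  # skip the header row
            rows = [[r[0]] + [float(x) for x in r[1:]] for r in reader]
        return rows, True

    # Cache miss: compute the features and save them for the next run.
    rows = compute_fn(data_type)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    n_pcs = len(rows[0]) - 1
    with open(cache_file, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["sample_id"] + [f"PC{i}" for i in range(n_pcs)])
        writer.writerows(rows)
    return rows, False


def placeholder_pca(data_type):
    """Stand-in for the real PCA step; returns made-up values."""
    return [["sample_1", 0.5, -1.25], ["sample_2", 2.0, 0.1]]


with tempfile.TemporaryDirectory() as tmp_dir:
    for attempt in range(2):
        rows, was_cached = load_or_compute_features("expression", placeholder_pca, tmp_dir)
        print(f"attempt {attempt}: cached={was_cached}, n_samples={len(rows)}")
```

Keying the cache file on the data type means each data type is compressed once, and both the classification and survival scripts can read the same file.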
Results:
We decided to stick with PCA compressed features for the paper, and we show results for several different numbers of principal components, with age and mutation burden included as covariates in all the models. In general, we see that expression and methylation are fairly comparable for pan-cancer survival prediction (similar to our mutation prediction results). We also see that RPPA data performs fairly well, while miRNA and mutational signatures perform relatively poorly.
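As a rough illustration of that setup, the sketch below assembles a feature vector per sample from the first k principal components plus age and mutation burden, for a few values of k. The function name, the input layout, and all values are assumptions for illustration; the actual feature construction lives in the prediction code and may differ.

```python
def build_feature_rows(pcs_by_sample, covariates_by_sample, n_pcs):
    """Join the first n_pcs principal components with per-sample covariates.

    pcs_by_sample: {sample_id: [pc0, pc1, ...]}
    covariates_by_sample: {sample_id: (age, mutation_burden)}
    Samples without covariates are dropped.
    """
    features = {}
    for sample_id, pcs in pcs_by_sample.items():
        if sample_id not in covariates_by_sample:
            continue
        if len(pcs) < n_pcs:
            raise ValueError(f"{sample_id} has fewer than {n_pcs} components")
        age, mutation_burden = covariates_by_sample[sample_id]
        # Covariates come first so their columns stay fixed as n_pcs changes.
        features[sample_id] = [age, mutation_burden] + list(pcs[:n_pcs])
    return features


# Made-up values for three samples; sample_3 has no covariates and is dropped.
pcs = {
    "sample_1": [0.5, -1.25, 0.3],
    "sample_2": [2.0, 0.1, -0.7],
    "sample_3": [1.1, 0.4, 0.9],
}
covariates = {"sample_1": (54, 1.8), "sample_2": (67, 2.3)}

for k in (1, 2, 3):
    rows = build_feature_rows(pcs, covariates, n_pcs=k)
    print(k, {sample: len(vec) for sample, vec in rows.items()})
```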
We also looked at some survival curves, as suggested by @nrosed in #59, in the plot_survival_curves.ipynb notebook. For some cancer types (BRCA, LGG) predicted survival trended closely with age and subtype, and for others (LUAD) there wasn't much of a trend. You can see a few examples in the first few slides of this presentation: https://docs.google.com/presentation/d/1Kh6QUTadjk9RSQDXIIvfHF50a9cZb8uYmmqgwfeaUIg/edit?usp=sharing