CCLE mutation prediction

jjc2718 commented 1 year ago

PR description:

This PR implements mutation status prediction (whether or not a sample has a mutation in the given gene) for cell line data from CCLE, using gene expression. In general this is harder than it was for TCGA, because CCLE has about an order of magnitude fewer cell lines than TCGA has tumor samples (see #51 for some analysis of cancer types and the proportion of mutated cell lines in the CCLE data).

Likely because of this, mutation prediction doesn't work as well on CCLE data as it did for TCGA, both in terms of overall performance and in terms of effectiveness of our feature selection method. Here are the results for EGFR, a gene that worked well on TCGA as shown in #48:

[Figure: EGFR_all_summary]

[Figure: EGFR_non_carcinoma_summary]

So we can see that there's very little difference between pancan_f_test and median_f_test, and also that most of the delta_aupr values are close to 0, suggesting that our models aren't doing much better than a model with randomly permuted labels.
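For readers unfamiliar with the metric, here is a minimal sketch of how a delta_aupr value could be computed. It assumes delta_aupr is the AUPR of the model fit on the true labels minus the AUPR of a baseline fit on permuted labels, both scored against the true held-out labels; the exact definition used in this repo may differ. The function names and toy scores below are hypothetical, and AUPR is approximated with a simple average precision that does not treat tied scores specially.

```python
def average_precision(labels, scores):
    """Average precision: a step-wise estimate of the area under the PR curve.

    labels are 0/1 mutation calls; scores are predicted probabilities.
    Tied scores are ranked in arbitrary (stable sort) order in this sketch.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive (mutated) sample")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


def delta_aupr(labels, signal_scores, shuffled_scores):
    """AUPR with true labels minus AUPR of the permuted-label baseline (assumed definition)."""
    return average_precision(labels, signal_scores) - average_precision(labels, shuffled_scores)


# Toy held-out data: hypothetical values, not results from this PR.
y_test = [1, 0, 1, 0]
scores_true_model = [0.9, 0.8, 0.7, 0.1]      # model fit on real labels
scores_shuffled_model = [0.1, 0.9, 0.2, 0.8]  # model fit on permuted labels
print(delta_aupr(y_test, scores_true_model, scores_shuffled_model))
```

Under this definition, a value near 0 means the real model ranks mutated samples about as well as the permuted-label baseline does, which is the pattern described above for most CCLE genes.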

Code changes:

08_cell_line_prediction/run_ccle_mutation_prediction.py to run experiments
pancancer_evaluation/data_models/ccle_data_model.py and pancancer_evaluation/utilities/ccle_data_utilities.py to load the CCLE data using the same preprocessing pipeline we used for the TCGA data
Some small edits to pancancer_evaluation/utilities/data_utilities.py to correctly load oncogene/TSG info
08_cell_line_prediction/plot_mutation_prediction_results.ipynb to visualize results

ben-heil commented on 2022-09-09T21:18:53Z ----------------------------------------------------------------

It never ceases to amaze me how strong "use more genes" is as a feature selection technique. I know that e.g. median f test is better in some cases, but even in a dataset like this where everything performs roughly the same, going from 100 genes to 1000 doubles the mean AUPR.

jjc2718 commented on 2022-09-12T16:06:38Z ----------------------------------------------------------------

Yeah! I would expect it to taper off fairly quickly because there's so much redundancy/collinearity in gene expression, but it really doesn't, at least not here. Maybe I'll look at some learning curves for one or two genes to get an idea of when adding features stops improving performance for this problem.
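As a rough illustration of the kind of check discussed in this thread, the sketch below ranks features by a plain one-way F statistic and scores nested top-k feature sets. It is an assumption-laden toy, not the repo's pancan_f_test or median_f_test; the function names are hypothetical, and `evaluate` is a placeholder the caller supplies (for example, something that returns cross-validated AUPR).

```python
from statistics import fmean


def f_statistic(values, labels):
    """One-way ANOVA F statistic for one feature across the 0/1 label groups.

    Assumes both classes are present in labels.
    """
    groups = {0: [], 1: []}
    for value, label in zip(values, labels):
        groups[label].append(value)
    grand_mean = fmean(values)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values())
    ss_within = sum((v - fmean(g)) ** 2 for g in groups.values() for v in g)
    df_within = len(values) - len(groups)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return ss_between / (ss_within / df_within)


def top_k_features(X, y, k):
    """Column indices of the k features with the largest F statistic."""
    n_features = len(X[0])
    scores = [f_statistic([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


def feature_count_curve(X, y, ks, evaluate):
    """Map each feature-set size k to evaluate(X restricted to the top k features, y)."""
    curve = {}
    for k in ks:
        cols = top_k_features(X, y, k)
        X_subset = [[row[j] for j in cols] for row in X]
        curve[k] = evaluate(X_subset, y)
    return curve
```

In a real experiment the features should be selected on the training folds only, since ranking features on the same samples used for evaluation leaks label information. Plotting the resulting scores against k (e.g. 100, 250, 500, 1000, ...) would show where adding genes stops helping.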

ben-heil commented on 2022-09-09T21:19:41Z ----------------------------------------------------------------

Can probably remove (or alternatively replace with interpretation)

jjc2718 commented on 2022-09-12T16:00:38Z ----------------------------------------------------------------

Good catch! I think I'll just remove it for now. I'm planning to make some tweaks to this in the coming weeks, so any interpretation I add now is likely to change.

greenelab / pancancer-evaluation

CCLE mutation prediction #54