greenelab / pancancer-evaluation

Evaluating genome-wide prediction of driver mutations using pan-cancer data
BSD 3-Clause "New" or "Revised" License

CCLE mutation prediction #54

Closed by jjc2718 1 year ago

jjc2718 commented 1 year ago

PR description:

This PR implements mutation status prediction (whether or not a sample has a mutation in the given gene) for cell line data from CCLE, using gene expression as the input. In general, this is harder than it was for TCGA because CCLE has about an order of magnitude fewer cell lines than TCGA has tumor samples (see #51 for some analysis of cancer types and mutated proportion of cell lines in the CCLE data).

Likely because of this, mutation prediction doesn't work as well on CCLE data as it did for TCGA, both in terms of overall performance and in terms of effectiveness of our feature selection method. Here are the results for EGFR, a gene that worked well on TCGA as shown in #48:

[Figure: EGFR_all_summary]

[Figure: EGFR_non_carcinoma_summary]

So we can see that there's very little difference between pancan_f_test and median_f_test, and also that most of the delta_aupr values are close to 0, suggesting that our models aren't doing much better than a model with randomly permuted labels.
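For readers unfamiliar with the metric, here is a minimal sketch of how a delta_aupr value could be computed. It assumes delta_aupr means AUPR with the true labels minus AUPR for a shuffled-label baseline, which matches the description above but is not taken from the PR's code. The function names and the toy scores are hypothetical, and AUPR is approximated as average precision.

```python
def average_precision(y_true, scores):
    """Approximate AUPR as average precision.

    Samples are ranked by descending score; precision is recorded at the rank
    of every positive sample and averaged over the positives.
    Assumes binary 0/1 labels and no tied scores.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive sample")
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


def delta_aupr(y_true, signal_scores, shuffled_scores):
    """AUPR on the true labels minus AUPR of a shuffled-label baseline (assumed definition)."""
    return average_precision(y_true, signal_scores) - average_precision(
        y_true, shuffled_scores
    )


# Hypothetical toy data: 2 mutated samples out of 8.
y_true = [1, 0, 1, 0, 0, 0, 0, 0]
# Scores from a model fit on the true labels (made up for illustration).
signal_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1, 0.05, 0.01]
# Scores from a model fit on permuted labels (made up for illustration).
shuffled_scores = [0.4, 0.9, 0.1, 0.8, 0.7, 0.3, 0.2, 0.15]

delta = delta_aupr(y_true, signal_scores, shuffled_scores)
print(f"delta_aupr = {delta:.3f}")
```

In this toy case the shuffled baseline lands at the proportion of positive samples (2/8), which is roughly what an uninformative ranking is expected to give. A delta_aupr near 0 therefore means the model ranks mutated samples about as well as chance.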

Code changes:

review-notebook-app[bot] commented 1 year ago

ben-heil commented on 2022-09-09T21:18:53Z

It never ceases to amaze me how strong "use more genes" is as a feature selection technique. I know that e.g. median f test is better in some cases, but even in a dataset like this where everything performs roughly the same, going from 100 genes to 1000 doubles the mean AUPR.

jjc2718 commented on 2022-09-12T16:06:38Z

Yeah! I would expect it to taper off fairly quickly because there's so much redundancy/collinearity in gene expression, but it really doesn't, at least not here. Maybe I'll look at some learning curves for one or two genes to get an idea of when adding features stops improving performance for this problem.
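
A rough sketch of what such a curve over the number of selected features could look like, using only the Python standard library. Everything here is an assumption for illustration: the data is synthetic, features are ranked with a plain one-way F-statistic on the training set (not the pancan_f_test or median_f_test variants, which aren't defined in this thread), and a nearest-centroid scorer stands in for whatever model the PR actually uses.

```python
import random
from statistics import mean


def f_statistic(values, labels):
    """One-way ANOVA F-statistic for one feature across two label groups.

    Assumes both classes (0 and 1) are present in `labels`.
    """
    groups = [[v for v, y in zip(values, labels) if y == c] for c in (0, 1)]
    grand = mean(values)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)  # df = 1
    within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / (within / (len(values) - 2))


def rank_features(X, y):
    """Feature indices sorted from highest to lowest F-statistic."""
    scores = [f_statistic([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def average_precision(y_true, scores):
    """Approximate AUPR as average precision (assumes no tied scores)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / sum(y_true)


def centroid_scores(X_train, y_train, X_test, features):
    """Score test samples by how much closer they are to the positive centroid."""
    pos = [mean(r[j] for r, y in zip(X_train, y_train) if y == 1) for j in features]
    neg = [mean(r[j] for r, y in zip(X_train, y_train) if y == 0) for j in features]
    scores = []
    for row in X_test:
        d_pos = sum((row[j] - p) ** 2 for j, p in zip(features, pos))
        d_neg = sum((row[j] - n) ** 2 for j, n in zip(features, neg))
        scores.append(d_neg - d_pos)
    return scores


def feature_learning_curve(X_train, y_train, X_test, y_test, k_values):
    """Held-out AUPR as a function of the number of top-ranked features used."""
    ranking = rank_features(X_train, y_train)  # ranked on training data only
    return {
        k: average_precision(
            y_test, centroid_scores(X_train, y_train, X_test, ranking[:k])
        )
        for k in k_values
    }


rng = random.Random(0)


def make_data(n_samples, n_features, n_informative):
    """Synthetic 'expression' matrix: the first n_informative features shift with the label."""
    X, y = [], []
    for i in range(n_samples):
        label = 1 if i % 4 == 0 else 0
        X.append(
            [
                rng.gauss(0.5 * label if j < n_informative else 0.0, 1.0)
                for j in range(n_features)
            ]
        )
        y.append(label)
    return X, y


X_train, y_train = make_data(80, 50, 20)
X_test, y_test = make_data(40, 50, 20)
curve = feature_learning_curve(X_train, y_train, X_test, y_test, [1, 5, 10, 25, 50])
for k, aupr in curve.items():
    print(k, round(aupr, 3))
```

With real expression data, the k at which the curve flattens would indicate where adding more genes stops helping. The synthetic numbers printed here say nothing about CCLE or TCGA.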