greenelab / pancancer-evaluation

Evaluating genome-wide prediction of driver mutations using pan-cancer data
BSD 3-Clause "New" or "Revised" License

Drug response regression, cancer type holdout experiments #59

Closed jjc2718 closed 1 year ago

jjc2718 commented 1 year ago

Similar to #57, we wanted to try holding out some cancer types entirely (either liquid vs. solid, or individual cancer types in CCLE), this time using regression instead of classification. Results are broadly similar: there's not much separation between feature selection methods, and selecting random features seems hard to beat. Performance over the baseline is probably somewhat better, though, which suggests that regression rather than classification is the way to go.

When we look at individual cancer types we often see pretty high variability between CV folds. For example, here are results for docetaxel response prediction:

[image: docetaxel response prediction results]

For cervical cancer or sarcoma, for example, the error bars reach from -1 (perfect negative correlation between predicted and true labels) to 1 (perfect positive correlation). My guess is that this is because some of the test sets contain only a handful of samples (maybe 2 or 3), so Spearman correlations of -1 or 1 are pretty likely if you guess labels at random. For the cancer types with more cell lines (breast, colon, lung, etc.) we see much less variance.
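To make the small-sample point concrete, here is a minimal sketch (not code from this repo) that enumerates every possible untied ranking of `n` test samples and counts how often a random guess lands on a Spearman correlation of exactly -1 or 1. The function names are made up for illustration, and it assumes no tied values in either the predictions or the true labels.

```python
from fractions import Fraction
from itertools import permutations


def spearman_no_ties(pred_ranks, true_ranks):
    """Spearman's rho for two rankings without ties, as an exact Fraction."""
    n = len(true_ranks)
    # Sum of squared rank differences
    d2 = sum((p - t) ** 2 for p, t in zip(pred_ranks, true_ranks))
    # Standard closed form, valid only when there are no ties (and n >= 2)
    return 1 - Fraction(6 * d2, n * (n * n - 1))


def prob_extreme_rho(n):
    """Probability that a uniformly random ranking of n samples gives rho = -1 or 1."""
    true_ranks = tuple(range(n))
    rhos = [spearman_no_ties(p, true_ranks) for p in permutations(range(n))]
    return Fraction(sum(abs(r) == 1 for r in rhos), len(rhos))


for n in (2, 3, 5, 8):
    print(n, prob_extreme_rho(n))
```

With 2 test samples every untied guess gives -1 or 1, and with 3 samples one guess in three does, since only 2 of the `n!` possible rankings are perfectly ordered or perfectly reversed. That probability shrinks factorially as `n` grows, which is consistent with the tighter error bars for the larger cancer types.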

As a next step we're thinking about filtering to cancer types with at least a certain number of samples (maybe 10 or 15) assayed for the given drug. That may cut down on some of the variance/randomness, although I doubt it will drastically change results.
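A rough sketch of what that filter could look like, assuming the data for one drug is available as `(cell_line, cancer_type, response)` tuples. That layout, the function name, and the toy values below are all hypothetical and not taken from this repo.

```python
from collections import Counter


def filter_by_min_samples(records, min_samples=10):
    """Keep records whose cancer type has at least min_samples assayed cell lines.

    records: list of (cell_line, cancer_type, response) tuples for one drug.
    This layout is an assumption made for illustration only.
    """
    # Count assayed cell lines per cancer type
    counts = Counter(cancer_type for _, cancer_type, _ in records)
    return [r for r in records if counts[r[1]] >= min_samples]


# Toy data: 12 "lung" lines and 2 "sarcoma" lines (made-up values)
toy = [(f"lung_{i}", "lung", 0.1 * i) for i in range(12)]
toy += [(f"sarcoma_{i}", "sarcoma", 0.5) for i in range(2)]

kept = filter_by_min_samples(toy, min_samples=10)
print(sorted({cancer_type for _, cancer_type, _ in kept}))
```

The threshold is left as a parameter because the right cutoff (10, 15, or something else) is still an open question here.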

ben-heil commented 1 year ago

Ahhh well today I learned something about R^2