Liquid vs. solid cancer experiments

jjc2718 commented 1 year ago

Following on from #56, we wanted to try holding out some cancer types and testing how well drug response prediction generalizes. To do this, we classified each cancer type as either "liquid" (blood cancers, e.g. leukemia, lymphoma, myeloma) or "solid" (everything else), then held out each of these groups in turn, training on either the same group or the opposite one (i.e. generalization from liquid -> solid and vice versa).
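As a rough illustration of the hold-out setup described above, here is a minimal Python sketch. The names (`LIQUID_TYPES`, `cancer_group`, `holdout_split`), the 25% test fraction, and the cancer type labels are assumptions made for illustration, not the code in this PR.

```python
import random

# Assumed labels: the description only names leukemia, lymphoma and myeloma
# as examples of "liquid" cancers; everything else is treated as "solid".
LIQUID_TYPES = {"leukemia", "lymphoma", "myeloma"}


def cancer_group(cancer_type):
    """Map a cancer type label to 'liquid' or 'solid'."""
    return "liquid" if cancer_type.lower() in LIQUID_TYPES else "solid"


def holdout_split(samples, test_group, train_on, test_fraction=0.25, seed=0):
    """Split samples for one hold-out experiment.

    samples:    list of (sample_id, cancer_type) pairs
    test_group: 'liquid' or 'solid', the group we evaluate on
    train_on:   'same' (train on the rest of test_group) or
                'opposite' (train on the other group only)
    """
    in_group = [sid for sid, ct in samples if cancer_group(ct) == test_group]
    out_group = [sid for sid, ct in samples if cancer_group(ct) != test_group]

    # Shuffle with a fixed seed so both training conditions share one test set.
    rng = random.Random(seed)
    rng.shuffle(in_group)
    n_test = max(1, int(len(in_group) * test_fraction))
    test_ids = in_group[:n_test]

    if train_on == "same":
        train_ids = in_group[n_test:]
    elif train_on == "opposite":
        train_ids = out_group
    else:
        raise ValueError("train_on must be 'same' or 'opposite'")
    return train_ids, test_ids
```

Using the same seed for both conditions keeps the test samples identical, so any difference in performance comes from the training set rather than from which samples were evaluated.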

We tried this for a few different drugs, and it seems to be a pretty difficult problem. In most cases, we see that the feature selection methods don't outperform selecting the same number of random features:

The only markedly different case is cisplatin, where the median F-test selection method does seem to perform well:

Strangely, this is the opposite of what we were expecting: cisplatin is a general chemotherapy/antineoplastic drug (it kills actively proliferating cells indiscriminately), as opposed to some of the other drugs we looked at (e.g. erlotinib, trametinib), which are targeted toward specific cancer mutations. We anticipated better performance for the targeted therapies, since we'd expect to see identifiable gene expression changes associated with the driver mutations they target, although I guess it's possible that there's an "actively proliferating" gene expression signal our classifiers are picking up on.
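For readers unfamiliar with the comparison discussed above, the sketch below shows one plausible reading of it in plain Python. It assumes that "median F-test" means computing a one-way ANOVA F-statistic per gene within each cancer type, taking the median across cancer types, and keeping the top `k` genes; that reading, and the helper names `f_statistic`, `median_f_select`, and `random_select`, are assumptions rather than the implementation in this PR.

```python
import random
from statistics import mean, median


def f_statistic(values, labels):
    """One-way ANOVA F-statistic for a single feature, grouped by label."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        return 0.0  # not enough classes or samples to compute F
    grand = mean(values)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values()) / (k - 1)
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g) / (n - k)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def median_f_select(X, y, cancer_types, k):
    """Return indices of the k genes with the highest median per-type F.

    X is a list of rows (samples x genes), y holds response labels, and
    cancer_types holds one cancer type label per sample.
    """
    by_type = {}
    for i, ct in enumerate(cancer_types):
        by_type.setdefault(ct, []).append(i)
    scores = []
    for j in range(len(X[0])):
        per_type = [
            f_statistic([X[i][j] for i in idx], [y[i] for i in idx])
            for idx in by_type.values()
        ]
        scores.append(median(per_type))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def random_select(n_genes, k, seed=0):
    """Baseline: pick k gene indices uniformly at random."""
    return random.Random(seed).sample(range(n_genes), k)
```

Both selectors return `k` column indices, so the same downstream classifier can be trained on either set and compared on equal footing.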

jjc2718 commented 1 year ago

Maybe Cisplatin is more predictable because broken DNA damage repair pathways have stronger signals than others? But yeah, I'd expect targeted therapies to be more predictable too.

Are there any drugs with mechanisms of action that are well enough known that you can manually select the relevant genes?

Yeah, it's possible that it's just a DDR-related signal! I think there are some drugs where the mechanism of action is well known, but what's actually downstream of the affected/inhibited gene isn't, so it's not totally clear which genes will have expression changes in sensitive/responsive cell lines. I know there are "regulon" databases that we could use to look at which genes are downstream of EGFR or BRAF or whatever gene we choose, but I'm still not convinced that would outperform just including a bunch of genes and letting the model's regularization figure it out, at least based on our past experiments. Gene expression is complicated (as you know)!

greenelab / pancancer-evaluation

Liquid vs. solid cancer experiments #57