greenelab / pancancer-evaluation

Evaluating genome-wide prediction of driver mutations using pan-cancer data
BSD 3-Clause "New" or "Revised" License

Cross-cancer prediction experiments #33

Closed jjc2718 closed 3 years ago

jjc2718 commented 3 years ago

Sorry for the fairly large PR - it's not quite as big as it looks. A few files have just been moved and don't need to be re-reviewed:

Files in 04_cross_cancer_classification are new.

Major changes:

jjc2718 commented 3 years ago

Thanks for the feedback! Responses below:

I'm definitely not an expert on how to organize repos... but if you wanted to be consistent, I can see adding another dir for "data processing" that will include your 00 and 01 notebooks. Maybe 05 should go into a dir for "evaluating trained model" (not exactly sure of the name). This setup does remind me of Greg's BioBombe!

That makes sense! I think Greg's repos have separate directories for everything (no top-level notebooks), which seems like what you're suggesting. I'll take your suggestion for the data processing scripts, and I think I'll move the 05 notebook in the future (we'll probably have other analyses that will make sense to group it with).

For 04 ipynb:

  1. Just to clarify (again), so you've trained a model using gene expression data (training set) to predict if gene X in cancer A is mutated. Then you use this model to test if gene X is mutated in cancer A (using test set). You also use this model to test if gene X is mutated in cancer B

Yep, this sounds exactly right (we also test if gene Y is mutated in cancer A or B, where Y != X).
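For readers following along, the setup described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `cross_cancer_split`, the 20% holdout fraction, and the toy sample IDs are all assumptions made for the example.

```python
import random


def cross_cancer_split(samples, train_cancer, test_cancer,
                       test_fraction=0.2, seed=42):
    """Split samples for one cross-cancer experiment (hypothetical helper).

    samples: list of (sample_id, cancer_type) tuples.
    Returns (train_ids, same_cancer_test_ids, other_cancer_test_ids).
    """
    rng = random.Random(seed)
    # Samples from cancer A are shuffled and split into train / held-out test.
    in_a = sorted(sid for sid, cancer in samples if cancer == train_cancer)
    rng.shuffle(in_a)
    n_test = max(1, round(len(in_a) * test_fraction))
    same_cancer_test = in_a[:n_test]
    train = in_a[n_test:]
    # All samples from cancer B are used only for testing.
    other_cancer_test = sorted(
        sid for sid, cancer in samples if cancer == test_cancer
    )
    return train, same_cancer_test, other_cancer_test


# Toy sample IDs, made up for illustration.
samples = [(f"THCA-{i}", "THCA") for i in range(10)]
samples += [(f"SKCM-{i}", "SKCM") for i in range(5)]

train, test_a, test_b = cross_cancer_split(samples, "THCA", "SKCM")
```

A model for gene X would be fit on `train`, then scored separately on `test_a` (same cancer type) and `test_b` (different cancer type). The labels could come from gene X or from a different gene Y.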

  1. For this line, "# get rows that have the same gene (possibly different identifiers)", I feel like I would say "possibly different cancer types" instead. I guess that is what you mean to say, so maybe something to add to give a bit more specificity

Good catch! I'll change this.
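As a rough illustration of the filter that comment describes, here is a sketch that assumes identifiers have the form `GENE_CANCER` (as in `BRAF_THCA`). The column names `train_identifier` and `test_identifier`, the helper `same_gene`, and the example rows are hypothetical and may not match the notebook.

```python
def same_gene(train_identifier, test_identifier):
    """True if both identifiers refer to the same gene (hypothetical helper).

    Assumes identifiers look like "GENE_CANCER", e.g. "BRAF_THCA".
    """
    train_gene = train_identifier.split("_")[0]
    test_gene = test_identifier.split("_")[0]
    return train_gene == test_gene


# Made-up result rows; real rows would also carry performance metrics.
results = [
    {"train_identifier": "BRAF_THCA", "test_identifier": "BRAF_SKCM"},
    {"train_identifier": "BRAF_THCA", "test_identifier": "KRAS_COAD"},
    {"train_identifier": "TP53_BRCA", "test_identifier": "TP53_BRCA"},
]

# get rows that have the same gene (possibly different cancer types)
same_gene_rows = [
    row for row in results
    if same_gene(row["train_identifier"], row["test_identifier"])
]
```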

  1. For your conclusion, when you say "we were hoping to identify new relationships between genes in certain cancer types, but it doesn't look like we're able to do that here, at least for these genes", are you saying that you would've expected that your model trained on BRAF THCA would tell you something about BRAF SKCM? Or are you talking about within cancer types? Maybe both?

Yeah, we were hoping to see both. Mainly we were hoping that (1) models trained on a gene in one cancer type can predict that gene well in other cancer types, and (2) models for a given gene can predict related genes well (e.g. KRAS -> NRAS).

We see evidence of (1) for TP53 but not really for other genes, and no strong evidence of any examples of (2) for the single-cancer experiments.

  1. Any known relationship between KRAS and BRAF? (just curious)

Yep, they're in the same pathway and are thought to have similar oncogenic effects in some cancers, although this isn't well understood and is almost certainly cancer-type-specific. In Greg's Ras pathway paper he showed that pan-cancer classifiers trained to detect Ras mutations often had higher scores for BRAF mutations in data from cancer cell lines, which suggests that mutations in Ras genes and BRAF could have similar functional effects in some contexts.

We don't see this effect super strongly here, but maybe weakly in a few cases (BRAF_THCA is somewhat well predicted by the pan-cancer KRAS model, for instance).

  1. As a followup experiment, does it make sense to train a pancancer model on multiple genes and then look at how the model performs on test data? So looking at how information from more genes can help, as opposed to more cancer types? Could pick genes from the same module vs different modules...?

Yeah, definitely. We just need to think about how to do this. It might make sense to train a model on the union of two genes (i.e. the positively labeled samples are ones with a mutation in either KRAS or BRAF, or something like that), and see how well this performs on a test set. I just need to spend some time thinking through the experimental design, but it's definitely a possible next step.
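To make the union idea concrete, here's a minimal sketch of how the labels might be built. The helper `union_labels`, the dict-of-sets input format, and the toy samples are assumptions for illustration, and the experimental design itself is still undecided.

```python
def union_labels(mutations, genes):
    """Label a sample 1 if it has a mutation in any of `genes`, else 0.

    mutations: dict mapping sample_id -> set of mutated gene names.
    Hypothetical helper; the input format is an assumption.
    """
    genes = set(genes)
    return {
        sample_id: int(bool(mutated & genes))
        for sample_id, mutated in mutations.items()
    }


# Toy mutation calls, made up for illustration.
mutations = {
    "sample_1": {"KRAS"},
    "sample_2": {"BRAF", "TP53"},
    "sample_3": {"TP53"},
    "sample_4": set(),
}

labels = union_labels(mutations, ["KRAS", "BRAF"])
```

Here `sample_1` and `sample_2` end up positive, while `sample_3` (TP53 only) and `sample_4` (no mutations) end up negative.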

  1. Is the reason that the pancancer models trained on TP53 or KRAS perform well that these are the most commonly mutated genes across cancer types? I guess I'm wondering about the correlation between mutation frequency across cancers and these cross-cancer results.

Great point, this is something I've been thinking about as well. I do think the ubiquity of TP53 mutations is a large part of what makes it easy to train good classifiers on, since we have many positively labeled samples to learn from in most cancer types. I'll definitely try to think about how to better understand the effects of label imbalance in this data moving forward.
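One simple starting point for looking at label imbalance would be to tabulate the fraction of positively labeled samples per cancer type for a given gene. The sketch below is only an illustration: `positive_rate_by_cancer` is a hypothetical helper, and the cancer type names and labels are toy values, not results from this data.

```python
from collections import defaultdict


def positive_rate_by_cancer(samples):
    """Fraction of positively labeled samples in each cancer type.

    samples: iterable of (cancer_type, label) pairs, with label 0 or 1.
    Hypothetical helper for illustration.
    """
    counts = defaultdict(lambda: [0, 0])  # cancer_type -> [n_positive, n_total]
    for cancer_type, label in samples:
        counts[cancer_type][0] += label
        counts[cancer_type][1] += 1
    return {
        cancer_type: n_pos / n_total
        for cancer_type, (n_pos, n_total) in counts.items()
    }


# Toy labels for two placeholder cancer types "A" and "B".
toy = [("A", 1), ("A", 0), ("A", 1), ("A", 1),
       ("B", 0), ("B", 0), ("B", 0), ("B", 1)]

rates = positive_rate_by_cancer(toy)
```

Comparing these rates against cross-cancer performance would be one way to check how much of the TP53 result is explained by simply having more positive examples.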