greenelab / mpmp

Multimodal Pan-cancer Mutation Prediction
BSD 3-Clause "New" or "Revised" License

Predictions on samples from normal tissue #86

Closed jjc2718 closed 2 years ago

jjc2718 commented 2 years ago

One of our reviewers asked about our model's predictions on the TCGA samples from healthy tissue, which are present for some cancer types but not others (see 01_explore_data/normal_tissue_samples.ipynb for some exploration of this).

This PR adds code to train and save a model on the whole (tumor) dataset for a given gene, and a notebook (07_train_final_classifiers/predict_normal.ipynb) to apply these trained models to the normal samples and compare their predictions to the predictions made on tumor samples.

For now, we just did this for the genes we looked at in the multi-omics analysis: TP53, KRAS, EGFR, SETD2, PIK3CA, IDH1. In most cases we do see that normal samples are predicted to have a low-ish probability of mutation. Here are the results for TP53 as an example:

[image: results for TP53]

So predictions for the normal samples are generally lower than for the mutated tumor samples (true positives), but not as low as for the non-mutated tumor samples (true negatives). This makes some sense: the normal samples are "out of distribution" (they weren't present in the training set and probably don't exactly "look like" any of the tumor samples), so we'd expect their predicted mutation probabilities to be closer to 0.5 than those of the "true" negative tumor samples that we fit the model on.
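To make the comparison concrete, here is a minimal sketch on synthetic data. It is not the code in this PR: the `simulate` helper, the feature counts, the shifts, and the choice of scikit-learn's `LogisticRegression` are all assumptions made for illustration. The synthetic normal samples are deliberately placed between the two tumor classes on the signal features and shifted on the remaining ones, so the output shows how the comparison is set up rather than providing evidence for the result.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_features, n_signal = 20, 5  # hypothetical sizes, not the real feature counts


def simulate(n, signal_shift, other_shift=0.0):
    """Draw n synthetic samples; only the first n_signal features carry mutation signal."""
    X = rng.normal(size=(n, n_features))
    X[:, :n_signal] += signal_shift
    X[:, n_signal:] += other_shift
    return X


# Synthetic stand-ins for the three groups (assumed, not TCGA data).
X_wildtype = simulate(300, signal_shift=-1.0)                 # tumor, non-mutated
X_mutated = simulate(300, signal_shift=1.0)                   # tumor, mutated
X_normal = simulate(100, signal_shift=-0.3, other_shift=2.0)  # normal tissue, off-distribution

# Fit on the whole tumor dataset only; normal samples are never seen in training.
X_tumor = np.vstack([X_wildtype, X_mutated])
y_tumor = np.concatenate([np.zeros(300), np.ones(300)])
model = LogisticRegression(max_iter=1000).fit(X_tumor, y_tumor)


def mean_prob(X):
    """Mean predicted probability of mutation for a group of samples."""
    return float(model.predict_proba(X)[:, 1].mean())


summary = {
    "tumor, non-mutated": mean_prob(X_wildtype),
    "normal": mean_prob(X_normal),
    "tumor, mutated": mean_prob(X_mutated),
}
for group, prob in summary.items():
    print(f"{group}: mean predicted probability = {prob:.3f}")
```

Note that the two tumor groups are scored in-sample here, which matches the setup described above (the model is fit on the whole tumor dataset) but makes their probabilities look more extreme than held-out predictions would.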

Code changes:

jjc2718 commented 2 years ago

> I'm not sure of the context of the reviewer comment, but if your goal is to say "the normal samples are out of distribution", then running a quick UMAP or PCA to show that the normal samples live in a different world may support your case.

This is a good idea! I'll do this as part of my next PR.
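A rough sketch of what the PCA check could look like, again on synthetic data. The array names, sizes, and shifts are hypothetical, and the real analysis would use the actual tumor and normal feature matrices. PCA is fit on tumor and normal samples together, and the distance between the two group centroids in the first two components gives a quick numeric summary. A scatter plot of `pcs` colored by `is_normal` would be the usual visual version.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_features = 20  # hypothetical; the real data has far more features

# Synthetic tumor samples: two classes that differ on the first 5 features.
X_tumor = rng.normal(size=(600, n_features))
X_tumor[:300, :5] -= 1.0
X_tumor[300:, :5] += 1.0

# Synthetic normal samples: shifted on the remaining features (assumed).
X_normal = rng.normal(size=(100, n_features))
X_normal[:, 5:] += 2.0

X_all = np.vstack([X_tumor, X_normal])
is_normal = np.concatenate([np.zeros(600, dtype=bool), np.ones(100, dtype=bool)])

# Project all samples onto the top two principal components.
pcs = PCA(n_components=2).fit_transform(X_all)

tumor_centroid = pcs[~is_normal].mean(axis=0)
normal_centroid = pcs[is_normal].mean(axis=0)
centroid_distance = float(np.linalg.norm(normal_centroid - tumor_centroid))

print("tumor centroid (PC1, PC2): ", np.round(tumor_centroid, 2))
print("normal centroid (PC1, PC2):", np.round(normal_centroid, 2))
print(f"distance between centroids: {centroid_distance:.2f}")
```

Fitting PCA on the combined data is a choice: fitting on tumor samples alone and then projecting the normal samples could hide the separation if the normal samples differ mainly along directions with little variance among tumors.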