Closed: jjc2718 closed this pull request 2 years ago
I'm not sure of the context of the reviewer comment, but if your goal is to say "the normal samples are out of distribution", then running a quick UMAP or PCA projection to show that the normal samples live in a different world may support your case.
This is a good idea! I'll do this as part of my next PR.
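The suggestion above could be sketched roughly as follows. This is an illustrative example, not code from the repo: the variable names (`tumor_X`, `normal_X`) and the synthetic data are stand-ins, and the idea is just to project both sample groups with PCA and check whether the normal samples separate from the tumor samples.

```python
# Hypothetical sketch of the reviewer's suggestion: project tumor and
# normal samples into 2D with PCA and check whether the normal samples
# form a separate cluster. Data here is synthetic, with the normal
# samples mean-shifted to mimic an out-of-distribution group.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
tumor_X = rng.normal(0.0, 1.0, size=(100, 50))   # 100 tumor samples
normal_X = rng.normal(2.0, 1.0, size=(20, 50))   # 20 shifted normal samples

X = np.vstack([tumor_X, normal_X])
coords = PCA(n_components=2).fit_transform(X)

tumor_coords, normal_coords = coords[:100], coords[100:]
# if normal samples are out of distribution, their centroid should sit
# far from the tumor centroid in the projected space
centroid_gap = np.linalg.norm(
    tumor_coords.mean(axis=0) - normal_coords.mean(axis=0)
)
print(centroid_gap)
```

In practice you'd scatter-plot `coords` colored by tumor/normal status rather than compute a centroid distance, but the distance gives a quick numeric check.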
One of our reviewers asked about our model's predictions on the TCGA samples from healthy tissue, which are present for some cancer types but not others (see
01_explore_data/normal_tissue_samples.ipynb
for some exploration of this). This PR adds code to train and save a model on the whole (tumor) dataset for a given gene, and a notebook (
07_train_final_classifiers/predict_normal.ipynb
) to apply these trained models to the normal samples and compare their predictions with the predictions made on tumor samples. For now, we just did this for the genes we looked at in the multi-omics analysis: TP53, KRAS, EGFR, SETD2, PIK3CA, and IDH1. In most cases we do see that normal samples are predicted to have a low-ish probability of mutation. Here are the results for TP53 as an example:
So predictions for the normal samples are generally lower than they are for the mutated samples (true positives), but not as low as they are for the non-mutated tumor samples (true negatives). This makes some sense: the normal samples are "out of distribution" (they weren't present in the training set and probably don't closely resemble any of the tumor samples), so we'd expect their predicted mutation probabilities to sit closer to 0.5 than those of the "true" negative tumor samples that we fit the model on.
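The comparison described above can be sketched as follows. This is a hedged, self-contained illustration rather than the repo's actual pipeline: the classifier, feature counts, and synthetic data are all assumptions chosen to reproduce the qualitative pattern (normal samples scoring between the two tumor groups).

```python
# Illustrative sketch (not the repo's code): fit a classifier on tumor
# samples only, then apply it to held-out "normal" samples and compare
# mean predicted mutation probabilities across the three groups.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
# synthetic tumor data: mutated samples shifted relative to non-mutated
X_mut = rng.normal(1.0, 1.0, size=(80, 20))
X_wt = rng.normal(-1.0, 1.0, size=(80, 20))
X_train = np.vstack([X_mut, X_wt])
y_train = np.array([1] * 80 + [0] * 80)

# normal samples: out of distribution, centered between the two classes
X_normal = rng.normal(0.0, 1.0, size=(15, 20))

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
p_mut = model.predict_proba(X_mut)[:, 1].mean()        # true positives
p_wt = model.predict_proba(X_wt)[:, 1].mean()          # true negatives
p_normal = model.predict_proba(X_normal)[:, 1].mean()  # out of distribution

# expected ordering: normal samples fall between the two tumor groups,
# closer to 0.5 than either
print(p_wt, p_normal, p_mut)
```

With data constructed this way, `p_normal` lands between `p_wt` and `p_mut`, mirroring the pattern described for the TP53 results.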
Code changes:
- `07_train_final_classifiers/train_classifier.py`: script to train a classifier for a single gene/set of parameters on the entire dataset (no cross-validation), and to save the model to `results_dir`
- `mpmp/utilities/param_results_utilities.py`: set of functions to get the "best" parameter choices from a set of results across different outer cross-validation folds
- changes to `mpmp/prediction/cross_validation.py` and `mpmp/prediction/classification.py` to train a single model on the whole dataset
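The workflow those changes support (pick the "best" parameters across outer CV folds, then refit once on everything) might look like this. It's a minimal sketch using scikit-learn, not the repo's actual API; the candidate grid and scoring are assumptions.

```python
# Illustrative sketch: choose the hyperparameter value with the best
# mean score across CV folds, then train a single final model on the
# whole dataset (no held-out fold), as a final classifier would be.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# "best" parameter across folds: highest mean CV accuracy over candidates
candidate_cs = [0.01, 0.1, 1.0, 10.0]
mean_scores = {
    c: cross_val_score(LogisticRegression(C=c, max_iter=1000), X, y, cv=4).mean()
    for c in candidate_cs
}
best_c = max(mean_scores, key=mean_scores.get)

# final classifier: trained once on the entire dataset with the chosen
# parameters; this is the model you'd save and apply to new samples
final_model = LogisticRegression(C=best_c, max_iter=1000).fit(X, y)
print(best_c, final_model.classes_.tolist())
```

The point of refitting on the full dataset is that the saved model sees every tumor sample, so any later predictions (e.g. on normal samples) come from a model trained on all available labeled data.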