Closed: jjc2718 closed this pull request 2 years ago
I'm not sure of the context of the reviewer comment, but if your goal is to say "the normal samples are out of distribution", then running a quick UMAP or PCA projection to show that the normal samples live in a different world may support your case.
This is a good idea! I'll do this as part of my next PR.
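The suggestion above could be sketched roughly as follows. This is an illustrative example, not code from the repo: the variable names (`tumor_X`, `normal_X`) and the synthetic data are stand-ins, and the idea is just to project both sample groups with PCA and check whether the normal samples separate from the tumor samples.

```python
# Hypothetical sketch of the reviewer's suggestion: project tumor and
# normal samples into 2D with PCA and check whether the normal samples
# form a separate cluster. Data here is synthetic, with the normal
# samples mean-shifted to mimic an out-of-distribution group.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
tumor_X = rng.normal(0.0, 1.0, size=(100, 50))   # 100 tumor samples
normal_X = rng.normal(2.0, 1.0, size=(20, 50))   # 20 shifted normal samples

X = np.vstack([tumor_X, normal_X])
coords = PCA(n_components=2).fit_transform(X)

tumor_coords, normal_coords = coords[:100], coords[100:]
# if normal samples are out of distribution, their centroid should sit
# far from the tumor centroid in the projected space
centroid_gap = np.linalg.norm(
    tumor_coords.mean(axis=0) - normal_coords.mean(axis=0)
)
print(centroid_gap)
```

In practice you'd scatter-plot `coords` colored by tumor/normal status rather than compute a centroid distance, but the distance gives a quick numeric check.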
One of our reviewers asked about our model's predictions on the TCGA samples from healthy tissue, which are present for some cancer types but not others (see
01_explore_data/normal_tissue_samples.ipynb
for some exploration of this). This PR adds code to train and save a model on the whole (tumor) dataset for a given gene, and a notebook (
07_train_final_classifiers/predict_normal.ipynb
) to apply these trained models to the normal samples and compare their predictions with the predictions made on tumor samples. For now, we just did this for the genes we looked at in the multi-omics analysis: TP53, KRAS, EGFR, SETD2, PIK3CA, and IDH1. In most cases we do see that normal samples are predicted to have a low-ish probability of mutation. Here are the results for TP53 as an example:
So predictions for the normal samples are generally lower than they are for the mutated samples (true positives), but not as low as they are for the non-mutated tumor samples (true negatives). This makes some sense: the normal samples are "out of distribution" (they weren't present in the training set and probably don't closely resemble any of the tumor samples), so we'd expect their predicted mutation probabilities to sit closer to 0.5 than those of the "true" negative tumor samples that we fit the model on.
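The comparison described above can be sketched as follows. This is a hedged, self-contained illustration rather than the repo's actual pipeline: the classifier, feature counts, and synthetic data are all assumptions chosen to reproduce the qualitative pattern (normal samples scoring between the two tumor groups).

```python
# Illustrative sketch (not the repo's code): fit a classifier on tumor
# samples only, then apply it to held-out "normal" samples and compare
# mean predicted mutation probabilities across the three groups.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
# synthetic tumor data: mutated samples shifted relative to non-mutated
X_mut = rng.normal(1.0, 1.0, size=(80, 20))
X_wt = rng.normal(-1.0, 1.0, size=(80, 20))
X_train = np.vstack([X_mut, X_wt])
y_train = np.array([1] * 80 + [0] * 80)

# normal samples: out of distribution, centered between the two classes
X_normal = rng.normal(0.0, 1.0, size=(15, 20))

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
p_mut = model.predict_proba(X_mut)[:, 1].mean()        # true positives
p_wt = model.predict_proba(X_wt)[:, 1].mean()          # true negatives
p_normal = model.predict_proba(X_normal)[:, 1].mean()  # out of distribution

# expected ordering: normal samples fall between the two tumor groups,
# closer to 0.5 than either
print(p_wt, p_normal, p_mut)
```

With data constructed this way, `p_normal` lands between `p_wt` and `p_mut`, mirroring the pattern described for the TP53 results.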
Code changes:
- `07_train_final_classifiers/train_classifier.py`: script to train a classifier for a single gene/set of parameters on the entire dataset (no cross-validation), and to save the model to `results_dir`
- `mpmp/utilities/param_results_utilities.py`: set of functions to get the "best" parameter choices from a set of results across different outer cross-validation folds
- changes to `mpmp/prediction/cross_validation.py` and `mpmp/prediction/classification.py` to train a single model on the whole dataset
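The workflow those changes support (pick the "best" parameters across outer CV folds, then refit once on everything) might look like this. It's a minimal sketch using scikit-learn, not the repo's actual API; the candidate grid and scoring are assumptions.

```python
# Illustrative sketch: choose the hyperparameter value with the best
# mean score across CV folds, then train a single final model on the
# whole dataset (no held-out fold), as a final classifier would be.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# "best" parameter across folds: highest mean CV accuracy over candidates
candidate_cs = [0.01, 0.1, 1.0, 10.0]
mean_scores = {
    c: cross_val_score(LogisticRegression(C=c, max_iter=1000), X, y, cv=4).mean()
    for c in candidate_cs
}
best_c = max(mean_scores, key=mean_scores.get)

# final classifier: trained once on the entire dataset with the chosen
# parameters; this is the model you'd save and apply to new samples
final_model = LogisticRegression(C=best_c, max_iter=1000).fit(X, y)
print(best_c, final_model.classes_.tolist())
```

The point of refitting on the full dataset is that the saved model sees every tumor sample, so any later predictions (e.g. on normal samples) come from a model trained on all available labeled data.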