The main goal of this PR is to add regression (predicting continuous values) alongside classification (predicting binary labels) for drug response prediction. I've applied it to stratified cross-validation first, as a check that everything is working and that we're generally doing better than our baselines, which we do seem to be.
Here's a plot comparing performance with the true labels against performance with shuffled labels across a few drugs, measured as the Spearman correlation between predictions and true labels:
In most cases the blue boxes are considerably higher than the orange boxes, which is good! Next we'll run the same holdout experiments as before (liquid vs. solid, and single cancer type holdouts) using regression as well.
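For anyone who wants to see the shape of this comparison, here's a minimal sketch of it, not the code in this PR. It assumes synthetic data, a scikit-learn `Ridge` model, and plain `KFold` splits rather than the stratified splits used here; `cv_spearman` is a hypothetical helper name, and evaluating the baseline against the shuffled labels is also an assumption.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold


def cv_spearman(X, y, shuffle_labels=False, n_splits=4, seed=42):
    """Per-fold Spearman correlation between predictions and labels.

    Hypothetical helper for illustration; not a function from this repo.
    """
    if shuffle_labels:
        # Baseline: permuting the labels breaks any feature/label relationship.
        y = np.random.default_rng(seed).permutation(y)
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in cv.split(X):
        model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
        rho, _ = spearmanr(model.predict(X[test_idx]), y[test_idx])
        scores.append(rho)
    return np.array(scores)


# Synthetic stand-in for (features, continuous drug response labels).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=200)

true_scores = cv_spearman(X, y)
shuffled_scores = cv_spearman(X, y, shuffle_labels=True)
print(f"true labels:     mean rho = {true_scores.mean():.3f}")
print(f"shuffled labels: mean rho = {shuffled_scores.mean():.3f}")
```

Spearman correlation only depends on the rank order of the predictions, so it is less sensitive to the scale of the predicted values than a metric like mean squared error.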
Major code changes:
Moved model training code out of pancancer_evaluation/utilities/classify_utilities.py into two new modules, pancancer_evaluation/prediction/classification.py and pancancer_evaluation/prediction/regression.py, for classification and regression model fitting respectively
Added code to 08_cell_line_prediction/download_drug_data.ipynb and pancancer_evaluation/utilities/ccle_data_utilities.py to process and load continuous labels (we're using the log(IC50) values provided by GDSC; here's the Wiki page about IC50)
Plotted results in 08_cell_line_prediction/plot_stratified_drug_regression.ipynb