Really this should probably be two separate PRs -- the added analysis in 01_stratified_classification/nbconverted/plot_univariate_f_dists.py and the changes in 02_cancer_type_classification -- sorry in advance.
The main changes in this PR are the modifications to 02_cancer_type_classification/run_cancer_type_classification.py, which allow us to test the same feature selection methods described in #47 when entire cancer types are held out of the training set. The analysis of those results is in 02_cancer_type_classification/plot_univariate_fs_results.ipynb.
We're planning to look across more genes in the future, but for TP53 and EGFR the results are pretty close to what we expected. For cancer types that are not very related to the training set, selecting by median correlation across cancer types (i.e. selecting features that are "generally" predictive) improves generalization performance vs. selecting by aggregate pan-cancer correlation (which can choose features that are "specifically" predictive, or that are driven strongly by one or a few cancer types).
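To make the difference between the two selection strategies concrete, here's a rough, stdlib-only sketch on toy data. It assumes the univariate score is a one-way ANOVA F-statistic (suggested by the method names, but an assumption; the scoring in run_cancer_type_classification.py may differ). The helper names (`f_statistic`, `top_k`, `select_pancan_f_test`, `select_median_f_test`) and the data are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, median


def f_statistic(values, labels):
    """One-way ANOVA F-statistic of one feature against class labels."""
    groups = defaultdict(list)
    for v, lab in zip(values, labels):
        groups[lab].append(v)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        return 0.0  # not enough classes/samples to score this feature
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def top_k(scores, k):
    """Indices of the k highest-scoring features."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def select_pancan_f_test(X, y, cancer_types, k):
    """Score each feature once, on all samples pooled across cancer types."""
    scores = [f_statistic([row[j] for row in X], y) for j in range(len(X[0]))]
    return top_k(scores, k)


def select_median_f_test(X, y, cancer_types, k):
    """Score each feature within each cancer type, then take the median."""
    scores = []
    for j in range(len(X[0])):
        per_type = []
        for ct in sorted(set(cancer_types)):
            idx = [i for i, c in enumerate(cancer_types) if c == ct]
            per_type.append(f_statistic([X[i][j] for i in idx], [y[i] for i in idx]))
        scores.append(median(per_type))
    return top_k(scores, k)


# Toy data (hypothetical): 3 cancer types, 4 samples each, 2 features.
cancer_types = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
y = [0, 0, 1, 1] * 3
# Feature 0: separates the classes strongly, but only in cancer type A.
specific = [0.0, 1.0, 20.0, 21.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
# Feature 1: separates the classes modestly in every cancer type, with a
# per-cancer-type baseline shift that weakens the pooled signal.
general = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0, 20.0, 21.0, 22.0, 23.0]
X = [[s, g] for s, g in zip(specific, general)]

print(select_pancan_f_test(X, y, cancer_types, 1))  # picks feature 0
print(select_median_f_test(X, y, cancer_types, 1))  # picks feature 1
```

On this toy data the pooled score prefers the feature driven by a single cancer type, while the median score prefers the feature that is modestly predictive in every cancer type.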
Here's an example: compare the "train pan-cancer" (test cancer present in training set) and "train all other cancers" (test cancer left out of training set) settings for EGFR. We can see that pancan_f_test (green box) performance decreases in the latter, while median_f_test (red box) is more resilient to the test cancer type being dropped out.
PIK3CA, however, doesn't really follow this pattern: