greenelab / pancancer-evaluation

Evaluating genome-wide prediction of driver mutations using pan-cancer data
BSD 3-Clause "New" or "Revised" License

Univariate feature selection, held-out cancer types #48

Closed jjc2718 closed 1 year ago

jjc2718 commented 1 year ago

Really this should probably be two separate PRs -- the added analysis in 01_stratified_classification/nbconverted/plot_univariate_f_dists.py and the changes in 02_cancer_type_classification -- sorry in advance.

The main changes in this PR are to 02_cancer_type_classification/run_cancer_type_classification.py, which now lets us test the same feature selection methods described in #47 when entire cancer types are held out of the training set. The analysis of those results is in 02_cancer_type_classification/plot_univariate_fs_results.ipynb.
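For readers who haven't seen the script, here is a minimal sketch of what holding out an entire cancer type means. This is not the code in run_cancer_type_classification.py; the function name `holdout_cancer_type_split` and the toy labels are hypothetical and only illustrate the idea.

```python
import numpy as np

def holdout_cancer_type_split(cancer_types, held_out):
    """Return boolean (train_mask, test_mask) arrays.

    Every sample of `held_out` goes to the test set, so the model never
    sees that cancer type during training.
    """
    cancer_types = np.asarray(cancer_types)
    test_mask = cancer_types == held_out
    if not test_mask.any():
        raise ValueError(f"no samples with cancer type {held_out!r}")
    return ~test_mask, test_mask

# Toy labels; "A", "B", "C" are placeholders, not real cancer types.
cancer_types = np.array(["A", "A", "B", "C", "B", "C"])
train_mask, test_mask = holdout_cancer_type_split(cancer_types, "C")
```

Feature selection would then be fit on the rows in `train_mask` only, so the held-out cancer type influences neither the selected features nor the classifier.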

We're planning to look across more genes in the future, but for TP53 and EGFR, the results are pretty close to what we expected. For cancer types that are not very related to the training set, selecting by median correlation across cancer types (i.e. selecting features that are "generally" predictive) improves generalization performance vs. selecting by aggregate pan-cancer correlation (which can choose features that are "specifically" predictive, or driven strongly by one or a few cancer types).
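To make the difference between the two selection strategies concrete, here is a rough sketch on synthetic data. It assumes the per-feature score is a one-way ANOVA F-statistic, computed by hand so the example only needs NumPy (scikit-learn's `f_classif` computes the same statistic). The function names and the toy data are hypothetical, and the actual implementation in the repo may differ.

```python
import numpy as np

def f_scores(X, y):
    """One-way ANOVA F-statistic of each column of X against the labels y."""
    classes = np.unique(y)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        ss_between += Xc.shape[0] * (class_mean - grand_mean) ** 2
        ss_within += ((Xc - class_mean) ** 2).sum(axis=0)
    df_between = len(classes) - 1
    df_within = X.shape[0] - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)

def select_pancan(X, y, k):
    """Top-k features by F-statistic computed on all samples pooled together."""
    return np.argsort(-f_scores(X, y))[:k]

def select_median(X, y, cancer_types, k):
    """Top-k features by the median of per-cancer-type F-statistics."""
    per_type = []
    for ct in np.unique(cancer_types):
        mask = cancer_types == ct
        # Skip cancer types where only one label is present (F is undefined).
        if len(np.unique(y[mask])) < 2:
            continue
        per_type.append(f_scores(X[mask], y[mask]))
    return np.argsort(-np.median(np.vstack(per_type), axis=0))[:k]

# Synthetic toy data (not from the repository): 3 cancer types, 40 samples each.
rng = np.random.default_rng(0)
cancer_types = np.repeat(np.array(["A", "B", "C"]), 40)
y = np.tile([0, 1], 60)  # alternating labels, balanced within each type
X = rng.normal(scale=0.1, size=(120, 5))
# Feature 0: strong signal, but only within cancer type A.
X[:, 0] += 10 * y * (cancer_types == "A")
# Feature 1: modest signal in every type, on top of type-specific baselines.
X[:, 1] += y + np.repeat([0.0, 5.0, 10.0], 40)

pancan_top = select_pancan(X, y, k=1)
median_top = select_median(X, y, cancer_types, k=1)
```

On this toy data, pooled selection picks feature 0 (driven entirely by type A), while median-based selection picks feature 1. The type-specific baselines inflate the pooled within-class variance of feature 1, which hides a signal that is clear inside every individual cancer type.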

Here's an example: compare the "train pan-cancer" (test cancer present in training set) and "train all other cancers" (test cancer left out of training set) settings for EGFR. We can see that pancan_f_test (green box) performance decreases in the latter setting, while median_f_test (red box) is more resilient to the test cancer type being dropped from the training set.

image

PIK3CA, however, doesn't really follow this pattern:

image
