greenelab / pancancer-evaluation

Evaluating genome-wide prediction of driver mutations using pan-cancer data
BSD 3-Clause "New" or "Revised" License

Start working on LASSO experiments across all genes #65

Closed jjc2718 closed 1 year ago

jjc2718 commented 1 year ago

Following up on #64, this PR continues to build the infrastructure to run and analyze similar experiments, varying model size/LASSO penalty, across multiple cancer genes. In this case, I ran the same experiments for TP53, EGFR, ATRX, and CDKN2A (all common cancer drivers with diverse cancer types included in the training set).

The notebook at 02_cancer_type_classification/lasso_range_gene.ipynb breaks down performance across cancer types for a single gene. Here's an example for CDKN2A:

image

This shows that for most cancer types, models with lower LASSO parameters (less regularization, more features) perform at least as well as models with higher LASSO parameters (more regularization, fewer features), even when the test cancer type is held out of the training set.
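
To make the "less regularization, more features" relationship concrete, here is a minimal sketch. It is not the code from this PR: it uses the textbook special case of squared-error LASSO with an orthonormal design, where each coefficient is just the soft-thresholded least-squares coefficient. The classifiers in these experiments are fit by an iterative solver, so the closed form does not carry over; only the general tendency (smaller penalty, more nonzero coefficients) is the point. The coefficient values and penalties are made up for illustration.

```python
def soft_threshold(coef, penalty):
    """Closed-form LASSO coefficient for an orthonormal design.

    Minimizing 0.5 * ||y - Xb||^2 + penalty * ||b||_1 with X^T X = I
    shrinks each least-squares coefficient toward zero by `penalty`
    and sets it to exactly zero if it does not exceed the penalty.
    """
    if coef > penalty:
        return coef - penalty
    if coef < -penalty:
        return coef + penalty
    return 0.0


def n_nonzero(coefs, penalty):
    """Count the features that survive a given penalty (the 'model size')."""
    return sum(1 for c in coefs if soft_threshold(c, penalty) != 0.0)


# Hypothetical least-squares coefficients, not taken from any real model.
ols_coefs = [2.0, -1.5, 0.8, -0.3, 0.1]

# Hypothetical penalty grid, from weak to strong regularization.
penalties = [0.05, 0.5, 1.0, 1.8]

model_sizes = {p: n_nonzero(ols_coefs, p) for p in penalties}
for p, size in model_sizes.items():
    print(f"penalty={p}: {size} nonzero coefficients")
```

In this toy setting the model shrinks from 5 features to 1 as the penalty grows, which is the axis the plots above vary.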

By contrast, the notebook at 02_cancer_type_classification/lasso_range_all.ipynb summarizes performance across all of the genes run so far:

image

This shows that across these 4 genes, the number of features in the model tends to correlate positively with generalization performance (i.e. models with more features generally generalize better). The next step is to scale this up to the set of several hundred driver genes from my last paper, and to find a clean way to analyze and visualize the results (i.e. not a box plot for each of >100 genes...).
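
One possible way to avoid a box plot per gene (a sketch only, not something implemented in this PR) is to reduce each gene to a single number, such as the Spearman rank correlation between the number of nonzero features and held-out performance, and then plot the distribution of that number across genes. The record layout `(gene, n_features, score)`, the function names, and the placeholder data below are all hypothetical.

```python
from collections import defaultdict


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg_rank
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; NaN if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def per_gene_correlation(records):
    """Map each gene to the Spearman correlation of model size vs. score."""
    by_gene = defaultdict(lambda: ([], []))
    for gene, n_features, score in records:
        by_gene[gene][0].append(n_features)
        by_gene[gene][1].append(score)
    return {gene: spearman(xs, ys) for gene, (xs, ys) in by_gene.items()}


# Placeholder records, invented purely to exercise the functions.
records = [
    ("GENE_A", 10, 0.55), ("GENE_A", 100, 0.60), ("GENE_A", 1000, 0.70),
    ("GENE_B", 10, 0.80), ("GENE_B", 100, 0.75), ("GENE_B", 1000, 0.65),
]

correlations = per_gene_correlation(records)
for gene, rho in sorted(correlations.items()):
    print(f"{gene}: rho={rho:+.2f}")
```

With several hundred genes, a single histogram or sorted bar chart of these per-gene values would show at a glance how often more features go with better generalization. Rank correlation is used here because the penalty grid is typically spaced on a log scale, so only the ordering of model sizes matters.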
