greenelab / pancancer-evaluation

Evaluating genome-wide prediction of driver mutations using pan-cancer data
BSD 3-Clause "New" or "Revised" License

Rerun mutation prediction + compare optimizers #71

Closed jjc2718 closed 1 year ago

jjc2718 commented 1 year ago

In #68, in addition to what I listed in the PR description, I also tried running MSI prediction with a different sklearn interface/optimizer. We've generally been running most experiments using SGDClassifier, which optimizes the logistic loss using stochastic gradient descent. Instead, I tried LogisticRegression with an L1 penalty and the liblinear solver, which uses a coordinate descent algorithm that's supposed to converge quickly but can scale poorly to datasets with many samples.

Performance for MSI prediction was generally better with LogisticRegression, though not by much, so in this PR I reran the mutation prediction experiments from #65 using LogisticRegression and compared the results between the two optimizers in the notebooks 02_cancer_type_classification/lasso_range_analysis/compare_optimizers_all.ipynb and 02_cancer_type_classification/lasso_range_analysis/compare_optimizers_gene.ipynb.

In general, the liblinear optimizer seems to give a better fit for almost every gene:

image

In this plot, each sample is a gene/cancer type combination, and a positive value means liblinear performed better than SGD when each optimizer uses its own best-performing LASSO parameters.
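The plotted quantity can be sketched roughly as follows. The record layout, gene and cancer type names, and scores are hypothetical, and the notebooks may pick the best parameter differently (for example on a held-out split); this version simply takes the best score per optimizer for each gene/cancer type combination and subtracts.

```python
from collections import defaultdict

# Hypothetical records: (gene, cancer_type, optimizer, lasso_param, score).
records = [
    ("GENE_A", "TYPE_1", "sgd", 0.001, 0.60),
    ("GENE_A", "TYPE_1", "sgd", 0.01, 0.65),
    ("GENE_A", "TYPE_1", "liblinear", 0.1, 0.70),
    ("GENE_A", "TYPE_1", "liblinear", 1.0, 0.68),
    ("GENE_B", "TYPE_2", "sgd", 0.001, 0.80),
    ("GENE_B", "TYPE_2", "sgd", 0.01, 0.75),
    ("GENE_B", "TYPE_2", "liblinear", 0.1, 0.78),
    ("GENE_B", "TYPE_2", "liblinear", 1.0, 0.74),
]


def best_score_differences(records):
    """Return liblinear-minus-sgd best score per (gene, cancer_type)."""
    best = defaultdict(dict)
    for gene, cancer_type, optimizer, _param, score in records:
        key = (gene, cancer_type)
        previous = best[key].get(optimizer, float("-inf"))
        # Keep the best score over all LASSO parameters for this optimizer.
        best[key][optimizer] = max(score, previous)
    # Only compare combinations that have results for both optimizers.
    return {
        key: scores["liblinear"] - scores["sgd"]
        for key, scores in best.items()
        if "liblinear" in scores and "sgd" in scores
    }


differences = best_score_differences(records)
for key, diff in sorted(differences.items()):
    print(key, "liblinear better" if diff > 0 else "sgd better or tied")
```

With these made-up numbers, the GENE_A combination gets a positive difference (liblinear better) and the GENE_B combination a negative one.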
