greenelab / pancancer-evaluation

Evaluating genome-wide prediction of driver mutations using pan-cancer data
BSD 3-Clause "New" or "Revised" License

Univariate correlation comparison #46

Closed: jjc2718 closed this 1 year ago

jjc2718 commented 2 years ago

Following on from #45, we've been thinking a bit more about feature selection, particularly in pan-cancer models or models that integrate data from various tissues/sources.

In this PR, I look at whether pan-cancer correlations between gene expression and mutation status are driven primarily by a strong correlation in a single cancer type, or by weak correlations across all cancer types. I'm curious whether the genes our classifiers select fall mostly into one bucket or the other, or whether the distinction doesn't matter much.

The analysis is implemented in 06_correlation_analysis.ipynb. In general, we see that highly correlated pan-cancer genes can fall into either category: some are driven mostly by a single highly correlated cancer type, while for others the correlation is spread across two or more cancer types.
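
For anyone skimming the thread, here is a rough sketch of the kind of comparison described above: for one gene, compute the pan-cancer correlation between expression and mutation status, then the same correlation within each cancer type. This is not the notebook's code. The function names (`pearson`, `correlation_breakdown`) and the toy data are made up for illustration, and plain Pearson correlation against a 0/1 mutation label is an assumption about the metric.

```python
import math
from collections import defaultdict


def pearson(x, y):
    """Pearson correlation; returns nan if either input is constant."""
    n = len(x)
    if n < 2:
        return float("nan")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def correlation_breakdown(expression, mutated, cancer_types):
    """Return (pan-cancer r, {cancer type: within-type r}) for one gene."""
    pan_r = pearson(expression, mutated)
    # Group samples by cancer type, keeping expression and labels aligned.
    groups = defaultdict(lambda: ([], []))
    for e, m, c in zip(expression, mutated, cancer_types):
        groups[c][0].append(e)
        groups[c][1].append(m)
    per_type = {c: pearson(xs, ys) for c, (xs, ys) in groups.items()}
    return pan_r, per_type


# Toy data (hypothetical): expression tracks mutation status in type "A" only.
expression = [1.0, 1.2, 3.0, 3.1, 2.0, 2.1, 2.0, 2.1]
mutated = [0, 0, 1, 1, 0, 1, 1, 0]
cancer_types = ["A"] * 4 + ["B"] * 4

pan_r, per_type = correlation_breakdown(expression, mutated, cancer_types)
print(f"pan-cancer r = {pan_r:.3f}")
for cancer_type, r in sorted(per_type.items()):
    print(f"  {cancer_type}: r = {r:.3f}")
```

In this toy case the pan-cancer correlation is moderately positive even though it comes entirely from type "A", which is the "single cancer type" bucket. A gene in the other bucket would show weaker but consistent within-type correlations across several cancer types.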
