Following on from #45, we've been thinking a bit more about feature selection, particularly in pan-cancer models or models that integrate data from various tissues/sources.
In this PR, I wanted to look at whether correlations between gene expression and mutation status are primarily driven by a strong correlation in a single cancer type, or by weaker correlations spread across all cancer types. I'm curious whether the genes that our classifiers select fall mostly into one bucket or the other, or whether it doesn't seem to matter much.
The analysis is implemented in 06_correlation_analysis.ipynb. In general, we see that highly correlated pan-cancer genes can fall into either category: some are mostly driven by a single highly correlated cancer type, and some have the correlation spread across two or more cancer types.
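For anyone reading this without opening the notebook, here is a minimal sketch of the kind of comparison described above, run on synthetic data. It is not the notebook's code: the function name `correlation_breakdown`, the column names, the cancer type labels, and the use of Pearson correlation are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def correlation_breakdown(df, expr_col, label_col="mutated", group_col="cancer_type"):
    """Return (pan-cancer correlation, per-cancer-type correlations) for one gene.

    Hypothetical helper: uses Pearson correlation between expression and a
    0/1 mutation label, which may differ from what the notebook does.
    """
    pan_cancer = df[expr_col].corr(df[label_col])
    per_type = pd.Series(
        {name: g[expr_col].corr(g[label_col]) for name, g in df.groupby(group_col)}
    )
    return pan_cancer, per_type


# Synthetic data only: three made-up cancer types, 200 samples each.
rng = np.random.default_rng(0)
n_per_type = 200
cancer_types = ["type_A", "type_B", "type_C"]

df = pd.DataFrame({
    "cancer_type": np.repeat(cancer_types, n_per_type),
    "mutated": rng.integers(0, 2, size=n_per_type * len(cancer_types)),
})
noise = rng.normal(size=(len(df), 2))
in_type_a = (df["cancer_type"] == "type_A").to_numpy()

# gene_single: expression tracks mutation status in type_A only.
df["gene_single"] = noise[:, 0] + 2.0 * df["mutated"] * in_type_a
# gene_spread: a weaker effect of mutation status in every cancer type.
df["gene_spread"] = noise[:, 1] + 1.0 * df["mutated"]

single_pan, single_by_type = correlation_breakdown(df, "gene_single")
spread_pan, spread_by_type = correlation_breakdown(df, "gene_spread")

for gene, pan, by_type in [
    ("gene_single", single_pan, single_by_type),
    ("gene_spread", spread_pan, spread_by_type),
]:
    print(f"{gene}: pan-cancer r = {pan:.2f}")
    print(by_type.round(2).to_string())
```

Both synthetic genes show a positive pan-cancer correlation, but the per-type breakdown separates them: `gene_single` is carried by one cancer type, while `gene_spread` shows a moderate correlation in every type.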