greenelab / mpmp

Multimodal Pan-cancer Mutation Prediction
BSD 3-Clause "New" or "Revised" License

Fix for "disappearing genes" issue #48

Closed jjc2718 closed 3 years ago

jjc2718 commented 3 years ago

This addresses an issue with certain genes that were present in the results for some seeds/data types, but not others. See #44 for some background and examples.

After exploring this, I found a few separate issues, all of which are fixed in this PR.

As a result of these changes, we end up with fewer valid genes overall (85 for gene expression, 84 for methylation, and 75 across all data types; see 01_explore_data/missing_genes.ipynb), but these genes are now consistent across data types and across signal/shuffled models within a given experiment. Our conclusions should therefore be more robust and less prone to spurious signal and small-sample-size effects.
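For illustration, here is a minimal sketch of the idea of keeping only genes that are valid in every data type. It is not the code from this PR: the function names (`consistent_genes`, `dropped_genes`) and the toy gene lists are hypothetical, and it assumes the valid genes for each data type are already available as plain lists.

```python
def consistent_genes(valid_by_group):
    """Return the genes that are valid in every group (e.g. every data type)."""
    gene_sets = [set(genes) for genes in valid_by_group.values()]
    if not gene_sets:
        return set()
    # A gene survives only if it appears in all groups.
    return set.intersection(*gene_sets)


def dropped_genes(valid_by_group):
    """For each group, list the genes removed by taking the intersection."""
    keep = consistent_genes(valid_by_group)
    return {
        group: sorted(set(genes) - keep)
        for group, genes in valid_by_group.items()
    }


# Toy example (hypothetical values, not the real gene sets from the experiments).
valid = {
    "expression": ["TP53", "KRAS", "BRAF", "PTEN"],
    "methylation": ["TP53", "KRAS", "PTEN", "IDH1"],
}
shared = consistent_genes(valid)
removed = dropped_genes(valid)
```

The same intersection could be taken across seeds or across signal/shuffled models, so that every comparison within an experiment uses an identical gene list.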

Looking at the genes that get dropped in the new experiments, most of them seem to be low signal/hard to predict (near the origin in our volcano plots):

[image: volcano plot showing the dropped genes]

I'm currently re-running all of our experiments with these updated gene sets. The updated results will go into a separate future PR (although I don't expect them to change our conclusions).