Shuffle labels for baseline in each cancer type independently

greenelab / mpmp

Multimodal Pan-cancer Mutation Prediction

BSD 3-Clause "New" or "Revised" License


We recently changed how we calculate the baseline that we compare our mutation prediction models against: labels are now shuffled separately within each cancer type. This makes the baseline perform slightly better, but it didn't end up changing the comparison between data types much.

Main code changes in this PR:

- 02_classify_mutations/explore_shuffled_cancer_type.ipynb: notebook comparing results from the old shuffling scheme against the new results; it shows that performance for most genes got worse, as we expected
- Other notebooks in 02_classify_mutations and in 05_classify_mutations_multimodal were just rerun with new data; these don't need review
- Added slurm_scripts directories for the scripts that we used to run things on a Slurm cluster (these are just modified versions of the existing Bash scripts, so they don't need extensive review)
- Added code to mpmp/prediction/cross_validation.py to shuffle labels independently for each cancer type
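
For readers who want a concrete picture of the last item, here is a minimal sketch of shuffling labels independently within each cancer type. It is not the code in mpmp/prediction/cross_validation.py: the function name `shuffle_labels_within_groups`, the toy labels, and the use of plain lists with the standard-library `random` module are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def shuffle_labels_within_groups(labels, groups, seed=42):
    """Return a copy of `labels` permuted separately inside each group.

    Hypothetical helper for illustration only; `groups` holds the cancer
    type of each sample, in the same order as `labels`.
    """
    rng = random.Random(seed)

    # Collect the sample positions that belong to each cancer type.
    positions = defaultdict(list)
    for i, group in enumerate(groups):
        positions[group].append(i)

    shuffled = list(labels)
    for idx in positions.values():
        # Permute only the labels of this cancer type, then write them
        # back to the same positions, so labels never cross cancer types.
        group_labels = [labels[i] for i in idx]
        rng.shuffle(group_labels)
        for i, label in zip(idx, group_labels):
            shuffled[i] = label
    return shuffled


# Toy example (made-up data): 3 of 4 BRCA samples and 1 of 4 LUAD samples mutated.
cancer_types = ["BRCA", "BRCA", "BRCA", "BRCA", "LUAD", "LUAD", "LUAD", "LUAD"]
labels = [1, 1, 1, 0, 0, 0, 0, 1]
shuffled = shuffle_labels_within_groups(labels, cancer_types)
```

Because each cancer type is permuted on its own, the number of positive labels per cancer type is unchanged; only the assignment of labels to samples within a type is randomized.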

We also made some changes to how survival prediction works, but those will be in a separate PR.

Is there a reason that you'd expect the baseline performance to be better when shuffling within cancer type? I feel like I should see the intuition but it's escaping me

In our model we have a covariate/fixed effect for cancer type, so when we shuffle within each cancer type the model can typically do fairly well just by predicting a positive label randomly with p = (frequency of mutation in the given cancer type). When we shuffle across cancer types, the mutation frequency in each cancer type will tend to end up around the pan-cancer mutation frequency of the given gene, so this strategy doesn't work as well.
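
The difference between the two schemes can be checked on toy data. The sketch below uses made-up numbers (two hypothetical cancer types, "A" and "B", with 1000 samples each and mutation rates of 50% and 2%) and only the standard library; it compares per-cancer-type mutation frequencies after shuffling across all samples versus within each cancer type.

```python
import random

rng = random.Random(0)

# Made-up data: samples are ordered by cancer type, A first and then B.
cancer_types = ["A"] * 1000 + ["B"] * 1000
labels = [1] * 500 + [0] * 500 + [1] * 20 + [0] * 980


def frequency_by_type(labels, cancer_types):
    """Fraction of positive (mutated) labels in each cancer type."""
    totals, positives = {}, {}
    for label, ct in zip(labels, cancer_types):
        totals[ct] = totals.get(ct, 0) + 1
        positives[ct] = positives.get(ct, 0) + label
    return {ct: positives[ct] / totals[ct] for ct in totals}


# Old scheme: one permutation across all samples, ignoring cancer type.
across = list(labels)
rng.shuffle(across)

# New scheme: permute labels separately inside each cancer type.
# Extending in A-then-B order works only because the samples are sorted that way.
within = []
for ct in ("A", "B"):
    block = [label for label, c in zip(labels, cancer_types) if c == ct]
    rng.shuffle(block)
    within.extend(block)

freq_across = frequency_by_type(across, cancer_types)
freq_within = frequency_by_type(within, cancer_types)
pan_cancer_frequency = sum(labels) / len(labels)  # (500 + 20) / 2000 = 0.26
```

Here `freq_within` keeps the original per-type frequencies (0.5 for A, 0.02 for B), so cancer type still carries information about the shuffled label. In `freq_across`, both types drift toward the pan-cancer frequency of 0.26, so the cancer type covariate tells the model very little.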

If you're planning on running all your scripts through Slurm, you might be able to save yourself some effort with Snakemake or Nextflow. They'll handle Slurm job dispatching for you and will save you from having to write the same gene list/conda init code over and over.

Good to know! I probably won't get to doing this for this repo, but I'll think about trying a workflow manager for my next project. It would definitely simplify some of the boilerplate code.

Shuffle labels for baseline in each cancer type independently #65