Closed jjc2718 closed 2 years ago
Is there a reason that you'd expect the baseline performance to be better when shuffling within cancer type? I feel like I should see the intuition but it's escaping me
In our model we have a covariate/fixed effect for cancer type, so when we shuffle by cancer type the model can typically do fairly well just by predicting a positive label randomly with p = (frequency of mutation in the given cancer type). When we shuffle across cancer types, mutation frequencies in each cancer type will tend to end up around the pan-cancer mutation frequency of the given gene, so this doesn't work as well.
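To make that intuition concrete, here is a small sketch on made-up data (two hypothetical cancer types, "A" and "B", with invented mutation frequencies; none of this is from the repo). Shuffling within each type leaves the per-type mutation frequency unchanged, while shuffling across types pulls both toward the pooled frequency, which is the signal the cancer type covariate can otherwise exploit.

```python
import random


def per_type_frequency(labels, groups):
    """Fraction of positive labels within each group."""
    totals, positives = {}, {}
    for y, g in zip(labels, groups):
        totals[g] = totals.get(g, 0) + 1
        positives[g] = positives.get(g, 0) + y
    return {g: positives[g] / totals[g] for g in totals}


rng = random.Random(0)

# Invented example: gene mutated in 60% of type A and 5% of type B samples.
groups = ["A"] * 1000 + ["B"] * 1000
labels = [1] * 600 + [0] * 400 + [1] * 50 + [0] * 950

# Shuffle across cancer types: one permutation over all samples.
across = list(labels)
rng.shuffle(across)

# Shuffle within cancer types: permute each type's block separately.
block_a = labels[:1000]
block_b = labels[1000:]
rng.shuffle(block_a)
rng.shuffle(block_b)
within = block_a + block_b

freq_true = per_type_frequency(labels, groups)
freq_within = per_type_frequency(within, groups)
freq_across = per_type_frequency(across, groups)

print("true:  ", freq_true)
print("within:", freq_within)
print("across:", freq_across)
```

The within-type frequencies match the true ones exactly, so a model that only knows the cancer type can still guess labels at the right rate; after shuffling across types both frequencies sit near the pooled rate instead.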
If you're planning on running all your scripts through slurm, you might be able to save yourself some effort with Snakemake or Nextflow. They'll handle slurm job dispatching for you and will save you from having to write the same gene list/conda init code over and over
Good to know! I probably won't get to doing this for this repo, but I'll think about trying a workflow manager for my next project. It would definitely simplify some of the boilerplate code.
Recently, we decided to change the way we're calculating the baseline that we're comparing our mutation prediction models against. Shuffling the labels separately for each cancer type makes the baseline perform slightly better, but it didn't end up affecting the comparison between data types too much.
Main code changes in this PR:
- `02_classify_mutations/explore_shuffled_cancer_type.ipynb`: notebook to compare results with the old shuffling scheme against new results; shows that performance for most genes got worse, as we expected
- `02_classify_mutations` and `05_classify_mutations_multimodal`: just rerun with new data, these don't need review
- `slurm_scripts`: directories for scripts that we used to run things on a Slurm cluster (these are just modified versions of the existing Bash scripts, don't need extensive review)
- `mpmp/prediction/cross_validation.py`: updated to shuffle data independently for each cancer type

We also made some changes to how survival prediction works, but those will be in a separate PR.
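For readers who want a picture of what shuffling labels independently for each cancer type means, the following is a minimal standalone sketch. It is not the code in `mpmp/prediction/cross_validation.py`; the function name `shuffle_labels_within_groups`, its arguments, and the toy data are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def shuffle_labels_within_groups(labels, groups, seed=42):
    """Return a copy of `labels` permuted separately within each group.

    Hypothetical helper: `groups` holds one group name (e.g. a cancer
    type) per sample, aligned with `labels`.
    """
    rng = random.Random(seed)

    # Collect the sample positions that belong to each group.
    positions = defaultdict(list)
    for i, group in enumerate(groups):
        positions[group].append(i)

    shuffled = list(labels)
    for idx in positions.values():
        # Permute only this group's labels, then write them back
        # into the same positions.
        group_labels = [labels[i] for i in idx]
        rng.shuffle(group_labels)
        for i, y in zip(idx, group_labels):
            shuffled[i] = y
    return shuffled


# Toy data: mutation labels and the cancer type of each sample.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
cancer_types = ["BRCA", "BRCA", "BRCA", "BRCA", "LUAD", "LUAD", "LUAD", "LUAD"]
shuffled = shuffle_labels_within_groups(labels, cancer_types)
print(shuffled)
```

Because labels never move between groups, the number of mutated samples in each cancer type is the same before and after shuffling.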