greenelab / mpmp

Multimodal Pan-cancer Mutation Prediction
BSD 3-Clause "New" or "Revised" License

Shuffle labels for baseline in each cancer type independently #65

Closed jjc2718 closed 2 years ago

jjc2718 commented 2 years ago

Recently, we decided to change how we calculate the baseline that we compare our mutation prediction models against: labels are now shuffled separately within each cancer type, rather than across all samples. This makes the baseline perform slightly better, but it didn't end up changing the comparison between data types much.

Main code changes in this PR:

We also made some changes to how survival prediction works, but those will be in a separate PR.

jjc2718 commented 2 years ago

Is there a reason that you'd expect the baseline performance to be better when shuffling within cancer type? I feel like I should see the intuition but it's escaping me

In our model we have a covariate/fixed effect for cancer type, so when we shuffle within each cancer type the model can typically do fairly well just by predicting a positive label randomly with p = (frequency of mutation in the given cancer type). When we shuffle across cancer types, the mutation frequency in each cancer type tends to end up around the pan-cancer mutation frequency of the given gene, so the cancer type covariate no longer carries that signal and this doesn't work as well.
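To make that intuition concrete, here's a minimal sketch in plain Python. It is not the code from this PR: the function names (`shuffle_labels`, `mutation_frequency`) and the toy data are hypothetical, made up only to show how the two shuffling schemes differ in what they do to per-cancer-type mutation frequencies.

```python
import random
from collections import defaultdict

def shuffle_labels(labels, cancer_types, within_cancer_type=True, seed=0):
    """Return a shuffled copy of `labels` (hypothetical helper, not from the repo).

    If within_cancer_type is True, labels are permuted only among samples
    that share a cancer type; otherwise they are permuted across all samples.
    """
    rng = random.Random(seed)
    shuffled = list(labels)
    if not within_cancer_type:
        rng.shuffle(shuffled)
        return shuffled
    # Group sample indices by cancer type, then permute labels within each group.
    groups = defaultdict(list)
    for i, ct in enumerate(cancer_types):
        groups[ct].append(i)
    for idx in groups.values():
        group_labels = [labels[i] for i in idx]
        rng.shuffle(group_labels)
        for i, y in zip(idx, group_labels):
            shuffled[i] = y
    return shuffled

def mutation_frequency(labels, cancer_types):
    """Fraction of positive (mutated) labels per cancer type."""
    pos, total = defaultdict(int), defaultdict(int)
    for y, ct in zip(labels, cancer_types):
        pos[ct] += y
        total[ct] += 1
    return {ct: pos[ct] / total[ct] for ct in total}

# Toy data (made up): gene mutated in 80% of type "A" and 10% of type "B",
# so the pan-cancer mutation frequency is 45%.
cancer_types = ["A"] * 500 + ["B"] * 500
labels = [1] * 400 + [0] * 100 + [1] * 50 + [0] * 450

true_freq = mutation_frequency(labels, cancer_types)
within_freq = mutation_frequency(
    shuffle_labels(labels, cancer_types, within_cancer_type=True), cancer_types
)
across_freq = mutation_frequency(
    shuffle_labels(labels, cancer_types, within_cancer_type=False), cancer_types
)

print("original:      ", true_freq)
print("within-type:   ", within_freq)   # identical to the original frequencies
print("across-types:  ", across_freq)   # both types drift toward the pan-cancer rate
```

Shuffling within cancer type leaves each type's mutation frequency exactly unchanged, so a model with a cancer type covariate can still exploit it. Shuffling across types pulls every type toward the pan-cancer frequency, which removes that signal.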

If you're planning on running all your scripts through Slurm, you might be able to save yourself some effort with Snakemake or Nextflow. They'll handle Slurm job dispatching for you and will save you from having to write the same gene list/conda init code over and over.

Good to know! I probably won't get to doing this for this repo, but I'll think about trying a workflow manager for my next project. It would definitely simplify some of the boilerplate code.