greenelab / pancancer-evaluation

Evaluating genome-wide prediction of driver mutations using pan-cancer data
BSD 3-Clause "New" or "Revised" License

Optimizations for cross-cancer experiment scripts #34

Closed jjc2718 closed 3 years ago

jjc2718 commented 3 years ago

No new results in this PR, just changes to how I'm running some of the experiments. I've verified that the results of the experiments are unchanged, but the scripts now run substantially faster.

Essentially, the cross-cancer experiments I was running before filled in the entries of a matrix by training a new model for each entry. I realized this was wasteful: every entry in a given row uses a model trained on the same gene/cancer type, so we can train the model once per row and then evaluate it many times to fill in that row (prediction/model evaluation is much faster than training).
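
As a rough illustration of the row-wise idea (not the actual code in this repo), here is a minimal sketch. The names `fill_matrix`, `toy_train`, and `toy_evaluate` are hypothetical stand-ins, and the "model" is just a dict so the example runs without any dependencies:

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical identifier for a matrix row/column: (gene, cancer type).
Identifier = Tuple[str, str]


def fill_matrix(
    identifiers: List[Identifier],
    train: Callable[[Identifier], dict],
    evaluate: Callable[[dict, Identifier], float],
) -> Dict[Tuple[Identifier, Identifier], float]:
    """Fill a train-by-test matrix, training once per row."""
    results = {}
    for train_id in identifiers:
        # Expensive step: done once per row instead of once per entry.
        model = train(train_id)
        for test_id in identifiers:
            # Cheap step: reuse the same model for every column in the row.
            results[(train_id, test_id)] = evaluate(model, test_id)
    return results


# Toy stand-ins (assumptions for illustration only).
train_calls: List[Identifier] = []


def toy_train(identifier: Identifier) -> dict:
    train_calls.append(identifier)  # record how often we train
    return {"trained_on": identifier}


def toy_evaluate(model: dict, identifier: Identifier) -> float:
    # Placeholder score; a real experiment would compute a metric here.
    return 1.0 if model["trained_on"] == identifier else 0.5


identifiers = [("TP53", "BRCA"), ("TP53", "LUAD"), ("KRAS", "LUAD")]
matrix = fill_matrix(identifiers, toy_train, toy_evaluate)
print(len(train_calls), len(matrix))  # 3 models trained, 9 entries filled
```

With n identifiers this trains n models rather than n², while still producing all n² matrix entries.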

This is a bit more complicated for the pan-cancer experiments, since we actually need to train different models for some matrix columns depending on which cancer type is held out, but caching pre-trained models avoids redundant training in this case as well.
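
One way the caching could look, as a hedged sketch rather than the implementation in this PR: assume a pan-cancer model is determined by the training gene and the held-out cancer type, and key a cache on that pair. `ModelCache`, `toy_train_pancancer`, and the cache key are all assumptions made for illustration:

```python
from typing import Callable, Dict, List, Tuple


class ModelCache:
    """Hypothetical cache keyed on (train gene, held-out cancer type)."""

    def __init__(self, train: Callable[[str, str], dict]) -> None:
        self._train = train
        self._models: Dict[Tuple[str, str], dict] = {}
        self.n_trained = 0  # number of models actually trained

    def get(self, train_gene: str, held_out_cancer: str) -> dict:
        key = (train_gene, held_out_cancer)
        if key not in self._models:
            # Cache miss: train and remember the model for later columns.
            self._models[key] = self._train(train_gene, held_out_cancer)
            self.n_trained += 1
        return self._models[key]


def toy_train_pancancer(train_gene: str, held_out_cancer: str) -> dict:
    # Stand-in for training on all cancer types except the held-out one.
    return {"gene": train_gene, "held_out": held_out_cancer}


def toy_evaluate(model: dict, test_gene: str, test_cancer: str) -> float:
    # Placeholder score; a real experiment would compute a metric here.
    return 1.0 if model["gene"] == test_gene else 0.5


train_genes = ["TP53", "KRAS"]
# Columns are (test gene, test cancer type); the test cancer type is held out.
columns: List[Tuple[str, str]] = [
    ("TP53", "BRCA"), ("KRAS", "BRCA"), ("TP53", "LUAD"), ("KRAS", "LUAD"),
]

cache = ModelCache(toy_train_pancancer)
pancancer_matrix = {}
for train_gene in train_genes:
    for test_gene, test_cancer in columns:
        # Columns sharing a held-out cancer type reuse the same cached model.
        model = cache.get(train_gene, test_cancer)
        pancancer_matrix[(train_gene, test_gene, test_cancer)] = toy_evaluate(
            model, test_gene, test_cancer
        )

print(cache.n_trained, len(pancancer_matrix))  # 4 models trained, 8 entries
```

Under this assumption, the number of trained models scales with (training genes × held-out cancer types) rather than with the number of matrix entries.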

There are more details about the actual optimizations in the code comments, if you're interested.