COSMIC CGC gene set analysis, part 1

PR description:

For our paper under review, both reviewers asked to see results for a larger set of cancer genes (greenelab/mpmp-manuscript#43). One of them pointed to the COSMIC Cancer Gene Census, which includes more DNA damage repair genes than the Vogelstein et al. gene set we've been using.

This PR lays the foundation for a more in-depth analysis of this gene set. So far I've only run the "all data types" comparison for these genes, and I'll likely rerun all of our analyses in a future PR and combine this gene set with the Vogelstein genes based on the results here.

With the Vogelstein gene set, which is considerably smaller, we saw generally similar performance for expression and methylation. When we expand to the COSMIC gene set, the comparison seems to favor gene expression:

And counting the number of "well-predicted" genes at a p-value cutoff of 0.001 (these numbers were pretty similar between data types for the Vogelstein genes):
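For reference, the count described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `(gene, data_type, p_value)` record layout, the function name `count_well_predicted`, and the strict `<` comparison are all assumptions, and the toy values are made up purely to exercise the function.

```python
from collections import Counter

# Cutoff mentioned in the PR description.
P_CUTOFF = 0.001


def count_well_predicted(results, cutoff=P_CUTOFF):
    """Count genes per data type whose p-value falls below the cutoff.

    `results` is an iterable of (gene, data_type, p_value) tuples.
    This layout is hypothetical; the real results files may differ.
    """
    counts = Counter()
    for _gene, data_type, p_value in results:
        if p_value < cutoff:
            counts[data_type] += 1
    return dict(counts)


# Made-up values for illustration only; these are not results from this PR.
toy_results = [
    ("GENE_A", "expression", 0.0001),
    ("GENE_A", "methylation", 0.02),
    ("GENE_B", "expression", 0.0005),
    ("GENE_B", "methylation", 0.0002),
]
toy_counts = count_well_predicted(toy_results)
```

Whether the p-values are corrected for multiple testing before thresholding is not stated here, so the sketch applies the cutoff to whatever values it is given.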

Code changes:

Added 01_explore_data/explore_cosmic_gene_set.ipynb to download and explore the COSMIC CGC genes
Added code to load oncogene/TSG information for COSMIC genes and run our scripts on them
Visualized results in 02_classify_mutations/plot_all_results.ipynb (worked with minimal changes)
Added some files I've been using on the cluster that weren't previously tracked in git (02_classify_mutations/scripts/slurm_scripts/run_drop_target_vogelstein.sbatch, mpmp/scripts/check_complete.py) - these don't need to be reviewed closely
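As a rough illustration of the oncogene/TSG loading step listed above, here is a sketch of how a gene's role annotation could be collapsed into a single label. It is not the code added in this PR: the column names `Gene Symbol` and `Role in Cancer`, the comma-separated role format, the `classify_role` and `load_roles` helpers, and the label names are all assumptions, and the gene names are placeholders.

```python
import csv
import io


def classify_role(role_field):
    """Collapse a comma-separated role string into one label.

    Assumes roles look like "oncogene, TSG, fusion"; this format is an
    assumption about the downloaded file, not something this PR confirms.
    """
    roles = {r.strip().lower() for r in role_field.split(",") if r.strip()}
    is_oncogene = "oncogene" in roles
    is_tsg = "tsg" in roles
    if is_oncogene and is_tsg:
        return "both"
    if is_oncogene:
        return "oncogene"
    if is_tsg:
        return "TSG"
    return "neither"


def load_roles(handle):
    """Map gene symbol -> role label from a CSV file-like object."""
    reader = csv.DictReader(handle)
    return {
        row["Gene Symbol"]: classify_role(row["Role in Cancer"])
        for row in reader
    }


# Placeholder rows for illustration; not real annotations.
toy_csv = io.StringIO(
    "Gene Symbol,Role in Cancer\n"
    'GENE_A,oncogene\n'
    'GENE_B,TSG\n'
    'GENE_C,"oncogene, TSG"\n'
    'GENE_D,fusion\n'
)
annotations = load_roles(toy_csv)
```

How genes annotated as both oncogene and TSG, or as neither, should be handled downstream is a design choice this sketch leaves open by giving them their own labels.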

greenelab/mpmp#75