Analysis of held-out cancer type classification results

In my previous PR (#20), I analyzed mutation prediction results from stratified cross-validation experiments, i.e. the training set was composed of the same cancer types, in the same proportions, as the test set. This PR adds a similar analysis, but now holding out single cancer types. I ran two types of experiments:

1. Train on 75% of the data from a single cancer type, and test on the other 25% of the data from that cancer type.
2. Train on the same 75% of the data plus data from all other cancer types in TCGA with sufficiently many mutations in the given gene, and test on the same 25% of the data as before.
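
To make the two setups concrete, here is a minimal sketch of how the splits relate to each other. It is an illustration, not the code in this PR: the `(sample_id, cancer_type)` records, the `split_held_out` helper, and the fixed seed are all hypothetical, and the filtering of other cancer types by mutation count is left out.

```python
import random


def split_held_out(samples, cancer_type, train_frac=0.75, seed=42):
    """Return (single_train, pancancer_train, test) for one cancer type.

    samples: list of (sample_id, cancer_type) tuples (hypothetical format).
    """
    rng = random.Random(seed)
    target = [s for s in samples if s[1] == cancer_type]
    others = [s for s in samples if s[1] != cancer_type]

    # Shuffle only the held-out cancer type, then cut it 75/25.
    rng.shuffle(target)
    n_train = int(len(target) * train_frac)
    single_train = target[:n_train]
    test = target[n_train:]

    # Setup 2 reuses the same 75% and adds every other cancer type.
    # (Assumption: no mutation-count filter is applied here.)
    pancancer_train = single_train + others
    return single_train, pancancer_train, test


# Toy data: 8 samples of one cancer type, 4 of another.
samples = [(f"BRCA-{i}", "BRCA") for i in range(8)]
samples += [(f"LUAD-{i}", "LUAD") for i in range(4)]

single_train, pancancer_train, test = split_held_out(samples, "BRCA")
print(len(single_train), len(pancancer_train), len(test))
```

The point of the sketch is that both setups are evaluated on the identical test set, so any difference in performance comes from the added training data.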

The plots show p-values from t-tests comparing the cross-validation results of each setup against the negative control (shuffled labels), and comparing the two setups against each other (more documentation in the analysis notebooks).
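
As a rough sketch of the kind of comparison behind each p-value, the snippet below computes a two-sample t statistic from per-fold scores. The variant used in the notebooks is not stated here, so Welch's unequal-variance form is an assumption, and the `welch_t` helper and the scores are made up for illustration.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    n_a, n_b = len(a), len(b)
    se2 = var_a / n_a + var_b / n_b  # squared standard error of the difference
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se2 ** 2 / (
        (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    )
    return t, df


# Made-up per-fold scores: true labels vs. shuffled labels.
signal = [0.80, 0.82, 0.78, 0.81]
shuffled = [0.50, 0.52, 0.48, 0.51]

t_stat, df = welch_t(signal, shuffled)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

The p-value then follows from the t distribution with `df` degrees of freedom; if SciPy is available, `scipy.stats.ttest_ind(signal, shuffled, equal_var=False)` returns the statistic and p-value directly.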

I'm planning to do some follow-up on these results in future PRs.

greenelab/pancancer-evaluation, PR #21