greenelab / pancancer-evaluation

Evaluating genome-wide prediction of driver mutations using pan-cancer data
BSD 3-Clause "New" or "Revised" License

Analysis of held-out cancer type classification results #21

Closed jjc2718 closed 3 years ago

jjc2718 commented 3 years ago

In my previous PR (#20), I analyzed mutation prediction results from stratified cross-validation experiments, i.e. the training set was composed of the same cancer types, in the same proportions, as the test set. This PR adds a similar analysis, but now holds out single cancer types. I ran two types of experiments:

  1. Train on 75% of the data from a single cancer type, and test on the remaining 25% of the data from that cancer type
  2. Train on the same 75% of the data plus data from all other cancer types in TCGA with sufficiently many mutations in the given gene, and test on the same 25% of the data as before
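
As a rough illustration of how the two setups share a test set, here is a minimal sketch. It is not the code from this repo: the function name `held_out_splits`, the `(sample_id, cancer_type)` tuple layout, and the `eligible_types` argument (standing in for "cancer types with sufficiently many mutations in the given gene") are all assumptions made for the example, and it shows a single 75/25 split rather than the full cross-validation.

```python
import random


def held_out_splits(samples, cancer_type, eligible_types, test_fraction=0.25, seed=42):
    """Build the train/test sets for both experiments (hypothetical helper).

    samples: list of (sample_id, cancer_type) tuples.
    eligible_types: other cancer types allowed into the pan-cancer training
        set (assumed to be filtered upstream by mutation count).
    """
    rng = random.Random(seed)

    # Samples from the cancer type of interest, shuffled reproducibly.
    target = [sid for sid, ct in samples if ct == cancer_type]
    rng.shuffle(target)

    # The same 25% is held out as the test set for both experiments.
    n_test = round(len(target) * test_fraction)
    test = target[:n_test]
    train_single = target[n_test:]

    # Experiment 2 adds data from the other eligible cancer types.
    others = [
        sid for sid, ct in samples
        if ct != cancer_type and ct in eligible_types
    ]
    train_pancancer = train_single + others

    return train_single, train_pancancer, test


if __name__ == "__main__":
    # Toy sample list; identifiers are made up for illustration.
    toy = (
        [(f"brca_{i}", "BRCA") for i in range(8)]
        + [(f"luad_{i}", "LUAD") for i in range(4)]
        + [(f"ov_{i}", "OV") for i in range(3)]
    )
    single, pan, test = held_out_splits(toy, "BRCA", eligible_types={"LUAD"})
    print(len(single), len(pan), len(test))
```

Because both training sets are evaluated on the identical held-out samples, any difference in performance comes from the extra pan-cancer training data rather than from a different test split.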

The plots show p-values from t-tests comparing cross-validation results from each of these setups against the negative control (shuffled labels), and comparing the two setups against each other (more documentation is in the analysis notebooks).
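
The three comparisons could look something like the sketch below. This is an assumption-laden example, not the notebook code: it uses `scipy.stats.ttest_ind` (an independent two-sample t-test), whereas the notebooks may use a different variant, and the function name `compare_setups`, the dictionary keys, and the per-fold scores are all made up for illustration.

```python
from scipy import stats


def compare_setups(single_scores, pancancer_scores, shuffled_scores):
    """Return p-values for the three comparisons (hypothetical helper).

    Each argument is a list of per-fold performance scores for one gene.
    Assumes an independent two-sample t-test; the actual analysis may differ.
    """
    return {
        # Each setup against the shuffled-label negative control.
        "single_vs_shuffled": stats.ttest_ind(single_scores, shuffled_scores).pvalue,
        "pancancer_vs_shuffled": stats.ttest_ind(pancancer_scores, shuffled_scores).pvalue,
        # The two setups against each other.
        "single_vs_pancancer": stats.ttest_ind(single_scores, pancancer_scores).pvalue,
    }


if __name__ == "__main__":
    # Made-up per-fold scores, for illustration only.
    single = [0.80, 0.82, 0.78, 0.81]
    pancancer = [0.84, 0.86, 0.83, 0.85]
    shuffled = [0.50, 0.52, 0.49, 0.51]
    for name, p in compare_setups(single, pancancer, shuffled).items():
        print(f"{name}: p = {p:.3g}")
```

If the same folds are used for every setup, a paired test such as `scipy.stats.ttest_rel` might be a better fit; which one applies depends on how the cross-validation results were generated.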

I'm planning to do some follow-up on these results in future PRs.