greenelab / pancancer-evaluation

Evaluating genome-wide prediction of driver mutations using pan-cancer data
BSD 3-Clause "New" or "Revised" License

"Flip labels" positive control experiments #36

Closed jjc2718 closed 3 years ago

jjc2718 commented 3 years ago

Note: 04_coefficient_analysis.ipynb doesn't need to be reviewed; I'm just adding back changes that I (somehow) accidentally deleted in a previous PR.

In general, the goal of this PR is to "flip" (hold out) a subset of positively labeled samples for a given gene and cancer type, and see whether a model trained on the rest of the samples can differentiate the "flipped" samples (false negatives) from the true negatives.

For an idea of how I'm training/testing, I made this extremely high-tech and polished graphic:

[image: cropped_training_image]

(Shaded samples have positive labels; the others have negative labels.) The graphic shows that I'm removing the positively labeled samples from the test set: including them as positives would inflate performance (training on test data), and including them as negatives (what we did at first) would artificially deflate it.
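For readers who want something concrete, here is a rough sketch of one way the split described above could be expressed. It is not the code in this PR: the function name `flip_labels_split`, the `flip_frac` and `neg_test_frac` parameters, and the choice to hold out a random subset of negatives for testing are all assumptions made for illustration.

```python
import numpy as np


def flip_labels_split(y, flip_frac=0.3, neg_test_frac=0.3, seed=0):
    """Hypothetical helper: build train/test indices for a flip-labels run.

    Assumption: y is a binary label vector (1 = positive, 0 = negative)
    with at least one sample of each class.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    pos_ix = np.flatnonzero(y == 1)
    neg_ix = np.flatnonzero(y == 0)
    if len(pos_ix) == 0 or len(neg_ix) == 0:
        raise ValueError("need at least one positive and one negative sample")

    # Positives to "flip": held out of training and evaluated as positives.
    n_flip = max(1, int(round(flip_frac * len(pos_ix))))
    flipped_ix = rng.choice(pos_ix, size=n_flip, replace=False)

    # True negatives held out for the test set (assumed, not from the PR).
    n_neg_test = max(1, int(round(neg_test_frac * len(neg_ix))))
    neg_test_ix = rng.choice(neg_ix, size=n_neg_test, replace=False)

    # Test set = flipped positives + held-out negatives. The remaining
    # positives never appear in the test set; they stay in training only.
    test_ix = np.concatenate([flipped_ix, neg_test_ix])
    train_ix = np.setdiff1d(np.arange(len(y)), test_ix)
    return train_ix, test_ix, flipped_ix


# Toy example: 10 positives followed by 20 negatives.
y = np.array([1] * 10 + [0] * 20)
train_ix, test_ix, flipped_ix = flip_labels_split(y)
```

In this sketch the only positives in the test set are the flipped ones, so the evaluation asks whether the model can separate them from true negatives without ever having trained on them.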

Results are in 03_cross_cancer_classification/plot_flip_labels_results.ipynb.