greenelab / pancancer-evaluation

Evaluating genome-wide prediction of driver mutations using pan-cancer data
BSD 3-Clause "New" or "Revised" License

Stratified cross-validation experiments/visualizations #20

Closed jjc2718 closed 3 years ago

jjc2718 commented 3 years ago

This implements a cross-validation scheme similar to the one used in the BioBombe paper. It provides a baseline for comparing against the results when single cancer types are held out, so that we know what works when the held-out set includes samples from every cancer type (we would expect to see some overlap here, but the differences will be interesting).
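For readers unfamiliar with the idea, here is a minimal sketch of cross-validation stratified by cancer type, using only the Python standard library. It is an illustration, not the code in this PR: the function names `stratified_folds` and `train_test_splits`, the fold count, and the toy cancer-type labels are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_folds(cancer_types, n_folds=4, seed=42):
    """Assign sample indices to folds so each fold has a similar cancer-type mix.

    Hypothetical helper for illustration; not taken from the repository.
    """
    rng = random.Random(seed)
    # Group sample indices by their cancer type label.
    by_type = defaultdict(list)
    for idx, ctype in enumerate(cancer_types):
        by_type[ctype].append(idx)

    folds = [[] for _ in range(n_folds)]
    for ctype in sorted(by_type):
        indices = by_type[ctype]
        rng.shuffle(indices)
        # Deal indices round-robin so each fold receives roughly
        # 1 / n_folds of the samples from this cancer type.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train_idx), sorted(test_idx)


# Toy labels (made up): 8 BRCA, 4 LUAD, and 4 SKCM samples.
cancer_types = ["BRCA"] * 8 + ["LUAD"] * 4 + ["SKCM"] * 4
folds = stratified_folds(cancer_types, n_folds=4)
splits = list(train_test_splits(folds))
```

With these toy labels every held-out fold contains two BRCA samples and one each of LUAD and SKCM, so each test set mirrors the cancer-type proportions of the full dataset.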

I tried this for two gene sets: the top 50 most mutated genes in TCGA (this is what BioBombe uses), and a set of known oncogenes/tumor suppressors from this paper (data on oncogene/TSG status comes from here: https://github.com/greenelab/pancancer/blob/master/data/vogelstein_cancergenes.tsv).

Also added statistical testing to compare results against our negative control (shuffling the true mutation labels), and added some code in 4_plot_results.ipynb to visualize the results.
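As a rough illustration of comparing against a shuffled-label negative control, the sketch below shuffles the mutation labels and then compares per-fold scores with an exact paired sign-flip permutation test. The comment above does not say which statistical test was used, so the choice of test here is an assumption, as are the helper names `shuffle_labels` and `paired_sign_flip_pvalue`. The scores are made-up placeholders, not results from these experiments.

```python
import itertools
import random


def shuffle_labels(labels, seed=0):
    """Return a shuffled copy of the labels (negative control).

    Shuffling keeps the number of mutated samples fixed but breaks the
    link between each sample's features and its label.
    """
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def paired_sign_flip_pvalue(signal_scores, control_scores):
    """One-sided exact sign-flip test on paired per-fold differences.

    Under the null hypothesis the sign of each paired difference is
    arbitrary, so we enumerate every sign assignment and count how often
    the summed difference is at least as large as the observed one.
    """
    diffs = [s - c for s, c in zip(signal_scores, control_scores)]
    observed = sum(diffs)
    n_extreme = 0
    n_total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        n_total += 1
        if sum(sign * d for sign, d in zip(signs, diffs)) >= observed:
            n_extreme += 1
    return n_extreme / n_total


# Placeholder labels: 1 = mutated, 0 = not mutated.
true_labels = [1, 0, 0, 1, 0, 1, 0, 0]
control_labels = shuffle_labels(true_labels)

# Placeholder per-fold scores (e.g. AUROC) for the real and shuffled labels.
signal = [0.81, 0.78, 0.84, 0.80]
control = [0.52, 0.49, 0.55, 0.50]
p_value = paired_sign_flip_pvalue(signal, control)
```

Note that with only four folds the smallest p-value this exact test can return is 1/16, so in practice one would pool more folds or repeat the cross-validation with several random seeds.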

jjc2718 commented 3 years ago

Responses to your questions:

> Stratifying by cancer type certainly makes sense in terms of machine learning. Will you also stratify by cancer type in your pan-cancer classifier?

IMO, the stratification that I'm doing here is mostly useful for evaluating the model (i.e. making the test set "look like" the training set). In the current experiments I'm interested in how performance differs when the test set is only a single cancer type (forthcoming PRs), as opposed to a proportional mix of cancer types like I have here.

I think it would also be interesting to see how different proportions of each cancer type in the training set affect model performance (which I think is what you're alluding to with your questions). For instance, right now some cancer types have many more samples than others, and you could imagine trying to balance these proportions in the training set in various ways. I think this is a slightly different question that I'm not planning to address in the current set of experiments, but I'm definitely interested in it for the future!

> Do you think modulating the cancer proportions you use based on mutation signature similarity to your target cancer would help?

This is a neat idea! I'll definitely keep this in mind as well.