Closed jjc2718 closed 2 years ago
Did you find that all non-Vogelstein genes are more predictive using expression? Or just that some of them are, compared to using the Vogelstein set?
We found that proportionally more genes are better predicted using expression in the larger gene set than in the Vogelstein gene set alone. It's definitely not all of the new genes that are better predicted with expression, but a lot of them seem to be.
Do you think it's how the oncogenes are defined in the different datasets?
I don't think so - we only use oncogene/TSG info to determine whether/which kind of CNV info to use to determine mutated samples. So even if we get the wrong annotation, it would just add a small amount of noise into our labels.
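To make the labeling logic concrete, here is a minimal sketch of how an oncogene/TSG annotation could decide which CNV calls count toward "mutated". The function name, the argument names, and the specific rule (copy gain for oncogenes, copy loss for TSGs, point mutations only otherwise) are assumptions for illustration, not necessarily what the repo does.

```python
def label_mutated(point_mutated, copy_gain, copy_loss, classification):
    """Return a 0/1 label per sample for one gene (hypothetical helper).

    point_mutated, copy_gain, copy_loss: equal-length lists of 0/1 flags.
    classification: "Oncogene", "TSG", or anything else (CNV info ignored).
    """
    if classification == "Oncogene":
        cnv = copy_gain   # assumed rule: gains activate oncogenes
    elif classification == "TSG":
        cnv = copy_loss   # assumed rule: losses inactivate tumor suppressors
    else:
        cnv = [0] * len(point_mutated)  # unknown annotation: point mutations only
    # A sample is labeled mutated if it has a point mutation or a relevant CNV.
    return [int(bool(p) or bool(c)) for p, c in zip(point_mutated, cnv)]


# Toy data for four samples (made up for illustration).
point = [1, 0, 0, 0]
gain = [0, 1, 0, 0]
loss = [0, 0, 1, 0]

as_oncogene = label_mutated(point, gain, loss, "Oncogene")
as_tsg = label_mutated(point, gain, loss, "TSG")
```

Under this rule, a wrong annotation only flips labels for samples whose sole alteration is a CNV in the other direction, which is why it would act as label noise rather than a systematic shift.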
In #79, we saw that when we switch from using the Vogelstein et al. cancer genes to a larger set of cancer-associated genes, relative performance between data types shifts toward favoring gene expression over other data layers.
We wanted to explore why this might be the case by looking at GO functional enrichment of the different datasets. This is implemented in 01_explore_data/explore_cancer_gene_sets.ipynb, and the results are summarized at the end of that script.
TL;DR: nothing jumps out at me as being a super obvious difference between the datasets, but the non-Vogelstein cancer genes might be enriched for transcription factors/genes involved in transcription, which could be reflected in the usefulness of the gene expression data for the added genes. The newly added genes also seem to include slightly more cell cycle regulators and genes involved in DNA binding.
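For readers unfamiliar with GO enrichment, the core calculation is usually an over-representation test. The sketch below uses a one-sided hypergeometric test in plain Python; it is a generic illustration and not necessarily the method or tool used in the notebook. The function name and the toy counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_term, n_selected, n_overlap):
    """One-sided hypergeometric p-value P(X >= n_overlap).

    n_background: number of genes in the background set
    n_term:       background genes annotated with the GO term
    n_selected:   genes in the set being tested (e.g. the added cancer genes)
    n_overlap:    selected genes annotated with the GO term
    """
    total = comb(n_background, n_selected)
    upper = min(n_term, n_selected)
    # Sum the probability of seeing n_overlap or more annotated genes by chance.
    tail = sum(
        comb(n_term, k) * comb(n_background - n_term, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy counts (made up): 10 background genes, 3 annotated with the term,
# 2 genes selected, both of them annotated.
p_value = hypergeom_enrichment_p(10, 3, 2, 2)
```

In practice the p-values from many GO terms would also need multiple-testing correction before calling any term enriched.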