greenelab / mpmp

Multimodal Pan-cancer Mutation Prediction
BSD 3-Clause "New" or "Revised" License

Gene-level filtering vs. per-cancer type filtering #85

Closed jjc2718 closed 2 years ago

jjc2718 commented 2 years ago

Currently, to build the training/testing dataset for each gene's classifier, we keep only the cancer types in which at least 5% of samples, and at least 10 samples in total, are mutated. One of our paper reviewers suggested that, instead of applying these filters to the samples from each cancer type independently, we could apply them to the whole dataset. In other words, we could choose which genes to train classifiers for based on their overall percentage/count of mutated samples.
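To make the two schemes concrete, here is a minimal sketch of how they differ for a single gene. It is not the code from this PR: the function names and the `(cancer_type, is_mutated)` input layout are hypothetical, and it assumes the gene-level filter reuses the same 5% / 10-sample thresholds, which the description above does not state.

```python
from collections import defaultdict

# Thresholds from the description above. Reusing them for the gene-level
# filter is an assumption made for this sketch.
MIN_PROPORTION = 0.05
MIN_COUNT = 10


def mutation_counts_by_cancer_type(samples):
    """Return {cancer_type: (n_mutated, n_total)} for one gene.

    `samples` is an iterable of (cancer_type, is_mutated) pairs. This input
    layout is hypothetical and chosen only for illustration.
    """
    counts = defaultdict(lambda: [0, 0])
    for cancer_type, is_mutated in samples:
        counts[cancer_type][0] += int(is_mutated)
        counts[cancer_type][1] += 1
    return {ct: (m, n) for ct, (m, n) in counts.items()}


def passes(n_mutated, n_total):
    """True if both the count and the proportion thresholds are met."""
    return (
        n_total > 0
        and n_mutated >= MIN_COUNT
        and n_mutated / n_total >= MIN_PROPORTION
    )


def per_cancer_type_filter(samples):
    """'Old' scheme: keep each cancer type that passes on its own."""
    counts = mutation_counts_by_cancer_type(samples)
    return sorted(ct for ct, (m, n) in counts.items() if passes(m, n))


def gene_level_filter(samples):
    """'New' scheme: pool all cancer types and make one decision per gene."""
    counts = mutation_counts_by_cancer_type(samples)
    total_mutated = sum(m for m, _ in counts.values())
    total = sum(n for _, n in counts.values())
    return passes(total_mutated, total)


# Toy data (made up): cancer type "A" has 20/100 samples mutated, "B" has 2/100.
toy_samples = (
    [("A", True)] * 20 + [("A", False)] * 80
    + [("B", True)] * 2 + [("B", False)] * 98
)
kept_cancer_types = per_cancer_type_filter(toy_samples)
gene_is_kept = gene_level_filter(toy_samples)
```

In the toy data, the per-cancer type scheme drops cancer type "B" because too few of its samples are mutated, while the gene-level scheme makes a single keep/drop decision on the pooled counts and so says nothing about individual cancer types.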

This PR implements this alternative gene-level filtering method and compares the results with our existing classifiers. Overall, the results are fairly similar: gene expression still seems to perform considerably better than the other data types (as we saw in #75 and #79). In general, though, performance tends to be worse with the gene-level filtering approach.

Here is a plot showing, for each gene, the difference in classifier performance between the "old" filtering scheme (per-cancer type) and the "new" filtering scheme (gene-level, across all cancer types). A positive value means that the gene's classifier performed better under the "old" filtering scheme; a negative value means it performed better under the "new" filtering scheme:

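[image: per-gene difference in classifier performance, "old" (per-cancer type) filtering minus "new" (gene-level) filtering]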

So we can see that most values are above 0 here, indicating that filtering for each cancer type independently tends to lead to better classifier performance for most genes. This seems to hold across all the data types we looked at.
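For reference, the comparison described above can be sketched roughly as follows. This is an illustration only, not the analysis code from this PR: the function names are hypothetical, the performance metric is left unspecified, and the gene names and scores are made up.

```python
def performance_differences(old_scores, new_scores):
    """Per-gene difference (old minus new) for genes scored under both schemes.

    Both arguments map gene name -> performance score. Which metric the
    scores hold is left open here; higher is assumed to be better.
    """
    shared = sorted(set(old_scores) & set(new_scores))
    return {gene: old_scores[gene] - new_scores[gene] for gene in shared}


def fraction_favoring_old(differences):
    """Fraction of genes whose classifier did better under the old scheme."""
    if not differences:
        return 0.0
    n_positive = sum(diff > 0 for diff in differences.values())
    return n_positive / len(differences)


# Made-up scores for hypothetical genes, purely to show the call pattern.
old = {"GENE_A": 0.75, "GENE_B": 0.5, "GENE_C": 0.625}
new = {"GENE_A": 0.5, "GENE_B": 0.625, "GENE_D": 0.5}

diffs = performance_differences(old, new)
share_old_better = fraction_favoring_old(diffs)
```

Only genes that have a classifier under both filtering schemes are compared, since a gene dropped by one filter has no score to subtract.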

Code changes: