Currently, to build the training/testing dataset for our classifiers, we filter, for each gene, to the cancer types that have at least 5% of samples mutated and at least 10 total samples mutated. One of our paper reviewers suggested that instead of applying these filters to the samples from each cancer type independently, we could apply them to the whole dataset. In other words, we could choose which genes to train classifiers for based on their overall percentage and count of mutated samples.
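To make the difference between the two schemes concrete, here is a minimal sketch in plain Python. It is not the code in mpmp/utilities/tcga_utilities.py: the function names and input shapes are hypothetical, and it assumes that "total mutations" means the number of samples with a mutation in the gene. The default thresholds are the ones quoted in this PR description.

```python
from collections import defaultdict


def per_cancer_type_filter(samples, min_pct=0.05, min_count=10):
    """"Old" scheme (hypothetical helper): for one gene, return the cancer
    types in which enough samples are mutated.

    samples: iterable of (cancer_type, is_mutated) pairs for a single gene.
    """
    totals = defaultdict(int)
    mutated = defaultdict(int)
    for cancer_type, is_mutated in samples:
        totals[cancer_type] += 1
        mutated[cancer_type] += int(is_mutated)
    # A cancer type is kept only if it passes both the count and the
    # proportion threshold
    return sorted(
        ct for ct in totals
        if mutated[ct] >= min_count and mutated[ct] / totals[ct] >= min_pct
    )


def gene_level_filter(mutations_by_gene, min_pct=0.01, min_count=100):
    """"New" scheme (hypothetical helper): return the genes that are mutated
    in enough samples across the whole dataset, ignoring cancer type.

    mutations_by_gene: dict mapping gene -> list of booleans, one per sample.
    """
    keep = []
    for gene, flags in mutations_by_gene.items():
        n_mutated = sum(flags)
        if flags and n_mutated >= min_count and n_mutated / len(flags) >= min_pct:
            keep.append(gene)
    return sorted(keep)
```

The first function decides which cancer types to include for a given gene, while the second decides which genes to include at all. That is the distinction the reviewer's suggestion turns on.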
This PR implements this alternative gene-level filtering method and compares the results with our existing classifiers. Overall, the results are fairly similar, in the sense that gene expression still seems to perform considerably better than the other data types (as we saw in #75 and #79). In general, though, performance tends to be worse with the gene-level filtering approach.
Here is a plot showing, for each gene, the difference in classifier performance between the "old" filtering scheme (per cancer type) and the "new" filtering scheme (gene-level across all cancer types). A positive value means that gene's classifier performed better under the "old" filtering scheme, and a negative value means it performed better under the "new" filtering scheme:
Most values are above 0 here, indicating that filtering for each cancer type independently tends to lead to better classifier performance for most genes. This seems to hold across all the data types we looked at.
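The comparison described above can be sketched as follows. This is an illustrative outline rather than the code in the comparison notebook: the function names are hypothetical, the performance metric is left unspecified, and it assumes only genes with a classifier under both schemes are compared.

```python
def filtering_difference(old_scores, new_scores):
    """Per-gene difference (old minus new) for genes scored under both schemes.

    old_scores, new_scores: dicts mapping gene -> performance score.
    Positive values favor the "old" (per-cancer type) filtering scheme.
    """
    shared = old_scores.keys() & new_scores.keys()
    return {gene: old_scores[gene] - new_scores[gene] for gene in sorted(shared)}


def fraction_favoring_old(diffs):
    """Fraction of compared genes whose difference is strictly above 0."""
    if not diffs:
        return 0.0
    return sum(d > 0 for d in diffs.values()) / len(diffs)
```

Genes that pass only one of the two filters have no counterpart to compare against, so they drop out of the difference entirely.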
Code changes:
01_explore_data/count_dataset_filters.ipynb tests several mutation percentage/mutation count filters (I settled on 100 total mutations and 1% of samples mutated across all cancer types, which seems to be a decent tradeoff between sensitivity and specificity)
02_classify_mutations/compare_filtering.ipynb compares the "old" and "new" filtering schemes
mpmp/utilities/tcga_utilities.py has the implementation of the gene-level filtering
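For a sense of how candidate thresholds might be compared, here is a rough sketch that counts how many genes would pass each combination of count and percentage thresholds. It is not taken from count_dataset_filters.ipynb: the function name and input shape are hypothetical, and it again assumes "total mutations" means the number of mutated samples.

```python
from itertools import product


def count_genes_passing(mutations_by_gene, min_counts, min_pcts):
    """Return {(min_count, min_pct): number of genes passing both thresholds}.

    mutations_by_gene: dict mapping gene -> list of booleans, one per sample.
    """
    # Precompute (mutated sample count, mutated proportion) once per gene
    stats = [
        (sum(flags), sum(flags) / len(flags))
        for flags in mutations_by_gene.values()
        if flags
    ]
    results = {}
    for min_count, min_pct in product(min_counts, min_pcts):
        results[(min_count, min_pct)] = sum(
            1 for n, pct in stats if n >= min_count and pct >= min_pct
        )
    return results
```

Stricter thresholds leave fewer genes, each with more mutated samples to train on. Looser thresholds keep more genes but with fewer positive examples per classifier.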