In the preprocessing code for our classifiers, for a given target gene, we drop any cancer type in which fewer than 5% of samples are mutated or fewer than 10 samples in total are mutated.
We were curious how many genes would pass these filters if we looked at all ~20,000 genes we have mutation data for. This script applies the filters to each gene and counts the samples and cancer types that would be included in our classifiers.
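As a rough illustration of the two filters, here is a minimal sketch on toy data. It is not the actual preprocessing code: the function names, column names, and the toy mutation table are hypothetical, and it assumes that a gene counts as "valid" when at least one cancer type passes both filters.

```python
import pandas as pd

MIN_PROP = 0.05   # at least 5% of a cancer type's samples must be mutated
MIN_COUNT = 10    # at least 10 samples in that cancer type must be mutated


def passing_cancer_types(mutated, cancer_types,
                         min_prop=MIN_PROP, min_count=MIN_COUNT):
    """Summarize one gene per cancer type and keep the types passing both filters.

    mutated: iterable of 0/1 (or bool) mutation calls, one per sample.
    cancer_types: iterable of cancer type labels, one per sample.
    """
    df = pd.DataFrame({
        "cancer_type": list(cancer_types),
        "mutated": [int(m) for m in mutated],
    })
    # Per cancer type: number of mutated samples and proportion mutated.
    summary = df.groupby("cancer_type")["mutated"].agg(
        n_mutated="sum", prop_mutated="mean"
    )
    keep = (summary["prop_mutated"] >= min_prop) & (summary["n_mutated"] >= min_count)
    return summary[keep]


def count_valid_genes(mutation_df, cancer_types):
    """Count genes with at least one passing cancer type (assumed definition)."""
    return sum(
        len(passing_cancer_types(mutation_df[gene], cancer_types)) > 0
        for gene in mutation_df.columns
    )


# Toy data: three hypothetical cancer types with 100, 50, and 400 samples.
cancer_types = ["A"] * 100 + ["B"] * 50 + ["C"] * 400

# GENE1: 12/100 mutated in A (passes), 6/50 in B (too few mutated samples),
# 12/400 in C (only 3% mutated).
gene1 = [1] * 12 + [0] * 88 + [1] * 6 + [0] * 44 + [1] * 12 + [0] * 388
# GENE2: only 3 mutated samples overall, so no cancer type passes.
gene2 = [1] * 3 + [0] * 547

mutation_df = pd.DataFrame({"GENE1": gene1, "GENE2": gene2})

result = passing_cancer_types(mutation_df["GENE1"], cancer_types)
n_valid = count_valid_genes(mutation_df, cancer_types)
print(result)
print("valid genes:", n_valid)
```

In this toy example only cancer type A passes for GENE1. B has a high enough mutation rate but too few mutated samples, and C has enough mutated samples but too low a rate, which shows why both thresholds are applied together.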
There are quite a few valid genes:
(when CNV data is included)
(when we just look at point mutations)