greenelab / mpmp

Multimodal Pan-cancer Mutation Prediction
BSD 3-Clause "New" or "Revised" License

Add script to count valid samples/cancer types #76

Closed jjc2718 closed 2 years ago

jjc2718 commented 2 years ago

In the preprocessing code for our classifiers, we keep a cancer type for a given target gene only if at least 5% of its samples are mutated and at least 10 of its samples in total are mutated. Cancer types that fail either threshold are filtered out.

We were curious how many genes would pass these filters if we applied them to all ~20,000 genes we have mutation data for. This script applies the filters for each gene and counts the number of samples and cancer types that would be included in our classifiers.
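For readers who want a concrete picture of the filter, here is a minimal sketch of the per-gene counting logic. It is not the script added in this PR: the function name `count_valid_samples`, the `(cancer_type, is_mutated)` record layout, and the toy data are all hypothetical, and only the two thresholds (5% and 10) come from the description above.

```python
from collections import defaultdict


def count_valid_samples(records, min_prop=0.05, min_count=10):
    """Count the cancer types and samples that pass the filters for one gene.

    `records` is a hypothetical iterable of (cancer_type, is_mutated) pairs,
    one per sample, describing mutation status for a single target gene.
    """
    totals = defaultdict(int)   # samples per cancer type
    mutated = defaultdict(int)  # mutated samples per cancer type
    for cancer_type, is_mutated in records:
        totals[cancer_type] += 1
        mutated[cancer_type] += int(bool(is_mutated))

    # Keep a cancer type only if it passes BOTH thresholds.
    valid_types = sorted(
        ct for ct in totals
        if mutated[ct] >= min_count and mutated[ct] / totals[ct] >= min_prop
    )
    return {
        "cancer_types": valid_types,
        "n_cancer_types": len(valid_types),
        "n_samples": sum(totals[ct] for ct in valid_types),
    }


def make_records(cancer_type, n_total, n_mutated):
    """Build toy records: n_mutated mutated samples out of n_total."""
    return [(cancer_type, i < n_mutated) for i in range(n_total)]


# Toy data (made up for illustration):
#   A: 20/100 mutated  -> passes both thresholds
#   B: 5/50 mutated    -> 10% mutated, but fewer than 10 mutated samples
#   C: 12/400 mutated  -> 12 mutated samples, but only 3% mutated
toy_records = (
    make_records("A", 100, 20)
    + make_records("B", 50, 5)
    + make_records("C", 400, 12)
)
toy_result = count_valid_samples(toy_records)
```

In this toy case only cancer type A survives, so the gene would contribute one cancer type and 100 samples. Repeating the call once per gene and tallying the genes with at least one surviving cancer type would give a gene count; treating that as the definition of a "valid" gene is an assumption here.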

There are quite a few valid genes:

Screen Shot 2022-02-23 at 12 55 59 PM

(when CNV data is included)

Screen Shot 2022-02-23 at 12 58 10 PM

(when we just look at point mutations)