greenelab / mpmp

Multimodal Pan-cancer Mutation Prediction

PCA variance explained + elastic net nonzero feature distributions #19

Closed · jjc2718 closed 3 years ago

jjc2718 commented 3 years ago

Changes in this PR answer some of the questions that @nrosed brought up during my practice committee meeting talk last week.

Results:

The trend in variance explained looks essentially the same across the data types: 100 PCs explain around 60-70% of the variance, 1000 PCs explain around 80-85%, and 5000 PCs capture around 95-98%. I wouldn't expect the extra ~5% of variance to be particularly predictive of most mutations, so taking 5000 PCs as our maximum should be fine (at least I think so).
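For reference, a minimal sketch (not the actual mpmp code; the toy matrix below is illustrative) of how cumulative variance explained can be computed with scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA

# toy stand-in for a samples-by-features matrix (e.g. methylation beta values)
X = np.random.default_rng(0).uniform(0, 1, size=(200, 1000))

# fit as many components as the data allows (min of n_samples, n_features)
pca = PCA(n_components=min(X.shape))
pca.fit(X)

# cumulative proportion of variance explained by the first k components;
# e.g. cum_var[99] is the fraction captured by the first 100 PCs
cum_var = np.cumsum(pca.explained_variance_ratio_)
```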

Here's an example for the 450K methylation data - you can see the other data types in the notebooks.

[figure: cumulative variance explained vs. number of PCs, 450K methylation data]

In the nonzero coefficient analysis, the gene expression models tend to have more nonzero coefficients than the 27k methylation models, despite starting with fewer features (~15,000 for gene expression vs. ~20,000 for methylation). This is particularly true if we filter to examples where the target gene is well-predicted.

[figure: distributions of nonzero coefficient counts, gene expression vs. 27K methylation models]
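For anyone curious, counting nonzero coefficients from a fitted model is straightforward; here's a hedged sketch with a scikit-learn elastic net logistic regression (not necessarily the exact estimator or hyperparameters we use in mpmp):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# toy stand-in for a (samples x features) mutation prediction dataset
X, y = make_classification(n_samples=200, n_features=500,
                           n_informative=20, random_state=0)

# elastic net logistic regression (saga is the sklearn solver that supports it)
model = LogisticRegression(penalty='elasticnet', solver='saga',
                           l1_ratio=0.5, C=1.0, max_iter=5000)
model.fit(X, y)

# the L1 component drives many coefficients exactly to zero; count the survivors
n_nonzero = np.count_nonzero(model.coef_)
print(f'{n_nonzero} of {model.coef_.size} coefficients are nonzero')
```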

nrosed commented 3 years ago

The variance plots look good; they match what I would expect for expression data. For the nonzero coefficient analysis, what is the null hypothesis? Also, is this analysis for a single cancer type? Is this for all mutations?

jjc2718 commented 3 years ago

Ahh, good questions - answers below:

> Is this analysis for a single cancer type? Is this for all mutations?

This analysis is for models that are stratified by cancer type (train/test sets have equal proportions of all cancer types). I haven't looked at the case where we hold out a single cancer type yet; that might be an interesting next step.
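(As a rough sketch of what stratifying on cancer type can look like, with hypothetical toy labels rather than our actual pipeline:)

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data: 100 samples, 10 features, plus hypothetical cancer type labels
X = np.random.default_rng(0).normal(size=(100, 10))
y = np.random.default_rng(1).integers(0, 2, size=100)     # mutation status
cancer_type = np.repeat(['BRCA', 'COAD', 'GBM', 'LUAD'], 25)

# stratify on cancer type so train and test have equal type proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=cancer_type, random_state=42)
```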

The distribution combines models for all genes in the Vogelstein & Kinzler gene set (gene list here, original reference here).

> For the nonzero coefficient analysis, what is the null hypothesis?

In addition to looking at all genes together, I wanted to see if there was a difference in the number of nonzero coefficients for genes where we can train a "good" predictor vs. genes where we can't. The way I've been distinguishing between these cases in some of my past experiments is by running cross-validation a few times (2 replicates x 4 folds), then doing a t-test comparing those results against a model trained on shuffled labels. So the null hypothesis is that the distribution of cross-validation results (AUPR values) is the same with the true mutation labels as with shuffled mutation labels, and the alternative hypothesis is that it's different (in practice, the model with true labels is always better, although I am using a two-tailed test).
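Roughly, the comparison looks like this (a hedged sketch with illustrative names, not the exact mpmp code):

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# toy stand-in for one gene's mutation prediction problem
X, y = make_classification(n_samples=300, n_features=50,
                           n_informative=10, random_state=0)

def cv_aupr(X, y, n_replicates=2, n_folds=4):
    """Cross-validated AUPR, repeated n_replicates times (2 x 4 = 8 values)."""
    scores = []
    for rep in range(n_replicates):
        cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=rep)
        model = LogisticRegression(max_iter=1000)
        scores.extend(cross_val_score(model, X, y, cv=cv,
                                      scoring='average_precision'))
    return np.array(scores)

true_scores = cv_aupr(X, y)
shuffled_scores = cv_aupr(X, np.random.default_rng(42).permutation(y))

# two-tailed t-test; null: true-label and shuffled-label AUPRs are the same
t_stat, p_value = ttest_ind(true_scores, shuffled_scores)
```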

So looking at the results, it does look to me like there's a bit of a difference: for genes that we can predict "well", elastic net generally seems to select more features than it does for genes where there's little or no predictive signal. This isn't too surprising (at least to me), mostly just a sanity check.

nrosed commented 3 years ago

Aha! OK, cool, I get it now. Yes, it makes sense to me, though I don't have a strong intuition for why a predictive model would have more features than a non-predictive one. Also, how are you transforming your expression/methylation data? Do they use the same transform? Just curious for myself.

jjc2718 commented 3 years ago

We've been standardizing the expression data (TCGA provides RPKM values, which we then standardize for each gene independently). For methylation, we're currently just using the array beta values provided by TCGA, with no standardization/normalization.

We tried standardizing the methylation beta values too in some previous experiments, and it didn't seem to make any difference in the results, so we stuck with the simpler approach.
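A minimal sketch of that preprocessing, assuming toy data in place of the real TCGA matrices:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
expression = rng.normal(loc=5, scale=2, size=(100, 15000))   # RPKM-like values
methylation = rng.uniform(0, 1, size=(100, 20000))           # array beta values

# z-score each column (gene) independently: subtract the per-gene mean,
# divide by the per-gene standard deviation
expression_std = StandardScaler().fit_transform(expression)

# methylation beta values are already bounded in [0, 1] and are used as-is
```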