greenelab / pancancer-evaluation

Evaluating genome-wide prediction of driver mutations using pan-cancer data
BSD 3-Clause "New" or "Revised" License

Model coefficient stability analysis #31

Closed by jjc2718 3 years ago

jjc2718 commented 3 years ago

The goal of this analysis was to use the model coefficients of our mutation prediction classifiers to evaluate similarity between models. Since we're using elastic net logistic regression (which zeroes out coefficients for most genes), we can compare the sets of nonzero coefficients between models, and if those sets largely overlap we say the models are similar.
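As a rough illustration of the comparison described above, here is a minimal sketch that scores two models by the overlap of their nonzero coefficients. The use of Jaccard similarity, the function names, and the gene names and coefficient values are all assumptions made for this example; the issue does not say which similarity measure was actually used.

```python
def nonzero_features(coefs, tol=1e-12):
    """Return the names of features whose coefficient is not (numerically) zero."""
    return {name for name, value in coefs.items() if abs(value) > tol}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union.

    Two empty sets are treated as identical (similarity 1.0).
    """
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


# Hypothetical coefficients from two fitted models (made-up values).
model_a = {"GENE1": 0.8, "GENE2": 0.0, "GENE3": -0.3, "GENE4": 0.0}
model_b = {"GENE1": 0.5, "GENE2": 0.2, "GENE3": 0.0, "GENE4": 0.0}

similarity = jaccard(nonzero_features(model_a), nonzero_features(model_b))
print(similarity)
```

A set-based measure like this ignores coefficient sign and magnitude; comparing the coefficient vectors directly (for example with a rank correlation) would be another reasonable choice.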

The idea was to eventually use this to define similarities for the same gene across different cancer types (e.g. if we noticed that our KRAS mutation predictor selects similar genes in thyroid cancer and colon cancer, we would hypothesize that KRAS mutations have similar effects on gene expression in those cancer types, which could be interesting biologically).

Unfortunately, this doesn't work as well as we thought it would. Even for models trained on different cross-validation folds of the same gene and cancer type, we see considerable variation in the nonzero coefficients. This is probably due to the large amount of multicollinearity in gene expression data: in many cases there are multiple predictors/genes in the dataset conveying essentially the same information, so the model can pick one or a few of them essentially arbitrarily.
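The effect of redundant predictors can be seen in a toy simulation. The sketch below is not elastic net and does not use the project's data: as a simpler stand-in, it picks the single feature most correlated with a continuous target in each simulated fold. Two of the three synthetic features are noisy copies of the same underlying signal, so which of the two "wins" in a given fold should be close to a coin flip. All names, noise levels, and sample sizes are assumptions chosen for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def top_feature_in_fold(rng, n_samples=50):
    """Simulate one fold and return the index of the feature most correlated with the target."""
    signal = [rng.gauss(0, 1) for _ in range(n_samples)]
    features = [
        [s + rng.gauss(0, 0.3) for s in signal],      # feature 0: noisy copy of the signal
        [s + rng.gauss(0, 0.3) for s in signal],      # feature 1: a second, redundant noisy copy
        [rng.gauss(0, 1) for _ in range(n_samples)],  # feature 2: unrelated to the target
    ]
    target = [s + rng.gauss(0, 0.3) for s in signal]
    scores = [abs(pearson(f, target)) for f in features]
    return scores.index(max(scores))


rng = random.Random(0)  # fixed seed so the run is repeatable
picks = [top_feature_in_fold(rng) for _ in range(20)]
print(picks)
```

Both redundant features carry the same information about the target, so either choice yields a similar predictor even though the selected feature differs between folds. That is the sense in which coefficient overlap can understate how similar two models really are.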

This is a fairly well-documented characteristic of feature selection in linear models on datasets with collinear features, so it isn't too surprising.