vahuynh / GENIE3

Machine learning-based approach for the inference of gene regulatory networks from expression data.

Doubt on variance importance and (multi)collinearity #4

Closed by MiqG 7 months ago

MiqG commented 3 years ago

Hi!

First, thank you for developing such a cool new concept for network inference in omics!

After reading your paper, I was wondering whether the variable importances obtained could be confounded by multicollinearity between genes, as explained here. As I understand it, highly collinear features (genes) hold very similar information with respect to the target variable, so each of them will be used for splitting observations close to the root only a few times. Their importances will therefore be low when averaged over the whole ensemble. Is this true with the current implementation of variable importance?

Thank you very much again!

Miquel

vahuynh commented 3 years ago

Hi Miquel,

Yes, this is an issue that you can have with the current implementation of variable importance. When several input genes are highly correlated, the information that they bring about the target gene will tend to spread across them, resulting in lower importance scores.
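
The spreading effect can be seen in a small toy example. The sketch below is not the GENIE3 code: it swaps in a simplified stand-in, an ensemble of depth-1 regression trees (stumps) fit on bootstrap samples with a random subset of candidate features per stump, written with the Python standard library only. All names (`best_split_gain`, `stump_importances`, `gene_a`, `gene_b`) and the simulated data are hypothetical. It compares the importance of an informative gene when it is alone with its importance when a near-duplicate gene is also among the inputs.

```python
import random


def best_split_gain(xs, ys, min_leaf=5):
    """Largest drop in the sum of squared errors from one threshold split on xs."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total = sum(y for _, y in pairs)
    base = total * total / n
    best, left_sum = 0.0, 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        # Skip splits that leave a tiny leaf or fall between tied feature values.
        if i < min_leaf or n - i < min_leaf or pairs[i - 1][0] == pairs[i][0]:
            continue
        right_sum = total - left_sum
        gain = left_sum ** 2 / i + right_sum ** 2 / (n - i) - base
        best = max(best, gain)
    return best


def stump_importances(columns, y, n_stumps=300, max_features=2, seed=0):
    """Normalized importances from an ensemble of randomized stumps (toy stand-in)."""
    rng = random.Random(seed)
    n, p = len(y), len(columns)
    scores = [0.0] * p
    for _ in range(n_stumps):
        rows = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        ys = [y[r] for r in rows]
        candidates = rng.sample(range(p), max_features)  # random feature subset
        gains = {j: best_split_gain([columns[j][r] for r in rows], ys)
                 for j in candidates}
        winner = max(gains, key=gains.get)
        scores[winner] += gains[winner]  # credit the feature used for the split
    total = sum(scores)
    return [s / total for s in scores]


# Simulated data (assumption): one latent regulator signal z drives the target.
rng = random.Random(42)
n = 150
z = [rng.gauss(0, 1) for _ in range(n)]
y = [zi + rng.gauss(0, 0.5) for zi in z]
gene_a = [zi + rng.gauss(0, 0.1) for zi in z]
gene_b = [zi + rng.gauss(0, 0.1) for zi in z]  # highly correlated with gene_a
noise1 = [rng.gauss(0, 1) for _ in range(n)]
noise2 = [rng.gauss(0, 1) for _ in range(n)]

imp_alone = stump_importances([gene_a, noise1, noise2], y)
imp_pair = stump_importances([gene_a, gene_b, noise1, noise2], y)

print("gene_a alone:        ", round(imp_alone[0], 3))
print("gene_a with gene_b:  ", round(imp_pair[0], 3))
print("gene_b:              ", round(imp_pair[1], 3))
```

In this toy setting the importance that `gene_a` receives on its own is divided between `gene_a` and `gene_b` once both are present, while their combined share stays close to the original value. Real tree ensembles with deeper trees behave in a more complex way, so this only illustrates the direction of the effect.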

However, I have never tried to correct for this issue in the context of network inference.