scikit-learn-contrib / boruta_py

Python implementations of the Boruta all-relevant feature selection method.
BSD 3-Clause "New" or "Revised" License

Question: High Collinearity, how does Boruta handle? #44

Open GinoWoz1 opened 5 years ago

GinoWoz1 commented 5 years ago

Hello,

Thanks for the package, I found it quite interesting.

When there are variables that are highly correlated, could that affect the Z-scores?

The reason I ask is that, in the past, I have seen groups of highly correlated variables where the importance of the variables within the group varied widely.

Would it make sense to handle the collinearity problem before running Boruta?
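For instance, something like the rough sketch below is what I had in mind (my own pre-filter idea, not part of boruta_py itself; the 0.95 cutoff is arbitrary), where one column from each highly correlated pair is dropped before Boruta ever sees the data:

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one column from each pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # look at the upper triangle only, so each pair is inspected exactly once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# X_df = pd.DataFrame(...)            # feature matrix as a DataFrame
# X_reduced = drop_correlated(X_df)   # then pass X_reduced.values to BorutaPy
```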

Sincerely, G

danielhomola commented 5 years ago

https://stats.stackexchange.com/questions/94130/does-boruta-feature-selection-in-r-take-into-account-the-correlation-between-v

GinoWoz1 commented 5 years ago

Thanks @danielhomola, I appreciate you taking the time to reply. That answer helps explain that Boruta will always bring up the most important predictors. My question was about the robustness of the z-score estimate of the null distribution versus the regular variables; sorry I wasn't clear.

In my case, I have seen random forests where, for example, two features are highly correlated and one of them ranks #2 in feature importance while the other is dead last. If I removed another variable that wasn't one of the highly correlated ones, the importance scores would shift around, more so for the highly correlated variables.
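To illustrate what I mean, here is a small toy reproduction (synthetic data made up for this comment; the exact numbers will vary from run to run):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
n = 2000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # near-duplicate of x1
x3 = rng.normal(size=n)                    # independent, also informative
y = 3 * x1 + x3 + rng.normal(scale=0.5, size=n)

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(np.column_stack([x1, x2, x3]), y)
print("with x3:   ", rf.feature_importances_)   # x1 and x2 share the credit unevenly

rf2 = RandomForestRegressor(n_estimators=200, random_state=0)
rf2.fit(np.column_stack([x1, x2]), y)
print("without x3:", rf2.feature_importances_)  # the x1/x2 split shifts again
```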

My interest is in whether the z-score of the noise distribution would pick up the phenomenon above and avoid counting out groups of highly correlated variables because of their sometimes noisy feature importance scores. I had run into the unreliability of feature importance scores before, and the article at the bottom helped explain it a little.

I am still a little ignorant of the Boruta method, although I have read the paper. I'm just trying to get a better intuition for how it works and for the interaction effects of the random forest feature importance scores (I realize you can also use other estimators). Thanks for your patience.
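In case it helps, this is roughly the kind of experiment I have in mind (a minimal sketch on synthetic data; the estimator settings are arbitrary, not a recommendation), just to see how BorutaPy treats a correlated pair relative to its shadow features:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from boruta import BorutaPy

rng = np.random.RandomState(42)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)    # highly correlated with x1
noise = rng.normal(size=(n, 5))             # irrelevant columns
X = np.column_stack([x1, x2, noise])
y = 2 * x1 + rng.normal(scale=0.5, size=n)

rf = RandomForestRegressor(n_jobs=-1, max_depth=5)
selector = BorutaPy(rf, n_estimators='auto', random_state=42)
selector.fit(X, y)                          # BorutaPy expects plain numpy arrays

print(selector.support_)        # does it confirm both x1 and x2?
print(selector.support_weak_)   # tentative features
print(selector.ranking_)        # 1 = confirmed, 2 = tentative, higher = rejected
```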

Article describing how finicky feature importance scores can sometimes be:

http://explained.ai/rf-importance/index.html
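For reference, the alternative that article favors over the default impurity-based scores, permutation importance, is available in scikit-learn as sklearn.inspection.permutation_importance (sklearn >= 0.22). A minimal sketch, again on made-up data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, n_informative=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
result = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=0)
print(result.importances_mean)   # mean score drop when each column is shuffled
```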