manuel-calzolari / shapicant

Feature selection package based on SHAP and target permutation, for pandas and Spark
https://shapicant.readthedocs.io
MIT License
30 stars 4 forks

All-relevant vs Minimum-optimal feature selection? #4

Closed rictoo closed 1 year ago

rictoo commented 1 year ago

Hi! I was just wondering whether shapicant aims to perform All-relevant feature selection (as per Boruta, e.g.) or Minimum-optimal feature selection (as per mRMR, e.g.)? I'm referring to the distinction described here.

Thanks!

manuel-calzolari commented 1 year ago

This is a good question. I would say it is basically an all-relevant method, but it also depends on the base estimator used.

Take the worst case of correlation between two features: one feature is an exact duplicate of the other. When some algorithms (e.g. LightGBM, if I remember correctly) have to decide which feature to split on to minimize the loss and find two features that would reduce the loss equally, they simply split on the one that was provided first. So if you use a model that does not subsample columns and always considers all columns, one of the highly correlated (in the worst case, duplicated) features will never be selected, because it will always be "covered" by the other.

Conversely, if the model subsamples columns, there may be cases where the two features are not evaluated at the same time, so both may be given some importance. However, this depends on the implementation of the base model (if I remember correctly, sklearn estimators do not give precedence to the first feature provided, so they behave differently from LightGBM).

So in conclusion, it is basically an all-relevant method, but depending on the base estimator used, it may in some cases fail to select features that are extremely correlated with others.
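To make the "covered" effect concrete, here is a minimal toy sketch in plain Python. It is not LightGBM, sklearn, or shapicant code: `sse`, `best_split_feature`, and `count_split_usage` are hypothetical helpers, and the first-wins tie-break is simply assumed in order to mimic the behavior described above. Column 1 is an exact duplicate of column 0, and column 2 is noise.

```python
import random


def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split_feature(X, y, candidates):
    """Pick the candidate column whose best threshold split has the lowest loss.

    The strict '<' means a later column only wins if it is strictly better,
    so on a tie the column that comes first is kept (assumed tie-break rule).
    """
    best_feature, best_loss = None, float("inf")
    for j in candidates:
        # Try every distinct value except the largest as a threshold.
        for threshold in sorted({row[j] for row in X})[:-1]:
            left = [t for row, t in zip(X, y) if row[j] <= threshold]
            right = [t for row, t in zip(X, y) if row[j] > threshold]
            loss = sse(left) + sse(right)
            if loss < best_loss:
                best_feature, best_loss = j, loss
    return best_feature


def count_split_usage(X, y, n_rounds, cols_per_round, rng):
    """Count how often each column is chosen for a single split per round."""
    n_features = len(X[0])
    counts = [0] * n_features
    for _ in range(n_rounds):
        # Column subsampling; sorting keeps the original column order.
        candidates = sorted(rng.sample(range(n_features), cols_per_round))
        counts[best_split_feature(X, y, candidates)] += 1
    return counts


rng = random.Random(0)
signal = [i / 40 for i in range(40)]
X = [[v, v, rng.random()] for v in signal]      # columns: signal, duplicate, noise
y = [1.0 if v > 0.5 else 0.0 for v in signal]   # target depends only on the signal

counts_all = count_split_usage(X, y, n_rounds=60, cols_per_round=3, rng=rng)
counts_sub = count_split_usage(X, y, n_rounds=60, cols_per_round=2, rng=rng)
print("all columns considered:", counts_all)
print("2 of 3 columns sampled:", counts_sub)
```

When every column is considered in each round, the duplicate in column 1 is never chosen, because column 0 always ties with it and comes first. With two of three columns sampled per round, column 1 is chosen whenever column 0 is left out, so both copies end up being used. A real gradient boosting library has many more moving parts, so treat this only as an illustration of the tie-breaking argument.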

rictoo commented 1 year ago

Thank you so very much for your informative response. That clears things up and makes a ton of sense!