microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[SHAP] higher feature importance == higher weight? #1350

Closed · germayneng closed this issue 6 years ago

germayneng commented 6 years ago

Hi all,

thank you for making such an awesome module. It is my bread and butter. I do have a question, and it is not really a technical problem, just some clarification on the LightGBM concepts:

I understand that the feature importance in LightGBM gives us the raw number of splits. Conceptually, let's say variable A has the highest importance, which means it contributes more to the splitting and hence to the prediction of the target variable. Variable A will of course be the highest node in a tree.

Does this mean that variable A will result in a higher gain/loss in actual probability as we move from variable A's node down through its splits when computing the target variable (let us treat this as a classification problem), compared to variable B? Or can variable A, although it takes part in more splits and sits higher in the tree, still result in a smaller gain/loss in value than variable B?

Does importance play a part in the weight of the value gained when we end up computing the target variable?

bbennett36 commented 6 years ago

Use SHAP - https://github.com/slundberg/shap

Then to answer your question -
"I understand that based off feature importance in lgbm, we will get the raw number of splits." - This only refers to the 'split' feature importance, which isn't really a good metric to use. Numeric features will almost always have a higher split count than binary features, even if the binary feature is more 'important', because the model just needs to try different thresholds/ranges, which means more splits.

Here are all of the feature importance metrics:
If "split", the result contains the number of times the feature is used in the model. If "gain", the result contains the total gain of the splits which use the feature.
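
For example, a minimal sketch (synthetic data and default parameters assumed here, not taken from this thread) of pulling both importance types out of a trained Booster:

```python
import numpy as np
import lightgbm as lgb

# Hypothetical toy data, just so there is a fitted model to inspect.
rng = np.random.RandomState(0)
X = rng.rand(500, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

bst = lgb.train({"objective": "binary", "verbose": -1},
                lgb.Dataset(X, label=y),
                num_boost_round=50)

# "split": how many times each feature is used in a split across all trees.
print(bst.feature_importance(importance_type="split"))

# "gain": total gain accumulated by the splits that use each feature.
print(bst.feature_importance(importance_type="gain"))
```

The sklearn wrapper and lgb.plot_importance expose the same switch through an importance_type argument in recent versions.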

If you're not going to use SHAP and are going to use the built-in feature importance in LightGBM, I would just use gain. The feature importance metrics in LightGBM are not always consistent, though, and SHAP is (I would read the paper if you need to know more in depth). We use SHAP exclusively for feature importance now at my work, and I would highly recommend it. SHAP can show you how features actually affect the output, which has been very helpful and seems like what you're looking for.
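
As a rough illustration of that SHAP workflow (reusing the hypothetical bst and X from the snippet above; the exact return format depends on the shap and LightGBM versions):

```python
import numpy as np
import shap  # https://github.com/slundberg/shap

explainer = shap.TreeExplainer(bst)
shap_values = explainer.shap_values(X)  # one attribution per observation, per feature

# Some shap versions return a list of per-class arrays for classifiers; pick one class if so.
values = shap_values[1] if isinstance(shap_values, list) else shap_values

# A simple global ranking: mean absolute SHAP value per feature.
print(np.abs(values).mean(axis=0))

# Per-feature effect directions and magnitudes across the whole dataset.
shap.summary_plot(values, X)
```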

germayneng commented 6 years ago

@bbennett36 thank you for giving this insight. I do have some questions:

1) How do I set gain as the feature importance type for LightGBM?

2) SHAP is really good. However, it feels like LIME: it produces explanations for a particular instance or test set. So when you mention that you use it for feature importance, do you mean that you use SHAP to evaluate your predictions and, from there, identify which feature impacts the predictions the most, i.e. treat that as the most important feature?

Edit: is this what you proposed? [image attached]

Laurae2 commented 6 years ago

The LightGBM R and Python wrappers can predict feature importance and SHAP values. SHAP returns a matrix (per observation, per feature) which you can analyze to get insight into the model's predictions. "Most important feature" is very subjective, and it is up to the user to decide what to look for (the biggest values might not mean the best).
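
For reference, a minimal sketch of that wrapper-level call in Python (assuming a trained Booster bst and a feature matrix X as in the snippets above):

```python
# pred_contrib=True returns one contribution per feature per row,
# plus a final column holding the expected value (the model's base score).
contrib = bst.predict(X, pred_contrib=True)  # shape: (n_rows, n_features + 1)

# The row-wise sum of the contributions reproduces the raw model output
# (log-odds for a binary objective).
raw_score = contrib.sum(axis=1)
```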

StrikerRUS commented 6 years ago

Since we are speaking here about SHAP, @slundberg do you have plans to add interactions into LightGBM?

slundberg commented 6 years ago

SHAP interaction values are being added to the C++ implementation in the SHAP Python package (used for sklearn models right now). There are also some large performance improvements that will hopefully be part of that, such as sparse outputs and a much faster runtime. The idea is that rather than reimplementing things for each package, we can have a single common implementation. Once that lands, we can then just take in LightGBM models, or eventually copy the C++ into the LightGBM codebase to enable R support.

Also, to respond to the OP: I think a recent Medium article of mine would be helpful given your question. It is about XGBoost, but a similar story applies to LightGBM: https://towardsdatascience.com/interpretable-machine-learning-with-xgboost-9ec80d148d27