h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

implement shap-feature-importance #16385

Open wendycwong opened 2 months ago

wendycwong commented 2 months ago

We have SHAP summary plots, but the user wants to see the actual values. Here is the answer according to @tomasfryda:

tomf Yesterday at 11:59 PM

AFAIK we don't have a method/function to do that. Usually the mean absolute contribution is used for variable importance (https://christophm.github.io/interpretable-ml-book/shap.html#shap-feature-importance), but I don't think there is just one correct way to do it. I would also probably recommend the SHAP summary plot instead, as it shows more information without additional computation. The calculation itself is quite trivial:

```python
contr = model.predict_contributions(test)  # or: model.predict_contributions(test, background_frame=train)
feature_importances = dict(zip(contr.names, contr.abs().mean()))

import matplotlib.pyplot as plt

fi = sorted(feature_importances.items(), key=lambda x: x[1])
plt.barh([x[0] for x in fi], [x[1] for x in fi])
plt.title("Feature Importances")
plt.show()
```

For tree models you don't have to specify the background frame. Calculation with a background frame is usually much slower (IIRC the number of operations is the number of rows in the background frame times the number of operations without a background frame). Generally it's still recommended to use background_frame, because the choice of background frame influences the results. The problem with not using a background frame is that you don't know how important the individual splits in the trees are. For example, if the model denies a mortgage for people taller than 3 m (~10 ft), the contributions calculated without a background frame would consider this split as important as the other splits, but with a background frame we would find out that there are no people that tall (or at least very few), so the contribution would end up lower.

[Attachment: Screenshot 2024-09-11 at 8.51.44.png]

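For reference, here is a minimal end-to-end sketch of the workaround quoted above. The dataset (the public prostate.csv sample from H2O's test-data bucket), the GBM model, the predictor columns, and the pandas aggregation are illustrative assumptions, not part of the quoted answer:

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator
import matplotlib.pyplot as plt

h2o.init()

# Assumption: the public H2O prostate sample dataset; substitute your own frames.
prostate = h2o.import_file(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv"
)
prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()
train, test = prostate.split_frame(ratios=[0.8], seed=42)

model = H2OGradientBoostingEstimator(ntrees=50, seed=42)
model.train(x=["AGE", "RACE", "PSA", "GLEASON"], y="CAPSULE", training_frame=train)

# TreeSHAP contributions: one column per feature plus a BiasTerm column.
# Aggregating in pandas here for simplicity; the bias column is not a feature,
# so it is dropped before ranking.
contr = model.predict_contributions(test).as_data_frame()
contr = contr.drop(columns=["BiasTerm"])

# SHAP feature importance = mean absolute contribution per feature.
feature_importances = contr.abs().mean().sort_values()

feature_importances.plot.barh(title="SHAP Feature Importances")
plt.tight_layout()
plt.show()
```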

Implement this for the R and Python clients.
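One possible shape for the requested Python-client helper is sketched below. This is hypothetical only: the function name shap_feature_importance and its signature are not an existing H2O API; it simply wraps predict_contributions() as described in the quoted answer. An R counterpart could presumably wrap the R client's contributions prediction in the same way.

```python
# Hypothetical sketch only: neither this function nor its name exists in H2O
# today; it wraps the workflow quoted above into a single call.
def shap_feature_importance(model, frame, background_frame=None):
    """Return {feature: mean |SHAP contribution|}, sorted by importance (descending)."""
    if background_frame is not None:
        contr = model.predict_contributions(frame, background_frame=background_frame)
    else:
        contr = model.predict_contributions(frame)
    contr = contr.as_data_frame()
    # The bias column is not a feature, so exclude it from the ranking.
    contr = contr.drop(columns=["BiasTerm"], errors="ignore")
    importances = contr.abs().mean()
    return dict(importances.sort_values(ascending=False))
```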