linkedin / FastTreeSHAP

Fast SHAP value computation for interpreting tree-based models
BSD 2-Clause "Simplified" License

Catboost bug #7

Closed: nilslacroix closed this issue 2 years ago

nilslacroix commented 2 years ago

CatBoost produces a 'TreeEnsemble' object has no attribute "num_nodes" error with this code. Btw, do you support a background dataset parameter, like shap does for "interventional" vs. "tree_path_dependent"? Because if your underlying code uses the "interventional" method, this might be related to this bug: https://github.com/slundberg/shap/issues/2557

from catboost import CatBoostRegressor
import fasttreeshap
import shap  # needed below for shap.datasets and shap.plots

X, y = shap.datasets.boston()

model = CatBoostRegressor(task_type="CPU", logging_level="Silent").fit(X, y)
explainer = fasttreeshap.TreeExplainer(model, algorithm="v2", n_jobs=-1)
shap_values = explainer(X)

# visualize the first prediction's explanation
shap.plots.waterfall(shap_values[25])
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [131], in <cell line: 13>()
     10 # explain the model's predictions using SHAP
     11 # (same syntax works for LightGBM, CatBoost, scikit-learn, transformers, Spark, etc.)
     12 explainer = fasttreeshap.TreeExplainer(model, algorithm="v2", n_jobs=-1)
---> 13 shap_values = explainer(X)
     15 # visualize the first prediction's explanation
     16 shap.plots.waterfall(shap_values[25])

File ~\miniconda3\envs\Master\lib\site-packages\fasttreeshap\explainers\_tree.py:256, in Tree.__call__(self, X, y, interactions, check_additivity)
    253     feature_names = getattr(self, "data_feature_names", None)
    255 if not interactions:
--> 256     v = self.shap_values(X, y=y, from_call=True, check_additivity=check_additivity, approximate=self.approximate)
    257 else:
    258     assert not self.approximate, "Approximate computation not yet supported for interaction effects!"

File ~\miniconda3\envs\Master\lib\site-packages\fasttreeshap\explainers\_tree.py:379, in Tree.shap_values(self, X, y, tree_limit, approximate, check_additivity, from_call)
    376 algorithm = self.algorithm
    377 if algorithm == "v2":
    378     # check if memory constraint is satisfied (check Section Notes in README.md for justifications of memory check conditions in function _memory_check)
--> 379     memory_check_1, memory_check_2 = self._memory_check(X)
    380     if memory_check_1:
    381         algorithm = "v2_1"

File ~\miniconda3\envs\Master\lib\site-packages\fasttreeshap\explainers\_tree.py:483, in Tree._memory_check(self, X)
    482 def _memory_check(self, X):
--> 483     max_leaves = (max(self.model.num_nodes) + 1) / 2
    484     max_combinations = 2**self.model.max_depth
    485     phi_dim = X.shape[0] * (X.shape[1] + 1) * self.model.num_outputs

AttributeError: 'TreeEnsemble' object has no attribute 'num_nodes'
jlyang1990 commented 2 years ago

Fixed this issue by skipping the "memory check" for CatBoost, since CatBoost is not supported in the current version of fasttreeshap (mentioned in https://github.com/linkedin/FastTreeSHAP/issues/6).
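
For illustration, a minimal sketch of that kind of guard in Tree.shap_values (hypothetical, not the actual commit): fall back to the original algorithm when the parsed model lacks the num_nodes attribute that the v2 memory check relies on.

# Hypothetical guard, not the actual patch: CatBoost models don't
# populate num_nodes, so skip the v2 memory check and fall back to "v1".
if algorithm == "v2" and not hasattr(self.model, "num_nodes"):
    algorithm = "v1"
elif algorithm == "v2":
    memory_check_1, memory_check_2 = self._memory_check(X)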

fasttreeshap is built only for "tree_path_dependent". You can still run "interventional" in fasttreeshap, but its performance should be the same as in shap. I would suggest posting issues related to "interventional" directly on the shap GitHub page.
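
For reference, a sketch of the two modes, assuming fasttreeshap keeps shap's TreeExplainer signature (the data and feature_perturbation arguments); the toy data and model are only there to make the snippet self-contained:

import fasttreeshap
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data and model, just to make the snippet self-contained.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + rng.normal(size=200)
model = RandomForestRegressor(n_estimators=50).fit(X, y)

# Accelerated path: no background data, "tree_path_dependent" perturbation.
explainer_fast = fasttreeshap.TreeExplainer(
    model, feature_perturbation="tree_path_dependent", algorithm="v2", n_jobs=-1
)
shap_values_fast = explainer_fast(X)

# "interventional" needs a background dataset and runs at shap's speed.
explainer_slow = fasttreeshap.TreeExplainer(
    model, data=X[:100], feature_perturbation="interventional"
)
shap_values_slow = explainer_slow(X)

Passing a background dataset via data selects the "interventional" estimator, which is why it cannot benefit from FastTreeSHAP's path-dependent algorithms.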

nilslacroix commented 2 years ago

Is this also true for XGBoost and LightGBM? From my understanding of the paper, "tree_path_dependent" is the better method for explaining model performance, and "interventional" is used to explain relationships in the data. Also, "interventional" is a lot slower, so wouldn't a fast TreeSHAP method make a lot of sense for it?

jlyang1990 commented 2 years ago

Yes. fasttreeshap accelerates the SHAP value computation for XGBoost and LightGBM only for "tree_path_dependent".

Thanks for your suggestion! It may make sense to accelerate "interventional" as well; however, the algorithms behind "tree_path_dependent" and "interventional" are completely different. It is much harder to accelerate "interventional" (and I doubt the feasibility of doing so on the algorithm side), so it is out of the scope of this package.
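
To make the scope concrete, a rough timing sketch (the dataset and model are hypothetical) comparing the original TreeSHAP algorithm ("v1") against FastTreeSHAP v2 on an XGBoost model, both using the default "tree_path_dependent" perturbation:

import time
import numpy as np
import xgboost
import fasttreeshap

# Toy regression problem and model, just for a relative timing comparison.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))
y = X[:, 0] + rng.normal(size=5000)
model = xgboost.XGBRegressor(n_estimators=200, max_depth=8).fit(X, y)

for algo in ("v1", "v2"):
    explainer = fasttreeshap.TreeExplainer(model, algorithm=algo, n_jobs=-1)
    start = time.perf_counter()
    explainer(X)
    print(f"{algo}: {time.perf_counter() - start:.2f}s")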