manuel-calzolari / shapicant

Feature selection package based on SHAP and target permutation, for pandas and Spark
https://shapicant.readthedocs.io
MIT License

support for xgboost enable_categorical #3

Open cadama opened 1 year ago

cadama commented 1 year ago

XGBoost has supported categorical features since 1.6, but I run into an error when using them with shapicant. Here is a minimal example:

import pandas as pd
import numpy as np
from shapicant import PandasSelector
import shap
import xgboost as xgb

num_features = pd.DataFrame(np.random.random((100, 4)), columns=list(range(4)))
categoricals = pd.DataFrame(np.random.randint(1, 10, (100, 3)), dtype="category", columns=list(range(4, 7)))

X_train = pd.concat([num_features, categoricals], axis=1)
X_test = X_train.copy()
y_train = np.random.random((100, ))
y_test = np.random.random((100, ))

# Count feature columns; len() on a DataFrame would count rows, not columns
n_features = num_features.shape[1] + categoricals.shape[1]

params = {
    "colsample_bynode": n_features ** 0.5 / n_features,
    "learning_rate": 1,
    "max_depth": 5,
    "num_boost_round": 1,
    "num_parallel_tree": 100,
    "objective": "reg:logistic",
    "subsample": 0.62,
    "enable_categorical": True,
    "tree_method": "hist",
    "booster": "gbtree",
    "eval_metric": ["logloss", "rmse"],
    "base_score": y_train.mean(),
}

model = xgb.XGBRFRegressor(**params, random_state=42)
model.fit(X_train, y_train)

# Use PandasSelector with 30 iterations
explainer_type = shap.TreeExplainer
selector = PandasSelector(model, explainer_type, n_iter=30, random_state=42)

selector.fit(
    X_train,
    y_train,
    X_validation=X_test,
    estimator_params={
        "eval_set": [(X_test, y_test)]
    },
)

This results in:

[10:01:13] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1667849614592/work/src/learner.cc:767: 
Parameters: { "num_boost_round" } are not used.
Computing true SHAP values:   0%|          | 0/30 [00:00<?, ?it/s][10:01:14] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1667849614592/work/src/learner.cc:767: 
Parameters: { "num_boost_round" } are not used.
[0] validation_0-logloss:0.71347    validation_0-rmse:0.29523
Computing true SHAP values:   0%|          | 0/30 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Users/cdalmaso/opt/anaconda3/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3457, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-36-cb665d5c0d88>", line 36, in <module>
    selector.fit(
  File "/Users/cdalmaso/opt/anaconda3/lib/python3.9/site-packages/shapicant/_pandas_selector.py", line 85, in fit
    true_pos_shap_values, true_neg_shap_values = self._get_shap_values(
  File "/Users/cdalmaso/opt/anaconda3/lib/python3.9/site-packages/shapicant/_pandas_selector.py", line 199, in _get_shap_values
    explainer = self.explainer_type(self.estimator, **explainer_type_params or {})
  File "/Users/cdalmaso/opt/anaconda3/lib/python3.9/site-packages/shap/explainers/_tree.py", line 149, in __init__
    self.model = TreeEnsemble(model, self.data, self.data_missing, model_output)
  File "/Users/cdalmaso/opt/anaconda3/lib/python3.9/site-packages/shap/explainers/_tree.py", line 859, in __init__
    xgb_loader = XGBTreeModelLoader(self.original_model)
  File "/Users/cdalmaso/opt/anaconda3/lib/python3.9/site-packages/shap/explainers/_tree.py", line 1431, in __init__
    self.buf = xgb_model.save_raw()
  File "/Users/cdalmaso/opt/anaconda3/lib/python3.9/site-packages/xgboost/core.py", line 2408, in save_raw
    _check_call(
  File "/Users/cdalmaso/opt/anaconda3/lib/python3.9/site-packages/xgboost/core.py", line 279, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [10:01:14] /Users/runner/miniforge3/conda-bld/xgboost-split_1667849614592/work/src/tree/tree_model.cc:869: Check failed: !HasCategoricalSplit(): Please use JSON/UBJSON for saving models with categorical splits.
Stack trace:
  [bt] (0) 1   libxgboost.dylib                    0x000000017fb0ed98 dmlc::LogMessageFatal::~LogMessageFatal() + 124
  [bt] (1) 2   libxgboost.dylib                    0x000000017fccca40 xgboost::RegTree::Save(dmlc::Stream*) const + 1184
  [bt] (2) 3   libxgboost.dylib                    0x000000017fc102a4 xgboost::gbm::GBTreeModel::Save(dmlc::Stream*) const + 312
  [bt] (3) 4   libxgboost.dylib                    0x000000017fc1b390 xgboost::LearnerIO::SaveModel(dmlc::Stream*) const + 1224
  [bt] (4) 5   libxgboost.dylib                    0x000000017fb2eb2c XGBoosterSaveModelToBuffer + 788
  [bt] (5) 6   libffi.8.dylib                      0x00000001019e804c ffi_call_SYSV + 76
  [bt] (6) 7   libffi.8.dylib                      0x00000001019e57d4 ffi_call_int + 1336
  [bt] (7) 8   _ctypes.cpython-39-darwin.so        0x0000000101c8c544 _ctypes_callproc + 1324
  [bt] (8) 9   _ctypes.cpython-39-darwin.so        0x0000000101c86850 PyCFuncPtr_call + 1176

I am running xgboost==1.7.1 and shapicant==0.4.0
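
The traceback ends in `save_raw()`, so the failure can be checked without shapicant or SHAP in the loop. The following is a minimal sketch, assuming xgboost >= 1.6, where `Booster.save_raw` accepts a `raw_format` argument; `probe_raw_formats` is a hypothetical helper name, not part of any library. It only reports which serialization formats succeed for a given booster.

```python
def probe_raw_formats(booster, formats=("deprecated", "json", "ubj")):
    """Try booster.save_raw() with each format and report the outcome.

    `booster` is expected to be an xgboost Booster (assumption: xgboost >= 1.6,
    where save_raw takes a raw_format argument). Returns a dict mapping each
    format name to "ok" or to "failed: <first line of the error message>".
    """
    results = {}
    for fmt in formats:
        try:
            booster.save_raw(raw_format=fmt)
            results[fmt] = "ok"
        except Exception as exc:  # XGBoostError or ValueError, depending on version
            lines = str(exc).splitlines()
            results[fmt] = "failed: " + (lines[0] if lines else type(exc).__name__)
    return results


# Usage with the model from the example above (not executed here):
#   print(probe_raw_formats(model.get_booster()))
```

If the old binary format fails while `json` and `ubj` succeed, that would be consistent with the error message above, which asks for JSON/UBJSON when the model contains categorical splits.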

manuel-calzolari commented 1 year ago

This is due to a problem with the SHAP package, see https://github.com/slundberg/shap/issues/2662

It would be better to fix this in SHAP; otherwise I would have to develop a workaround in shapicant.
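
Until the problem is fixed upstream, one possible user-side stopgap is to avoid categorical splits altogether by replacing `category` columns with their integer codes before fitting. This is only a sketch and is not the workaround referred to above: `encode_categoricals_as_codes` is a hypothetical helper, and the approach changes the model, because XGBoost will then treat the codes as ordered numeric values rather than as categories.

```python
import pandas as pd


def encode_categoricals_as_codes(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with every 'category' column replaced by its integer codes.

    Missing values become -1 (pandas' convention for cat.codes).
    The input frame is left unchanged.
    """
    out = df.copy()
    for col in out.select_dtypes(include="category").columns:
        out[col] = out[col].cat.codes
    return out


# Usage with the frames from the example above (not executed here):
#   X_train_enc = encode_categoricals_as_codes(X_train)
#   X_test_enc = encode_categoricals_as_codes(X_test)
# The model would then be fitted without "enable_categorical".
```

The codes are only comparable between training and validation data if both frames share the same category definitions, which holds in the example above because `X_test` is a copy of `X_train`.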