predict-idlab / powershap

A power-full Shapley feature selection method.

Error when handling categoricals with LightGBM #23

Closed deepandas11 closed 2 years ago

deepandas11 commented 2 years ago

I'm facing a weird error when using a LightGBM model as the underlying model with the selector. I could find a simple repro using the titanic dataset:

X - bug_features.csv y - bug_label.csv

Categorical features and Data types info

cats --> ['Cabin', 'Embarked', 'Gender', 'Name', 'Parch', 'Pclass', 'SibSp', 'Ticket'],

X.dtypes -->
 Age             float64
 Cabin          category
 Embarked       category
 Fare            float64
 Gender         category
 Name           category
 Parch          category
 PassengerId       int64
 Pclass         category
 SibSp          category
 Ticket         category

Code snippet used to fit the selector:

from lightgbm import LGBMClassifier
from powershap import PowerShap

lgb_clf = LGBMClassifier(random_state=42, n_estimators=10)
sel = PowerShap(model=lgb_clf, stratify=True, fit_kwargs={'categorical_feature':cats})
selector = PowerShap(
    model=lgb_clf,
    stratify=True,
    verbose=False,
    automatic=True
)
selector.fit(X, y)
Traceback
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [19], in ()
      5 _stratify = True
      6 selector = PowerShap(
      7     model=lgb_clf,
      8     stratify=_stratify,
      9     verbose=False,
     10     automatic=True
     11 )
---> 12 selector.fit(X, y, categorical_feature=cats)

File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/powershap/powershap.py:393, in PowerShap.fit(self, X, y, stratify, groups, **kwargs)
    387     loop_its = 10
    388     self._print(
    389         "Automatic mode enabled: Finding the minimal required powershap",
    390         f"iterations for significance of {self.power_alpha}.",
    391     )
--> 393 shaps_df = self._explainer.explain(
    394     X=X,
    395     y=y,
    396     loop_its=loop_its,
    397     val_size=self.val_size,
    398     stratify=stratify,
    399     groups=groups,
    400     cv_split=self.cv,  # pass the wrapped cv split function
    401     show_progress=self.show_progress,
    402     **kwargs,
    403 )
    405 processed_shaps_df = powerSHAP_statistical_analysis(
    406     shaps_df,
    407     self.power_alpha,
    408     self.power_req_iterations,
    409     include_all=self.include_all,
    410 )
    412 if self.automatic:

File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/powershap/shap_wrappers/shap_explainer.py:175, in ShapExplainer.explain(self, X, y, loop_its, val_size, stratify, groups, cv_split, random_seed_start, show_progress, **kwargs)
    172 Y_train = y[np.sort(train_idx)]
    173 Y_val = y[np.sort(val_idx)]
--> 175 Shap_values = self._fit_get_shap(
    176     X_train=X_train,
    177     Y_train=Y_train,
    178     X_val=X_val,
    179     Y_val=Y_val,
    180     random_seed=i + random_seed_start,
    181     **kwargs,
    182 )
    184 Shap_values = np.abs(Shap_values)
    186 if len(np.shape(Shap_values)) > 2:

File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/powershap/shap_wrappers/shap_explainer.py:251, in LGBMExplainer._fit_get_shap(self, X_train, Y_train, X_val, Y_val, random_seed, **kwargs)
    248 from copy import copy
    250 PowerShap_model = copy(self.model).set_params(random_seed=random_seed)
--> 251 PowerShap_model.fit(X_train, Y_train, eval_set=(X_val, Y_val))
    252 # Calculate the shap values
    253 C_explainer = shap.TreeExplainer(PowerShap_model)

File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/lightgbm/sklearn.py:967, in LGBMClassifier.fit(self, X, y, sample_weight, init_score, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks, init_model)
    964     else:
    965         valid_sets[i] = (valid_x, self._le.transform(valid_y))
--> 967 super().fit(X, _y, sample_weight=sample_weight, init_score=init_score, eval_set=valid_sets,
    968             eval_names=eval_names, eval_sample_weight=eval_sample_weight,
    969             eval_class_weight=eval_class_weight, eval_init_score=eval_init_score,
    970             eval_metric=eval_metric, early_stopping_rounds=early_stopping_rounds,
    971             verbose=verbose, feature_name=feature_name, categorical_feature=categorical_feature,
    972             callbacks=callbacks, init_model=init_model)
    973 return self

File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/lightgbm/sklearn.py:748, in LGBMModel.fit(self, X, y, sample_weight, init_score, group, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_group, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks, init_model)
    745     evals_result = {}
    746     callbacks.append(record_evaluation(evals_result))
--> 748 self._Booster = train(
    749     params=params,
    750     train_set=train_set,
    751     num_boost_round=self.n_estimators,
    752     valid_sets=valid_sets,
    753     valid_names=eval_names,
    754     fobj=self._fobj,
    755     feval=eval_metrics_callable,
    756     init_model=init_model,
    757     feature_name=feature_name,
    758     callbacks=callbacks
    759 )
    761 if evals_result:
    762     self._evals_result = evals_result

File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/lightgbm/engine.py:271, in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
    269 # construct booster
    270 try:
--> 271     booster = Booster(params=params, train_set=train_set)
    272     if is_valid_contain_train:
    273         booster.set_train_data_name(train_data_name)

File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/lightgbm/basic.py:2605, in Booster.__init__(self, params, train_set, model_file, model_str, silent)
   2598     self.set_network(
   2599         machines=machines,
   2600         local_listen_port=params["local_listen_port"],
   2601         listen_time_out=params.get("time_out", 120),
   2602         num_machines=params["num_machines"]
   2603     )
   2604 # construct booster object
-> 2605 train_set.construct()
   2606 # copy the parameters from train_set
   2607 params.update(train_set.get_params())

File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/lightgbm/basic.py:1815, in Dataset.construct(self)
   1812         self._set_init_score_by_predictor(self._predictor, self.data, used_indices)
   1813 else:
   1814     # create train
-> 1815     self._lazy_init(self.data, label=self.label,
   1816                     weight=self.weight, group=self.group,
   1817                     init_score=self.init_score, predictor=self._predictor,
   1818                     silent=self.silent, feature_name=self.feature_name,
   1819                     categorical_feature=self.categorical_feature, params=self.params)
   1820 if self.free_raw_data:
   1821     self.data = None

File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/lightgbm/basic.py:1474, in Dataset._lazy_init(self, data, label, reference, weight, group, init_score, predictor, silent, feature_name, categorical_feature, params)
   1472     self.pandas_categorical = reference.pandas_categorical
   1473     categorical_feature = reference.categorical_feature
-> 1474 data, feature_name, categorical_feature, self.pandas_categorical = _data_from_pandas(data,
   1475                                                                                      feature_name,
   1476                                                                                      categorical_feature,
   1477                                                                                      self.pandas_categorical)
   1478 label = _label_from_pandas(label)
   1480 # process for args

File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/lightgbm/basic.py:594, in _data_from_pandas(data, feature_name, categorical_feature, pandas_categorical)
    592 if bad_indices:
    593     bad_index_cols_str = ', '.join(data.columns[bad_indices])
--> 594     raise ValueError("DataFrame.dtypes for data must be int, float or bool.\n"
    595                      "Did not expect the data types in the following fields: "
    596                      f"{bad_index_cols_str}")
    597 data = data.values
    598 if data.dtype != np.float32 and data.dtype != np.float64:

ValueError: DataFrame.dtypes for data must be int, float or bool.
Did not expect the data types in the following fields: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
```

Let me know if I am using the API incorrectly or missing an argument. I also tried passing the list of categorical features into the fit call as a kwarg, but that didn't help either.

Library details:

Name: powershap
Version: 0.0.8rc2

Name: lightgbm
Version: 3.3.2

Name: pandas
Version: 1.4.2

Name: scikit-learn
Version: 1.1.1

Name: numpy
Version: 1.21.6
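
A side note on the error message: it lists the fields as 0 through 10 rather than by column name, and it flags the float and int columns as well as the categorical ones. One possible reading, which is an assumption and not something confirmed in this thread, is that the DataFrame is converted to a NumPy array and wrapped in a new DataFrame somewhere before it reaches LightGBM. The sketch below only demonstrates that pandas behaviour on a toy frame with made-up values; it does not call PowerShap or LightGBM.

```python
import pandas as pd

# Toy frame mixing a float column with a category column (values are made up).
X = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Embarked": pd.Categorical(["S", "C", "S"]),
})

# Mixed dtypes have no common numeric type, so the array falls back to object.
arr = X.to_numpy()

# Rebuilding a DataFrame from the bare array loses both names and dtypes.
rebuilt = pd.DataFrame(arr)

print(arr.dtype)              # object
print(list(rebuilt.columns))  # [0, 1]
print(rebuilt.dtypes)         # every column is object
```

If something like this happens internally, every column would arrive as object dtype with integer labels, which would match the shape of the message above.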
JarneVerhaeghe commented 2 years ago

Thank you for the issue. I have looked at your code and tested it, and it appears this error comes from LightGBM's argument requirements. According to the LightGBM documentation for the categorical_feature argument, the model only accepts categorical features in an int format when using pandas.

Using the following snippet to convert all string categories to ints for preprocessing will solve your issue!

cats = ['Cabin', 'Embarked', 'Gender', 'Name', 'Parch', 'Pclass', 'SibSp', 'Ticket']
for col in cats:
    X[col] = X[col].factorize()[0]

The PowerShap API is used correctly in your snippet!
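
To make the effect of the factorize workaround concrete, here is a minimal sketch on a toy frame (the values are made up, and only pandas is used). It shows that factorize assigns integer codes in order of first appearance and marks missing values with -1, which may matter if a column such as Cabin has many missing entries.

```python
import pandas as pd

# Toy frame with one category column that includes a missing value.
X = pd.DataFrame({
    "Embarked": pd.Categorical(["S", "C", "S", None]),
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# factorize returns (codes, uniques); codes follow order of first appearance.
codes, uniques = X["Embarked"].factorize()
X["Embarked"] = codes

print(codes.tolist())         # [0, 1, 0, -1]
print(list(uniques))          # ['S', 'C']
print(X["Embarked"].dtype)    # an integer dtype
```

Note that after this step the column is a plain integer column, so the original labels are only recoverable through `uniques`.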

deepandas11 commented 2 years ago

@JarneVerhaeghe thanks for looking into this. However, I may have an alternate reading of the LightGBM documentation. If the columns in the dataframe are of CategoricalDtype() and are nominal, one could refer to the cat_features by column name. For example, simply running the following lines verifies that:

lgb_cl = LGBMClassifier(random_state=42, n_estimators=10, cat_features=cats)
lgb_cl.fit(X, y)
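
As background for the point above, a pandas category column already carries an integer code for each value alongside its labels. The sketch below uses only pandas on made-up values; whether LightGBM or PowerShap relies on these codes on every code path is an assumption here, not something this thread establishes.

```python
import pandas as pd

# A nominal column stored with the category dtype (values are made up).
s = pd.Series(["S", "C", "S"], dtype="category")

# Categories inferred from the data are sorted; codes index into them.
print(s.cat.categories.tolist())  # ['C', 'S']
print(s.cat.codes.tolist())       # [1, 0, 1]
```

Unlike factorize, the column keeps its labels, and the codes follow the order of `s.cat.categories` rather than the order of first appearance.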