ing-bank / probatus

Validation (like Recursive Feature Elimination for SHAP) of (multiclass) classifiers & regressors and data used to develop them.
https://ing-bank.github.io/probatus
MIT License

Not able to use ShapRFECV with BayesSearchCV #110

Closed · gengbo-genentech closed 3 years ago

gengbo-genentech commented 3 years ago

I am trying to use ShapRFECV with BayesSearchCV, as in the code below.

import matplotlib.pyplot as plt
from probatus.feature_elimination import ShapRFECV
import xgboost as xgb
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from sklearn.model_selection import StratifiedKFold

def shapRFEcv_feature_selection(x, y):
    clf = xgb.XGBClassifier()
    param_grid = {
         'max_depth': Integer(1, 11),
         'learning_rate': Real(0.0001, 0.5, prior='log-uniform'),
         'n_estimators': Integer(50, 5000, prior='uniform'),
         'gamma': Real(0.0001, 5, prior='log-uniform'),
         'min_child_weight': Real(1, 10, prior='log-uniform'),
         'subsample': Real(0.5, 1, prior='uniform'),
         'colsample_bytree': Real(0.5, 1, prior='uniform'),
         'colsample_bylevel': Real(0.5, 1, prior='uniform'),
         'reg_alpha': Real(0.0001, 1, prior='log-uniform'),
         'reg_lambda': Real(1, 10, prior='log-uniform'),
         }
    xgb_search = BayesSearchCV(clf, search_spaces=param_grid, n_iter=32, cv=5, random_state=0, scoring='roc_auc', refit=False)
    shap_elimination = ShapRFECV(xgb_search, step=0.2, cv=StratifiedKFold(5), scoring='roc_auc', n_jobs=3)
    report = shap_elimination.fit_compute(x, y)
    performance_plot = shap_elimination.plot(figsize=(20, 15), dpi=60)
    return shap_elimination
# x and y are pandas DataFrames
shap_elimination = shapRFEcv_feature_selection(x, y)

But I get the following error:

Exception: Model type not yet supported by TreeExplainer: <class 'skopt.searchcv.BayesSearchCV'>
Matgrb commented 3 years ago

Hi @gengbo-genentech! I have tried to reproduce the issue, but it works for me. Please make sure you use probatus 1.7.0, since BayesSearchCV is only supported from that version onwards. If that does not work, please try updating skopt. Let us know if that helped; otherwise we will investigate further.
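
If it helps, here is a quick way to print the installed versions (a minimal sketch; sklearn, skopt, and shap all expose __version__):

    import sklearn, skopt, shap

    print("scikit-learn:", sklearn.__version__)
    print("scikit-optimize:", skopt.__version__)
    print("shap:", shap.__version__)
    # For probatus itself, run `pip show probatus` in a shell.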

gengbo-genentech commented 3 years ago

Hi @Matgrb Thank you for your response! I am wondering which versions of skopt and sklearn you are using.

Matgrb commented 3 years ago

These ones work for me:

scikit-learn                              0.23.2               
scikit-optimize                           0.8.1 
shap                                      0.39.0        

Which versions do you use, and does it work with probatus 1.7.0?

Matgrb commented 3 years ago

If updating probatus to 1.7.0 does not help, then we need to investigate further.

What would be helpful is running the following code:

    clf=xgb.XGBClassifier()
    param_grid = {
         'max_depth': Integer(1, 11),
         'learning_rate': Real(0.0001, 0.5, prior='log-uniform'),
         'n_estimators': Integer(50, 5000, prior='uniform'),
         'gamma': Real(0.0001, 5, prior='log-uniform'),
         'min_child_weight': Real(1, 10, prior='log-uniform'),
         'subsample': Real(0.5, 1, prior='uniform'),
         'colsample_bytree': Real(0.5, 1, prior='uniform'),
         'colsample_bylevel': Real(0.5, 1, prior='uniform'),
         'reg_alpha': Real(0.0001, 1, prior='log-uniform'),
         'reg_lambda': Real(1, 10, prior='log-uniform'),
         }
    xgb_search = BayesSearchCV(clf, search_spaces=param_grid, n_iter=32, cv=5, random_state=0, scoring='roc_auc', refit=False)
    shap_elimination = ShapRFECV(xgb_search, step=0.2, cv=StratifiedKFold(5), scoring='roc_auc', n_jobs=3) 
    print(shap_elimination.search_clf)

The search_clf boolean should be True in this case. It indicates that the provided classifier is wrapped in a SearchCV that performs hyperparameter optimization first. If the output is True, probatus correctly detects that it is a BaseSearchCV, and the bug is probably in the part where we run the optimization:

            # Optimize parameters
            if self.search_clf:
                current_search_clf = clone(self.clf).fit(current_X, self.y)
                current_clf = current_search_clf.estimator.set_params(**current_search_clf.best_params_)
            else:
                current_clf = clone(self.clf)
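
For illustration, the same pattern can be run outside probatus (a sketch; it reuses xgb_search and the x, y DataFrames from the first snippet, and best_params_ is available even with refit=False):

    from sklearn.base import clone

    fitted_search = clone(xgb_search).fit(x, y)  # run the Bayesian search
    current_clf = fitted_search.estimator.set_params(**fitted_search.best_params_)
    # current_clf is a plain, unfitted XGBClassifier carrying the tuned
    # hyperparameters, which shap's TreeExplainer supports once it is fitted.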

If the output is False, the issue must be in the following lines:

        if isinstance(self.clf, BaseSearchCV):
            self.search_clf = True
        else:
            self.search_clf = False
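
As a standalone check, you can run the same test directly (a sketch; note that BaseSearchCV lives in the private module sklearn.model_selection._search in these versions):

    from sklearn.model_selection._search import BaseSearchCV

    print(isinstance(xgb_search, BaseSearchCV))  # expected: True for BayesSearchCV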
Matgrb commented 3 years ago

@gengbo-genentech Did upgrading to probatus 1.7.0 fix the issue?

gengbo-genentech commented 3 years ago

@Matgrb I found that sklearn 0.24.1 is currently not supported by skopt.BayesSearchCV. So I used

sklearn 0.23.2
probatus 1.7.0
skopt 0.8.1

With these versions, the following code works fine for me.

if self.search_clf:
    current_search_clf = clone(self.clf).fit(current_X, self.y)
    current_clf = current_search_clf.estimator.set_params(**current_search_clf.best_params_)

However, shap_elimination.fit_compute() raises the following error:

    1 report = shap_elimination.fit_compute(x, y, check_additivity=False)
/opt/anaconda3/lib/python3.7/site-packages/probatus/feature_elimination/feature_elimination.py in fit_compute(self, X, y, columns_to_keep, column_names, **shap_kwargs)
    613         """
    614 
--> 615         self.fit(X, y, columns_to_keep=columns_to_keep, column_names=column_names, **shap_kwargs)
    616         return self.compute()
    617 

/opt/anaconda3/lib/python3.7/site-packages/probatus/feature_elimination/feature_elimination.py in fit(self, X, y, columns_to_keep, column_names, **shap_kwargs)
    501             # Optimize parameters
    502             if self.search_clf:
--> 503                 current_search_clf = clone(self.clf).fit(current_X, self.y)
    504                 current_clf = current_search_clf.estimator.set_params(**current_search_clf.best_params_)
    505             else:

/opt/anaconda3/lib/python3.7/site-packages/skopt/searchcv.py in fit(self, X, y, groups, callback)
    692                 optim_result = self._step(
    693                     X, y, search_space, optimizer,
--> 694                     groups=groups, n_points=n_points_adjusted
    695                 )
    696                 n_iter -= n_points

/opt/anaconda3/lib/python3.7/site-packages/skopt/searchcv.py in _step(self, X, y, search_space, optimizer, groups, n_points)
    563 
    564         # get parameter values to evaluate
--> 565         params = optimizer.ask(n_points=n_points)
    566 
    567         # convert parameters to python native types

/opt/anaconda3/lib/python3.7/site-packages/skopt/optimizer/optimizer.py in ask(self, n_points, strategy)
    415                 opt._tell(x, (y_lie, t_lie))
    416             else:
--> 417                 opt._tell(x, y_lie)
    418 
    419         self.cache_ = {(n_points, strategy): X}  # cache_ the result

/opt/anaconda3/lib/python3.7/site-packages/skopt/optimizer/optimizer.py in _tell(self, x, y, fit)
    534             with warnings.catch_warnings():
    535                 warnings.simplefilter("ignore")
--> 536                 est.fit(self.space.transform(self.Xi), self.yi)
    537 
    538             if hasattr(self, "next_xs_") and self.acq_func == "gp_hedge":

/opt/anaconda3/lib/python3.7/site-packages/skopt/learning/gaussian_process/gpr.py in fit(self, X, y)
    193                 noise_level=self.noise, noise_level_bounds="fixed"
    194             )
--> 195         super(GaussianProcessRegressor, self).fit(X, y)
    196 
    197         self.noise_ = None

/opt/anaconda3/lib/python3.7/site-packages/sklearn/gaussian_process/_gpr.py in fit(self, X, y)
    232             optima = [(self._constrained_optimization(obj_func,
    233                                                       self.kernel_.theta,
--> 234                                                       self.kernel_.bounds))]
    235 
    236             # Additional runs are performed from log-uniform chosen initial

/opt/anaconda3/lib/python3.7/site-packages/sklearn/gaussian_process/_gpr.py in _constrained_optimization(self, obj_func, initial_theta, bounds)
    502                 obj_func, initial_theta, method="L-BFGS-B", jac=True,
    503                 bounds=bounds)
--> 504             _check_optimize_result("lbfgs", opt_res)
    505             theta_opt, func_min = opt_res.x, opt_res.fun
    506         elif callable(self.optimizer):

/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/optimize.py in _check_optimize_result(solver, result, max_iter, extra_warning_msg)
    241                 "    https://scikit-learn.org/stable/modules/"
    242                 "preprocessing.html"
--> 243             ).format(solver, result.status, result.message.decode("latin1"))
    244             if extra_warning_msg is not None:
    245                 warning_msg += "\n" + extra_warning_msg

AttributeError: 'str' object has no attribute 'decode'
Matgrb commented 3 years ago

I think this is an issue related to the scikit-optimize or sklearn packages. To confirm, try running xgb_search.fit(x, y) before passing it to probatus fit_compute. If the error appears there as well, it means the issue is related to those packages. Could you test whether this works or throws an error?
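
For example (a sketch reusing xgb_search and the x, y DataFrames from the first snippet):

    # Fit the search object directly, outside probatus, to isolate the failure.
    xgb_search.fit(x, y)
    # If the same AttributeError appears here, probatus is not involved and the
    # bug sits in skopt, sklearn, or scipy.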

Alternatively, I checked online for similar issues, and there are two options I see:

  1. It might be related to the scikit-learn version: https://stackoverflow.com/questions/66096883/attributeerror-str-object-has-no-attribute-decode-in-binary-logistic-regres . If you wait for scikit-optimize to support the newest version of sklearn, the issue might be solved.
  2. It might be the scipy version: https://github.com/scikit-optimize/scikit-optimize/issues/981 . The sketch below shows the failure mode.
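
To make the failure concrete, the traceback above boils down to this (a sketch; in newer scipy the L-BFGS-B result message is a str, which the .decode() call in older sklearn cannot handle):

    # Under scipy <= 1.5.x this message is bytes, so .decode() succeeds.
    msg = "CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH"
    msg.decode("latin1")  # AttributeError: 'str' object has no attribute 'decode'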

Also, if this issue is related to sklearn or scikit-optimize, you can:

  1. Open an issue there.
  2. Wait for a new release of skopt.
  3. For now, use RandomizedSearchCV in probatus instead (see the sketch after this list).
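
A minimal sketch of option 3, assuming the same x and y DataFrames as above (scipy.stats distributions stand in for the skopt dimensions):

    from scipy.stats import loguniform, randint, uniform
    from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
    from probatus.feature_elimination import ShapRFECV
    import xgboost as xgb

    param_distributions = {
        'max_depth': randint(1, 12),        # upper bound is exclusive
        'learning_rate': loguniform(0.0001, 0.5),
        'n_estimators': randint(50, 5001),
        'subsample': uniform(0.5, 0.5),     # uniform(loc, scale) -> [0.5, 1.0]
        'colsample_bytree': uniform(0.5, 0.5),
    }
    xgb_search = RandomizedSearchCV(xgb.XGBClassifier(), param_distributions,
                                    n_iter=32, cv=5, random_state=0,
                                    scoring='roc_auc', refit=False)
    shap_elimination = ShapRFECV(xgb_search, step=0.2, cv=StratifiedKFold(5),
                                 scoring='roc_auc', n_jobs=3)
    report = shap_elimination.fit_compute(x, y)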
gengbo-genentech commented 3 years ago

Hi @Matgrb The AttributeError: 'str' object has no attribute 'decode' bug is solved when I downgrade scipy to 1.5.3. Thank you so much for your help!