guillermo-navas-palencia / optbinning

Optimal binning: monotonic binning with constraints. Supports batch and stream optimal binning, scorecard modelling and counterfactual explanations.
http://gnpalencia.org/optbinning/
Apache License 2.0

BinningProcess in Pipeline and cross-validation (GridSearchCV) #282

Closed: max-franceschi closed this issue 7 months ago

max-franceschi commented 7 months ago

Hello, I cannot find a clear answer on how BinningProcess should behave when used with GridSearchCV, but it seems that the way it currently works is not right.

Consider a pipeline with just a BinningProcess step and a regressor step. With a simple train_test_split, only the train set is used in the binning process, which is fine. However, when the same pipeline is used within GridSearchCV, the binning process appears to run on the whole dataset, not only on the train set.

In my understanding, to avoid contamination of the training data by the test data, the binning should be fitted on the training set only; otherwise out-of-sample data would leak into the binning tables and hence into the transformed training set. So, in cross-validation, shouldn't the binning process be re-run on each fold?

Below is a reproducible example to highlight the results I am pointing at.

Am I correct in my understanding of how binning should work in CV? Is there an issue or am I using BinningProcess and GridSearchCV the wrong way?
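For illustration, this is the per-fold behaviour I would expect, as a minimal sketch (it reuses the ridger pipeline and the X, y arrays defined in the example below; clone comes from scikit-learn and returns a fresh, unfitted copy of the pipeline):

from sklearn.base import clone
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X):
    # a fresh, unfitted copy of the whole pipeline for this fold only
    fold_pipeline = clone(ridger)
    # binning tables are learned from the training fold...
    fold_pipeline.fit(X[train_idx], y[train_idx])
    # ...and the held-out fold is only transformed and scored, never used to fit
    fold_score = fold_pipeline.score(X[test_idx], y[test_idx])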

Consider a simple pipeline on California housing data:

from sklearn.datasets import fetch_california_housing
from optbinning import BinningProcess
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

# Get data
data = fetch_california_housing()
variable_names = data.feature_names
X = data.data
y = data.target

# Create pipeline
binning_process = BinningProcess(variable_names)

ridger = Pipeline(steps=[('binning_process', binning_process),
                         ('regressor', Ridge())])

When simply using train_test_split:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
ridger.fit(X_train, y_train)
print(binning_process.information(print_level=1))
print("X train shape is", X_train.shape)
print("X shape is ", X.shape)

The Number of records reported by the binning process is 16512, which matches the size of the training set.

optbinning (Version 0.17.3)
Copyright (c) 2019-2022 Guillermo Navas-Palencia, Apache License 2.0

  Statistics
  Number of records                  16512
  Number of variables                    8
  Target type                   continuous

  Number of numerical                    8
  Number of categorical                  0
  Number of selected                     8

  Time                                1.6230 sec

None
X train shape is (16512, 8)
X shape is  (20640, 8)
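For completeness, a small check reusing the objects fitted above: the held-out split is only passed through the binning learned on the train split; it is not used to refit it.

# test data is only transformed with the already-fitted binning tables
X_test_binned = binning_process.transform(X_test)
print(X_test_binned.shape)  # expected (4128, 8): transformed, not refitted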

However, when using GridSearchCV:

# Grid search CV
param_grid = {
    'regressor__alpha': [0.8, 1]
}

search = GridSearchCV(estimator=ridger,
                      param_grid=param_grid,
                      scoring="neg_root_mean_squared_error",
                      n_jobs=-1,
                      refit=True,  # refit an estimator using the best found parameters on the whole dataset
                      cv=5)        # an integer uses KFold here (regression); a splitter such as TimeSeriesSplit could be passed instead
search.fit(X, y)

search.best_estimator_.named_steps['binning_process'].information(print_level=1)

The Number of records is now 20640, i.e. the size of the whole X dataset, not only a subsample:

optbinning (Version 0.17.3)
Copyright (c) 2019-2022 Guillermo Navas-Palencia, Apache License 2.0

  Statistics
  Number of records                  20640
  Number of variables                    8
  Target type                   continuous

  Number of numerical                    8
  Number of categorical                  0
  Number of selected                     8

  Time                                2.1036 sec

guillermo-navas-palencia commented 7 months ago

I think the documentation of GridSearchCV is clear about the behavior with refit=True.
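Roughly, refit=True means that after cross-validation selects the best parameters, a fresh clone of the pipeline is fitted on the whole dataset, and that refitted clone is what best_estimator_ exposes. A sketch of the equivalent steps (not the actual scikit-learn internals), reusing ridger, search, X and y from above:

from sklearn.base import clone

# equivalent to what best_estimator_ holds after refit=True: the pipeline,
# with the best parameters found by the search, refitted on all of X and y,
# which is why its binning process reports all 20640 records
refitted = clone(ridger).set_params(**search.best_params_).fit(X, y)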

max-franceschi commented 7 months ago

Thank you. You're perfectly right! The best estimator is refitted on the whole dataset at the end, my bad, so my example does not show anything wrong. However, could you detail how BinningProcess works during cross-validation, please? Is it re-run on each fold? I could not understand the code well enough to be sure about that.

guillermo-navas-palencia commented 7 months ago

BinningProcess works like any other estimator passed to CV, doesn't it? It is cloned at each iteration.
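One way to see this from the outside, as a sketch reusing the ridger pipeline and X, y from the example above: cross_validate with return_estimator=True keeps the pipeline fitted on each training fold, and each fold's binning process should then report roughly 16512 records rather than 20640.

from sklearn.model_selection import cross_validate

cv_results = cross_validate(ridger, X, y,
                            scoring="neg_root_mean_squared_error",
                            cv=5, return_estimator=True)

# each entry is the pipeline fitted on one training fold only
for fold_pipeline in cv_results["estimator"]:
    fold_pipeline.named_steps["binning_process"].information(print_level=1)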

max-franceschi commented 7 months ago

Thanks for the confirmation.