max-franceschi closed this issue 12 months ago
I think the GridSearchCV documentation is clear about the behavior with refit=True: once the search is done, the best estimator is refit on the whole dataset.
Thank you. You're perfectly right! The best estimator is refit on the whole dataset at the end, my bad. So my "case" does not show anything. However, could you explain how BinningProcess works during CV, please? Is it rerun on each fold? I could not follow the code well enough to be sure about that.
The BinningProcess works like any other estimator passed to CV, doesn't it? It is cloned at each iteration and fitted on that iteration's training split.
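For anyone landing here later, this is a minimal sketch of that behavior using scikit-learn only. `RowCounter` and `FIT_SIZES` are hypothetical names made up for the illustration; `RowCounter` stands in for BinningProcess and only records how many rows each `fit` call sees. The data is synthetic (1000 rows), not the California housing set.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Module-level log so the sizes survive the cloning done by GridSearchCV.
# This works because the default n_jobs runs every fit in this process.
FIT_SIZES = []


class RowCounter(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for a fitted preprocessing step."""

    def fit(self, X, y=None):
        self.n_records_ = X.shape[0]  # rows seen by this particular fit
        FIT_SIZES.append(self.n_records_)
        return self

    def transform(self, X):
        return X  # pass-through: only the bookkeeping matters here


# Synthetic regression data (an assumption, chosen to keep the sketch fast).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=1000)

pipe = Pipeline([("counter", RowCounter()), ("model", Ridge())])
search = GridSearchCV(pipe, {"model__alpha": [0.1, 1.0]}, cv=5)  # refit=True by default
search.fit(X, y)

# 2 candidates x 5 folds = 10 fits on 800 rows, then one refit on all 1000 rows.
print(FIT_SIZES)
print(search.best_estimator_.named_steps["counter"].n_records_)
```

The step inside `best_estimator_` is the one from the final refit, so its record count is the full dataset size, while every fit during the search itself only saw a training fold.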
Thanks for the confirmation.
Hello, I cannot find a clear answer about how BinningProcess should behave when used with GridSearchCV, but the current behavior looks wrong to me.
Considering a pipeline with just a BinningProcess step and a Regressor step, in the case of a simple train_test_split, only the train set is used in the binning process, which is fine. However, when using that same pipeline within a GridSearchCV, it seems that the binning process runs on the whole dataset, not only on the train set.
In my understanding, to avoid "contaminating" the training data with test data, the binning should be fitted on the training set only; otherwise out-of-sample data influences the bins and therefore the features the model is trained on. So, in cross-validation, shouldn't the binning process be re-run on each fold?
Below is a reproducible example to highlight the results I am pointing at.
Am I correct in my understanding of how binning should work in CV? Is there an issue or am I using BinningProcess and GridSearchCV the wrong way?
Consider a simple pipeline on California housing data:
When simply using train_test_split:
The "Number of records" reported in the binning process information is 16512, which matches the size of the training set.
However, when using GridSearchCV:
The "Number of records" is now 20640, i.e. the size of the whole X dataset rather than a training subsample.
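One way to check what each fold actually sees, rather than what the final refit sees, is to keep the per-fold fitted pipelines with `cross_validate(..., return_estimator=True)`. The sketch below is an illustration under assumptions, not the original example: `RowCounter` is a hypothetical stand-in for BinningProcess, the data is synthetic with the same number of rows as X (20640), and the choice of 5 folds is an assumption.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline


class RowCounter(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for BinningProcess: records rows seen in fit."""

    def fit(self, X, y=None):
        self.n_records_ = X.shape[0]
        return self

    def transform(self, X):
        return X  # pass-through


# Synthetic data with the same row count as the full dataset in the question.
rng = np.random.default_rng(0)
X = rng.normal(size=(20640, 8))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=20640)

pipe = Pipeline([("counter", RowCounter()), ("model", LinearRegression())])

# return_estimator=True keeps the pipeline fitted on each training fold.
results = cross_validate(pipe, X, y, cv=5, return_estimator=True)

fold_sizes = [est.named_steps["counter"].n_records_ for est in results["estimator"]]
print(fold_sizes)
```

With 5 folds each training split holds 4/5 of the rows, so every per-fold fit sees 16512 records here. The 20640 figure only appears on the estimator that GridSearchCV refits on all the data after the search.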