scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Error using CalibratedClassifierCV with ensemble=False using RepeatedStratifiedKFold CV #19150

Closed odedbd closed 3 years ago

odedbd commented 3 years ago

Describe the bug

I have a script that optimizes the parameters of a classifier (HGBT) wrapped in CalibratedClassifierCV, using RepeatedStratifiedKFold cross-validation. This works fine, but when I tried the new ensemble=False option, I got the error below.

It works fine with StratifiedKFold, so I guess the splits produced by repeated CV are not considered a partition, since each sample appears multiple times? Can this be supported? I can change my code to not use repeated k-fold for the calibration, but for small datasets it may be useful to be able to do so.

Steps/Code to Reproduce

from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
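# On scikit-learn < 1.0 (including 0.24.0), HGBT is experimental and must be enabled first.
from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
from sklearn.ensemble import HistGradientBoostingClassifier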
from sklearn.calibration import CalibratedClassifierCV

X, y = make_classification(n_samples=100, n_features=2,
                           n_redundant=0, random_state=42)
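base_clf = HistGradientBoostingClassifier()
# Default n_splits=5 with 2 repeats: every sample lands in two test folds.
cv = RepeatedStratifiedKFold(n_repeats=2)
calibrated_clf = CalibratedClassifierCV(base_estimator=base_clf, cv=cv, ensemble=False)

# Raises ValueError: cross_val_predict only works for partitions
calibrated_clf.fit(X, y)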

len(calibrated_clf.calibrated_classifiers_)

calibrated_clf.predict_proba(X)[:5, :]

Expected Results

No error is thrown.

Actual Results

Traceback (most recent call last):
  File "C:\Users\foo\Miniconda3\envs\bar\lib\site-packages\IPython\core\interactiveshell.py", line 3343, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 4, in
    calibrated_clf.fit(X, y)
  File "C:\Users\foo\Miniconda3\envs\bar\lib\site-packages\sklearn\calibration.py", line 325, in fit
    predictions = _compute_predictions(pred_method, X, n_classes)
  File "C:\Users\foo\Miniconda3\envs\bar\lib\site-packages\sklearn\calibration.py", line 501, in _compute_predictions
    predictions = pred_method(X=X)
  File "C:\Users\foo\Miniconda3\envs\bar\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "C:\Users\foo\Miniconda3\envs\bar\lib\site-packages\sklearn\model_selection\_validation.py", line 845, in cross_val_predict
    raise ValueError('cross_val_predict only works for partitions')
ValueError: cross_val_predict only works for partitions

Versions

System:
    python: 3.6.10 |Anaconda, Inc.| (default, May 7 2020, 19:46:08) [MSC v.1916 64 bit (AMD64)]
    executable: C:\Users\foo\Miniconda3\envs\bar\python.exe
    machine: Windows-10-10.0.19041-SP0

Python dependencies:
    pip: 20.2.2
    setuptools: 49.6.0.post20200814
    sklearn: 0.24.0
    numpy: 1.19.2
    scipy: 1.5.2
    Cython: None
    pandas: 1.1.3
    matplotlib: 3.3.2
    joblib: 0.16.0
    threadpoolctl: 2.1.0

Built with OpenMP: True

NicolasHug commented 3 years ago

I guess the splits produced by repeated CV are not considered a partition, since each sample appears multiple times?

Yup, you got it right. Since we use cross_val_predict, each sample must appear exactly once across the test sets for the splits to form a partition. It would not make sense to have zero or more than one prediction for a given sample. If you have too few samples to get reliable estimates with ensemble=False, I'd suggest just using ensemble=True.
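A minimal sketch of the suggested workaround, keeping the repeated CV and switching to ensemble=True. Two assumptions differ from the original report: LogisticRegression stands in for the HGBT classifier so the example is quick and does not depend on the scikit-learn version, and the estimator is passed positionally because the keyword name has changed between releases. `n_splits=5` and `random_state=0` are also assumed values.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

X, y = make_classification(n_samples=100, n_features=2,
                           n_redundant=0, random_state=42)

# 5 splits x 2 repeats = 10 train/test splits in total.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)

# ensemble=True fits one (classifier, calibrator) pair per split and
# averages their predictions, so the splits need not form a partition.
calibrated_clf = CalibratedClassifierCV(LogisticRegression(), cv=cv, ensemble=True)
calibrated_clf.fit(X, y)

n_models = len(calibrated_clf.calibrated_classifiers_)
proba = calibrated_clf.predict_proba(X)

print("fitted calibrated classifiers:", n_models)
print(np.round(proba[:5, :], 3))
```

The trade-off is cost: prediction averages over one fitted pair per split, whereas ensemble=False keeps a single classifier and a single calibrator.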

odedbd commented 3 years ago

Ok, got it, thanks! I'm closing the issue.