scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
https://imbalanced-learn.org
MIT License

[BUG] Inconsistent output of imblearn's pipeline #904

Closed tvdboom closed 1 year ago

tvdboom commented 2 years ago

Describe the bug

The output of imblearn's Pipeline is inconsistent between fit_transform(X, y) and fit(X, y).transform(X) (see the example below). This happens because the transform method skips SMOTE (as expected), whereas the fit_transform method applies SMOTE while fitting and returns that same resampled data.

Is this intended, and if so, why? It seems quite confusing for the user. If it is indeed a bug, I think the fix is quite straightforward, although it would make fit_transform slower: you first have to fit the pipeline (which includes all transformations) and then transform everything again, excluding the samplers.

Steps/Code to Reproduce

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
print(X.shape)

s = Pipeline([("smote", SMOTE()), ("scaler", StandardScaler())])
print(s.fit_transform(X, y).shape)
print(s.fit(X, y).transform(X).shape)

Output:
(569, 30)
(714, 30)
(569, 30)

Expected Results

I expected the fit_transform method to return the data without balancing (the same as the transform method does).

Actual Results

Versions

System:
python: 3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)]
executable: C:\Users\Mavs\Documents\Python\pycaret\venv\Scripts\python.exe
machine: Windows-10-10.0.19044-SP0

Python dependencies:
sklearn: 1.1.1
pip: 22.0.4
setuptools: 57.0.0
numpy: 1.21.5
scipy: 1.7.3
Cython: 0.29.28
pandas: 1.4.1
matplotlib: 3.5.2
joblib: 1.1.0
threadpoolctl: 3.0.0

Built with OpenMP: True

threadpoolctl info:
user_api: blas, internal_api: openblas, prefix: libopenblas, filepath: C:\Users\Mavs\Documents\Python\pycaret\venv\Lib\site-packages\numpy.libs\libopenblas.XWYDX2IKJW2NMTWSFYNGFUWKQU3LYTCZ.gfortran-win_amd64.dll, version: 0.3.17, threading_layer: pthreads, architecture: Zen, num_threads: 16
user_api: openmp, internal_api: openmp, prefix: vcomp, filepath: C:\Users\Mavs\Documents\Python\pycaret\venv\Lib\site-packages\sklearn.libs\vcomp140.dll, version: None, num_threads: 16
user_api: blas, internal_api: openblas, prefix: libopenblas, filepath: C:\Users\Mavs\Documents\Python\pycaret\venv\Lib\site-packages\scipy.libs\libopenblas.XWYDX2IKJW2NMTWSFYNGFUWKQU3LYTCZ.gfortran-win_amd64.dll, version: 0.3.17, threading_layer: pthreads, architecture: Zen, num_threads: 16

Windows-10-10.0.19044-SP0
Python 3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)]
NumPy 1.21.5
SciPy 1.7.3
Scikit-Learn 1.1.1
Imbalanced-Learn 0.9.1

haochunchang commented 2 years ago

Hi, after some investigation, I found that the behavior resulted from here: https://github.com/scikit-learn-contrib/imbalanced-learn/blob/6176807c9c5d68126a79771b6c0fce329f632d2f/imblearn/pipeline.py#L303-L311

In pipeline.fit_transform, since StandardScaler has a fit_transform attribute, fit_transform is called with data already resampled by SMOTE (Xt). In pipeline.fit().transform(), by contrast, the transform method iterates through the steps of the pipeline, filtering out steps that have a fit_resample attribute (which SMOTE does). https://github.com/scikit-learn-contrib/imbalanced-learn/blob/6176807c9c5d68126a79771b6c0fce329f632d2f/imblearn/pipeline.py#L181-L182

I think that if we want to skip samplers during fit_transform, we can let transform handle the skipping after fit. fit_transform might then be no slower than fit().transform(), because under the hood they would be doing the same thing.
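A minimal sketch of that idea, using a toy sampler in place of SMOTE and a hand-rolled pipeline loop (the names DoubleSampler and fit_transform_consistent are illustrative, not part of imblearn):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


class DoubleSampler:
    """Toy stand-in for SMOTE: duplicates every row in fit_resample."""

    def fit_resample(self, X, y):
        return np.vstack([X, X]), np.concatenate([y, y])


def fit_transform_consistent(steps, X, y):
    """Fit every step (samplers included), then re-transform the ORIGINAL
    X through the non-sampler steps, matching fit(...).transform(X)."""
    Xt, yt = X, y
    for step in steps:
        if hasattr(step, "fit_resample"):
            # sampler: resample the data used to fit the downstream steps
            Xt, yt = step.fit_resample(Xt, yt)
        else:
            step.fit(Xt, yt)
    # second pass: transform the original X, skipping samplers
    Xout = X
    for step in steps:
        if not hasattr(step, "fit_resample"):
            Xout = step.transform(Xout)
    return Xout
```

With 10 input rows, the returned array keeps 10 rows even though the scaler was fitted on 20 resampled rows, which is the consistency the issue asks for.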

I can open a PR to address this if this is indeed a bug.

tvdboom commented 2 years ago

@haochunchang The PR indeed solves the problem. Let's hope it gets merged soon

glemaitre commented 1 year ago

I think resampling is a case where the semantics are ambiguous compared to the usual scikit-learn conventions.

The fit_resample semantic is to apply resampling only during the fit stage, not during the transform or predict phase. Therefore, applying fit_resample when calling fit_transform on the pipeline makes sense.

When requesting transform, we expect fit_resample not to be called, since we are in the inference/decision phase. When calling fit().transform(), fit_resample is only called during the fit() stage.
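A small illustration of that contract, with toy classes standing in for SMOTE and a transformer (all names here are hypothetical, and the two loops merely mimic the fit and transform code paths described above):

```python
import numpy as np


class RowDoubler:
    """Toy sampler: doubles the rows, but only via fit_resample."""

    def fit_resample(self, X, y):
        return np.vstack([X, X]), np.concatenate([y, y])


class AddOne:
    """Toy transformer: stateless, adds 1 in transform."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X + 1


def pipeline_fit(steps, X, y):
    """Fit stage: samplers ARE applied, so later steps see resampled data."""
    Xt, yt = X, y
    for step in steps:
        if hasattr(step, "fit_resample"):
            Xt, yt = step.fit_resample(Xt, yt)
        else:
            step.fit(Xt, yt)
    return Xt  # the (resampled) data the last step was fitted on


def pipeline_transform(steps, X):
    """Inference stage: samplers are skipped entirely."""
    for step in steps:
        if not hasattr(step, "fit_resample"):
            X = step.transform(X)
    return X
```

Under these semantics the fit path sees twice as many rows as the transform path, which is exactly the (569, 30) vs (714, 30) discrepancy in the original report.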

This surprising API is one of the reasons why we never adopted samplers in scikit-learn: it breaks the contract fit_transform == fit().transform(). Some discussion happened here: https://github.com/scikit-learn/enhancement_proposals/issues/12

In the end, I would not consider it a bug, but we could improve the documentation to make this behavior obvious.