Hi, after some investigation, I found that the behavior comes from here: https://github.com/scikit-learn-contrib/imbalanced-learn/blob/6176807c9c5d68126a79771b6c0fce329f632d2f/imblearn/pipeline.py#L303-L311

In `pipeline.fit_transform`, since `StandardScaler` has a `fit_transform` attribute, it calls `fit_transform` on the data already resampled by `SMOTE` (`Xt`).
In `pipeline.fit().transform()`, on the other hand, the `transform` method iterates through the steps of the pipeline and filters out the steps that have a `fit_resample` attribute (which `SMOTE` has):
https://github.com/scikit-learn-contrib/imbalanced-learn/blob/6176807c9c5d68126a79771b6c0fce329f632d2f/imblearn/pipeline.py#L181-L182
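To make the difference concrete, here is a simplified sketch of the two code paths; it paraphrases the logic linked above and is not the actual imblearn source:

```python
# Simplified sketch, not the actual imblearn source. `steps` is a fitted
# list of (name, estimator) pairs, as in a Pipeline.

def pipeline_fit_transform(steps, X, y):
    """Roughly what Pipeline.fit_transform does: samplers run while fitting,
    and the final transformer's fit_transform is applied to the resampled data."""
    Xt, yt = X, y
    for _, step in steps[:-1]:
        if hasattr(step, "fit_resample"):
            Xt, yt = step.fit_resample(Xt, yt)   # changes the number of rows
        else:
            Xt = step.fit_transform(Xt, yt)
    _, last = steps[-1]
    return last.fit_transform(Xt, yt)            # returns the resampled rows


def pipeline_transform(steps, X):
    """Roughly what Pipeline.transform does: steps with fit_resample are
    skipped, so the output keeps one row per row of X."""
    Xt = X
    for _, step in steps:
        if hasattr(step, "fit_resample"):
            continue                             # samplers are not applied
        Xt = step.transform(Xt)
    return Xt
```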
I think if we want to skip samplers during `fit_transform`, we can let `transform` handle the skipping after the fit. Then `fit_transform` might not be slower than `fit().transform()` because, under the hood, they would be doing the same thing.
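A rough sketch of that idea (a hypothetical helper, not the actual PR):

```python
# Hypothetical sketch of the proposed behavior (not the actual PR):
# fit the whole pipeline (samplers included), then transform the original X
# with samplers skipped, so fit_transform matches fit().transform().
def fit_transform_skipping_samplers(pipe, X, y=None, **fit_params):
    return pipe.fit(X, y, **fit_params).transform(X)
```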
I can open a PR to address this if this is indeed a bug.
@haochunchang The PR indeed solves the problem. Let's hope it gets merged soon.
I think this is a case where resampling is ambiguous compared to the usual workflow. The semantics of `fit_resample` is to apply it only during the `fit` stage, not during the `transform` or `predict` phase. Therefore, applying `fit_resample` when calling `fit_transform` on the pipeline makes sense. When requesting `transform`, we expect `fit_resample` not to be called, since we are in the inference/decision phase. When calling `fit().transform()`, `fit_resample` is only called during the `fit()` stage.
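A small example of those semantics (a sketch, assuming a `SMOTE` + `LogisticRegression` pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit: SMOTE.fit_resample balances the training data before the classifier is fit
pipe.fit(X, y)

# predict: SMOTE is skipped; the classifier sees X as-is, one prediction per row
print(pipe.predict(X).shape)  # (200,)
```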
This surprising API is one of the reasons why we never adopted samplers in scikit-learn: it breaks the contract `fit_transform == fit().transform()`. Some discussion happened there: https://github.com/scikit-learn/enhancement_proposals/issues/12

In the end, I would not consider it a bug, but we could improve the documentation to make this behavior obvious.
Describe the bug
The output of imblearn's pipeline is inconsistent for `fit_transform` and `fit().transform()` (see example). The reason is that in the `transform` method SMOTE is not applied while transforming (as expected), but in the `fit_transform` method SMOTE is applied while fitting and that same resampled data is returned. Is this intended, and if so, why? It seems quite confusing for the user. If it's indeed a bug, I think the fix is quite straightforward, although it will make the `fit_transform` method slower, since you first have to fit the pipeline (which includes all transformations) and then transform it all again, excluding the samplers.

Steps/Code to Reproduce
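The original snippet is not reproduced here; a minimal sketch along the lines of the report (assuming a `SMOTE` + `StandardScaler` pipeline on an imbalanced toy dataset) would be:

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
pipe = Pipeline([("smote", SMOTE(random_state=0)), ("scaler", StandardScaler())])

Xt_a = pipe.fit_transform(X, y)        # includes the rows added by SMOTE
Xt_b = pipe.fit(X, y).transform(X)     # SMOTE skipped: one row per row of X

print(Xt_a.shape, Xt_b.shape)          # the row counts differ
```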
Expected Results
I expected the `fit_transform` method to return data without balancing (the same as the `transform` method does).

Actual Results
Versions
System:
    python: 3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)]
executable: C:\Users\Mavs\Documents\Python\pycaret\venv\Scripts\python.exe
   machine: Windows-10-10.0.19044-SP0

Python dependencies:
      sklearn: 1.1.1
          pip: 22.0.4
   setuptools: 57.0.0
        numpy: 1.21.5
        scipy: 1.7.3
       Cython: 0.29.28
       pandas: 1.4.1
   matplotlib: 3.5.2
       joblib: 1.1.0
threadpoolctl: 3.0.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: C:\Users\Mavs\Documents\Python\pycaret\venv\Lib\site-packages\numpy.libs\libopenblas.XWYDX2IKJW2NMTWSFYNGFUWKQU3LYTCZ.gfortran-win_amd64.dll
        version: 0.3.17
threading_layer: pthreads
   architecture: Zen
    num_threads: 16

       user_api: openmp
   internal_api: openmp
         prefix: vcomp
       filepath: C:\Users\Mavs\Documents\Python\pycaret\venv\Lib\site-packages\sklearn.libs\vcomp140.dll
        version: None
    num_threads: 16

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: C:\Users\Mavs\Documents\Python\pycaret\venv\Lib\site-packages\scipy.libs\libopenblas.XWYDX2IKJW2NMTWSFYNGFUWKQU3LYTCZ.gfortran-win_amd64.dll
        version: 0.3.17
threading_layer: pthreads
   architecture: Zen
    num_threads: 16

Windows-10-10.0.19044-SP0
Python 3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)]
NumPy 1.21.5
SciPy 1.7.3
Scikit-Learn 1.1.1
Imbalanced-Learn 0.9.1