scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
https://imbalanced-learn.org
MIT License

[BUG] Error when using transformer caching with Pipeline object #685

Closed zoj613 closed 4 years ago

zoj613 commented 4 years ago

Description

Calling the fit method of a Pipeline object throws an exception, UnboundLocalError: local variable 'cloned_transformer' referenced before assignment, when a value is passed to the memory argument. Therefore I am unable to cache any transformers (especially during hyperparameter tuning) using a Pipeline object.

Steps/Code to Reproduce

Example:

from sklearn.datasets import make_classification
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

X, y = make_classification()
steps = [
    ('scaler', StandardScaler()),
    ('sampler', RandomOverSampler()),
    ('clf', SGDClassifier())
]
p = Pipeline(steps, memory='./data/')
p.fit(X, y)

Expected Results

For this to run successfully.

Actual Results

~/.pyenv/versions/3.6.8/envs/train/lib/python3.6/site-packages/imblearn/pipeline.py in fit(self, X, y, **fit_params)
    285 
    286         """
--> 287         Xt, yt, fit_params = self._fit(X, y, **fit_params)
    288         with _print_elapsed_time('Pipeline',
    289                                  self._log_message(len(self.steps) - 1)):

~/.pyenv/versions/3.6.8/envs/train/lib/python3.6/site-packages/imblearn/pipeline.py in _fit(self, X, y, **fit_params)
    233                 cloned_transformer = clone(transformer)
    234             # Fit or load from cache the current transfomer
--> 235             if hasattr(cloned_transformer, "transform") or hasattr(
    236                 cloned_transformer, "fit_transform"
    237             ):

UnboundLocalError: local variable 'cloned_transformer' referenced before assignment

Versions

Linux-4.15.0-1058-aws-x86_64-with-debian-buster-sid
Python 3.6.8 (default, Nov 18 2019, 13:36:54) [GCC 6.5.0 20181026]
NumPy 1.18.1
SciPy 1.4.1
Scikit-Learn 0.22.1
Imbalanced-Learn 0.6.1

zoj613 commented 4 years ago

ping @glemaitre, what do you think could be the issue here? sklearn's pipeline works just fine without the error (if I drop the oversampler instance).

glemaitre commented 4 years ago

I cannot reproduce the error in a clean conda environment with the versions specified. Could you reinstall imbalanced-learn? I don't see why the error should occur.

chkoar commented 4 years ago

@zoj613 are you sure that the code you are executing is the one you provided? It works on my machine. A similar test case already exists in our test suite.

zoj613 commented 4 years ago

@glemaitre @chkoar I use poetry for package management and have set imbalanced-learn to the version in the current master branch. I just updated the packages and reran the code, and I get the same error in my pyenv environment. This is the input/output in IPython:

In [3]: from sklearn.datasets import make_classification                                                                                                                          

In [4]: from imblearn.pipeline import Pipeline 
   ...: from imblearn.over_sampling import RandomOverSampler 
   ...: from sklearn.preprocessing import StandardScaler 
   ...: from sklearn.linear_model import SGDClassifier 
   ...:  
   ...: X, y = make_classification() 
   ...: steps = [ 
   ...:     ('scaler', StandardScaler()), 
   ...:     ('sampler', RandomOverSampler()), 
   ...:     ('clf', SGDClassifier()) 
   ...: ] 
   ...: p = Pipeline(steps, memory='./data/') 
   ...: p.fit(X, y)                                                                                                                                                               
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
<ipython-input-4-5effb3d2fca4> in <module>
     11 ]
     12 p = Pipeline(steps, memory='./data/')
---> 13 p.fit(X, y)

~/.pyenv/versions/3.6.8/envs/absa-py36/lib/python3.6/site-packages/imblearn/pipeline.py in fit(self, X, y, **fit_params)
    285 
    286         """
--> 287         Xt, yt, fit_params = self._fit(X, y, **fit_params)
    288         with _print_elapsed_time('Pipeline',
    289                                  self._log_message(len(self.steps) - 1)):

~/.pyenv/versions/3.6.8/envs/absa-py36/lib/python3.6/site-packages/imblearn/pipeline.py in _fit(self, X, y, **fit_params)
    233                 cloned_transformer = clone(transformer)
    234             # Fit or load from cache the current transfomer
--> 235             if hasattr(cloned_transformer, "transform") or hasattr(
    236                 cloned_transformer, "fit_transform"
    237             ):

UnboundLocalError: local variable 'cloned_transformer' referenced before assignment

I really have no idea what could be causing this. It has been happening ever since I started using imbalanced-learn's Pipeline object.

Here is a list of packages and their versions:

absl-py              0.9.0                   Abseil Python Common Libraries, see https://github.com/abseil/abseil-py.
appdirs              1.4.3                   A small Python module for determining appropriate platform-specific dirs, e.g. a "user data dir".
asciimatics          1.11.0                  A cross-platform package to replace curses (mouse/keyboard input & text colours/positioning) and create ASCII animations
astor                0.8.1                   Read/rewrite/write Python ASTs
astroid              2.3.3                   An abstract syntax tree for Python with inference support.
astropy              4.0                     Community-developed python astronomy tools
atpublic             1.0                     public -- @public for populating __all__
attrs                19.3.0                  Classes Without Boilerplate
backcall             0.1.0                   Specifications for callback functions passed in to an API
beautifulsoup4       4.8.2                   Screen-scraping library
black                19.10b0                 The uncompromising code formatter.
bleach               3.1.1                   An easy safelist-based HTML-sanitizing tool.
Boruta               0.1.5 f39c077           Python Implementation of Boruta Feature Selection
bs4                  0.0.1                   Dummy package for Beautiful Soup
catboost             0.21                    Catboost Python Package
category-encoders    2.1.0 4a57405           A collection sklearn transformers to encode categorical variables as numeric
certifi              2019.11.28              Python package for providing Mozilla's CA Bundle.
chardet              3.0.4                   Universal encoding detector for Python 2 and 3
civisml-extensions   0.2.1                   scikit-learn-compatible estimators from Civis Analytics
click                7.0                     Composable command line interface toolkit
cloudpickle          1.3.0                   Extended pickling support for Python objects
colorama             0.4.3                   Cross-platform colored terminal text.
configobj            5.0.6                   Config file reading, writing and validation.
configparser         4.0.2                   Updated configparser from Python 3.7 for Python 2.6+.
confuse              1.0.0                   painless YAML configuration
cycler               0.10.0                  Composable style cycles
dask                 2.11.0                  Parallel PyData with Task Scheduling
decorator            4.4.1                   Decorators for Humans
defusedxml           0.6.0                   XML bomb protection for Python stdlib modules
deslib               0.3                     Implementation of Dynamic Ensemble Selection methods
dill                 0.3.1.1                 serialize all of python
distro               1.4.0                   Distro - an OS platform information API
ds-lime              0.1.1.27                Local Interpretable Model-Agnostic Explanations for machine learning classifiers
dvc                  0.71.0                  Git for data scientists - manage your code and data together
entrypoints          0.3                     Discover and load entry points from installed packages.
flake8               3.7.9                   the modular source code checker: pep8, pyflakes and co
flufl.lock           3.2                     NFS-safe file locking with timeouts for POSIX systems.
funcy                1.14                    A fancy and practical functional tools
future               0.18.2                  Clean single-source support for Python 3 and 2
gast                 0.3.3                   Python AST that abstracts the underlying Python version
gitdb2               3.0.2                   Git Object Database
gitpython            3.0.8                   Python Git Library
grandalf             0.6                     Graph and drawing algorithms framework
graphviz             0.13.2                  Simple Python interface for Graphviz
grpcio               1.27.2                  HTTP/2-based RPC framework
h5py                 2.10.0                  Read and write HDF5 files from Python
htmlmin              0.1.12                  An HTML Minifier
humanize             1.0.0                   Python humanize utilities
hyperopt             0.2.3                   Distributed Asynchronous Hyperparameter Optimization
idna                 2.9                     Internationalized Domain Names in Applications (IDNA)
imbalanced-learn     0.6.2                   Toolbox for imbalanced dataset in machine learning.
importlib-metadata   1.5.0                   Read metadata from Python packages
inflect              4.1.0                   Correctly generate plurals, singular nouns, ordinals, indefinite articles; convert numbers to words
ipdb                 0.12.3                  IPython-enabled pdb
ipykernel            5.1.4                   IPython Kernel for Jupyter
ipython              7.12.0                  IPython: Productive Interactive Computing
ipython-genutils     0.2.0                   Vestigial utilities from IPython
isort                4.3.21                  A Python utility / library to sort Python imports.
jedi                 0.16.0                  An autocompletion tool for Python that can be used for text editors.
jinja2               2.10                    A small but fast and easy to use stand-alone template engine written in pure python.
joblib               0.11                    Lightweight pipelining: using Python functions as pipeline jobs.
json5                0.9.1                   A Python implementation of the JSON5 data format.
jsonpath-ng          1.4.3                   A final implementation of JSONPath for Python that aims to be standard compliant, including arithmetic and binary comparison ope...
jsonschema           3.2.0                   An implementation of JSON Schema validation for Python
jupyter-client       5.3.4                   Jupyter protocol implementation and client libraries
jupyter-core         4.6.3                   Jupyter core package. A base package on which Jupyter projects rely.
jupyterlab           1.2.6                   The JupyterLab notebook server extension.
jupyterlab-server    1.0.6                   JupyterLab Server
keras-applications   1.0.8                   Reference implementations of popular deep learning models
keras-preprocessing  1.1.0                   Easy data preprocessing and data augmentation for deep learning models
kiwisolver           1.1.0                   A fast implementation of the Cassowary constraint solver
lazy-object-proxy    1.4.3                   A fast and thorough lazy object proxy.
lightgbm             2.3.1                   LightGBM Python Package
llvmlite             0.31.0                  lightweight wrapper around basic LLVM functionality
markdown             3.2.1                   Python implementation of Markdown.
markupsafe           1.1.1                   Safely add untrusted strings to HTML/XML markup.
matplotlib           3.1.3                   Python plotting package
mccabe               0.6.1                   McCabe checker, plugin for flake8
missingno            0.4.2                   Missing data visualization module for Python.
mistune              0.8.4                   The fastest markdown parser in pure Python
mock                 4.0.1                   Rolling backport of unittest.mock for all Pythons
more-itertools       8.2.0                   More routines for operating on iterables, beyond itertools
multiprocess         0.70.9                  better multiprocessing and multithreading in python
nanotime             0.5.2                   nanotime python implementation
nbconvert            5.6.1                   Converting Jupyter Notebooks
nbformat             5.0.4                   The Jupyter Notebook format
networkx             2.2                     Python package for creating and manipulating graphs and networks
notebook             6.0.3                   A web-based notebook environment for interactive computing
notedown             1.5.1                   Convert markdown to IPython notebook.
numba                0.48.0                  compiling Python code using LLVM
numpy                1.18.1                  NumPy is the fundamental package for array computing with Python.
packaging            20.1                    Core utilities for Python packages
pandas               0.25.3                  Powerful data structures for data analysis, time series, and statistics
pandas-profiling     2.4.0                   Generate profile report for pandas DataFrame
pandoc-attributes    0.1.7                   An Attribute class to be used with pandocfilters
pandocfilters        1.4.2                   Utilities for writing pandoc filters in python
parso                0.6.1                   A Python Parser
pathspec             0.7.0                   Utility library for gitignore style pattern matching of file paths.
patsy                0.5.1                   A Python package for describing statistical models and for building design matrices.
pexpect              4.8.0                   Pexpect allows easy control of interactive console applications.
phik                 0.9.8                   Phi_K correlation analyzer library
pickleshare          0.7.5                   Tiny 'shelve'-like database with concurrency support
pillow               7.0.0                   Python Imaging Library (Fork)
plotly               4.5.1                   An open-source, interactive graphing library for Python
pluggy               0.13.1                  plugin and hook calling mechanisms for python
ply                  3.11                    Python Lex & Yacc
prometheus-client    0.7.1                   Python client for the Prometheus monitoring system.
prompt-toolkit       3.0.3                   Library for building powerful interactive command lines in Python
protobuf             3.11.3                  Protocol Buffers
ptyprocess           0.6.0                   Run a subprocess in a pseudo terminal
py                   1.8.1                   library with cross-python path, ini-parsing, io, code, log facilities
py4j                 0.10.7                  Enables Python programs to dynamically access arbitrary Java objects
pyaml                19.12.0                 PyYAML-based module to produce pretty and readable YAML-serialized data
pyarrow              0.15.1                  Python library for Apache Arrow
pyasn1               0.4.8                   ASN.1 types and codecs
pycodestyle          2.5.0                   Python style guide checker
pydotplus            2.0.2                   Python interface to Graphviz's Dot language
pyfiglet             0.8.post1               Pure-python FIGlet implementation
pyflakes             2.1.1                   passive checker of Python programs
pygments             2.5.2                   Pygments is a syntax highlighting package written in Python.
pylint               2.4.4                   python code static checker
pymongo              3.10.1                  Python driver for MongoDB <http://www.mongodb.org>
pyparsing            2.4.6                   Python parsing module
pyrsistent           0.15.7                  Persistent/Functional/Immutable data structures
pyspark              2.4.5                   Apache Spark Python API
pytest               5.3.5                   pytest: simple powerful testing with Python
pytest-pylint        0.15.0                  pytest plugin to check source code with pylint
python-dateutil      2.8.1                   Extensions to the standard Python datetime module
pytz                 2019.3                  World timezone definitions, modern and historical
pywavelets           1.1.1                   PyWavelets, wavelet transform module
pyyaml               5.3                     YAML parser and emitter for Python
pyzmq                18.1.1                  Python bindings for 0MQ
regex                2020.2.18               Alternative regular expression module, to replace re.
requests             2.23.0                  Python HTTP for Humans.
retrying             1.3.3                   Retrying
ruamel.yaml          0.16.10                 ruamel.yaml is a YAML parser/emitter that supports roundtrip preservation of comments, seq/map flow style, and map key order
ruamel.yaml.clib     0.2.0                   C version of reader, parser and emitter for ruamel.yaml derived from libyaml
scikit-image         0.14.0                  Image processing routines for SciPy
scikit-learn         0.22.1                  A set of python modules for machine learning and data mining
scikit-optimize      0.6+19.g180d6be 180d6be Sequential model-based optimization toolbox.
scikit-plot          0.3.7                   An intuitive library to add plotting functionality to scikit-learn objects.
scipy                1.4.1                   SciPy: Scientific Library for Python
seaborn              0.9.1                   seaborn: statistical data visualization
send2trash           1.5.0                   Send file to trash natively under Mac OS X, Windows and Linux.
shap                 0.32.1                  A unified approach to explain the output of any machine learning model.
shortuuid            0.5.0                   A generator library for concise, unambiguous and URL-safe UUIDs.
six                  1.14.0                  Python 2 and 3 compatibility utilities
skater               1.1.2                   Model Interpretation Library
skorch               0.7.0                   scikit-learn compatible neural network library for pytorch
smmap2               2.0.5                   A pure Python implementation of a sliding window memory map manager
soupsieve            1.9.5                   A modern CSS selector implementation for Beautiful Soup.
statsmodels          0.11.0                  Statistical computations and models for Python
tabulate             0.8.6                   Pretty-print tabular data
tensorboard          1.13.1                  TensorBoard lets you watch Tensors Flow
tensorflow-estimator 1.13.0                  TensorFlow Estimator.
tensorflow-gpu       1.13.2                  TensorFlow is an open source machine learning framework for everyone.
termcolor            1.1.0                   ANSII Color formatting for output in terminal.
terminado            0.8.3                   Terminals served to xterm.js using Tornado websockets
testpath             0.4.4                   Test utilities for code working with files and commands
toml                 0.10.0                  Python Library for Tom's Obvious, Minimal Language
toolz                0.10.0                  List processing tools and functional utilities
torch                1.4.0                   Tensors and Dynamic neural networks in Python with strong GPU acceleration
tornado              6.0.3                   Tornado is a Python web framework and asynchronous networking library, originally developed at FriendFeed.
tqdm                 4.43.0                  Fast, Extensible Progress Meter
traitlets            4.3.3                   Traitlets Python config system
treelib              1.5.5                   A Python 2/3 implementation of tree structure.
typed-ast            1.4.1                   a fork of Python 2 and 3 ast modules with type comment support
urllib3              1.25.8                  HTTP library with thread-safe connection pooling, file post, and more.
voluptuous           0.11.7                  # Voluptuous is a Python data validation library
wcwidth              0.1.8                   Measures number of Terminal column cells of wide-character codes
webencodings         0.5.1                   Character encoding aliases for legacy web content
werkzeug             1.0.0                   The comprehensive WSGI web application library.
wheel                0.34.2                  A built-package format for Python
wordcloud            1.3.1                   A little word cloud generator
wrapt                1.11.2                  Module for decorators, wrappers and monkey patching.
xgboost              0.90                    XGBoost Python Package
yellowbrick          1.0.1                   A suite of visual analysis and diagnostic tools for machine learning.
zipp                 3.0.0                   Backport of pathlib-compatible object wrapper for zip files

chkoar commented 4 years ago

@zoj613 just in case, can you please upgrade your joblib version? If I am not mistaken, it seems fairly old.
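
A quick way to see which joblib is installed before upgrading is sketched below. This is only an illustration: the helper names version_tuple and describe_joblib are made up for this example, and the two reference versions are simply the ones reported in this thread (0.11 failing, 0.14.1 working), not an official minimum requirement.

```python
import re


def version_tuple(version):
    """Parse the leading numeric release part, e.g. '0.14.1' -> (0, 14, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group(0).split("."))


# Versions reported in this thread; they are reference points, not a spec.
FAILING_IN_THREAD = "0.11"
WORKING_IN_THREAD = "0.14.1"


def describe_joblib():
    """Return a short description of the installed joblib, if any."""
    try:
        import joblib
    except ImportError:
        return "joblib is not installed"
    # Caching disabled (None), so nothing is written to disk here.
    memory = joblib.Memory(None, verbose=0)
    attr = "location" if hasattr(memory, "location") else "cachedir"
    older = version_tuple(joblib.__version__) < version_tuple(WORKING_IN_THREAD)
    note = "older than" if older else "at least"
    return "joblib {} ({} {}), Memory exposes '{}'".format(
        joblib.__version__, note, WORKING_IN_THREAD, attr
    )


if __name__ == "__main__":
    print(describe_joblib())
```

If the report shows a version older than the one that worked here, upgrading joblib (and relaxing whatever dependency pins it) is the first thing to try.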

zoj613 commented 4 years ago

@chkoar updating joblib to the latest version worked like a charm, thank you. It was pinned to that version because the version of skater being used required joblib 0.11. After downgrading skater to 1.0.4, I was able to upgrade joblib to 0.14.1.

In [1]: from sklearn.datasets import make_classification                                                                                                                          

In [2]: from imblearn.pipeline import Pipeline 
   ...: from imblearn.over_sampling import RandomOverSampler 
   ...: from sklearn.preprocessing import StandardScaler 
   ...: from sklearn.linear_model import SGDClassifier 
   ...:  
   ...: X, y = make_classification() 
   ...: steps = [ 
   ...:     ('scaler', StandardScaler()), 
   ...:     ('sampler', RandomOverSampler()), 
   ...:     ('clf', SGDClassifier()) 
   ...: ] 
   ...: p = Pipeline(steps, memory='./data/') 
   ...: p.fit(X, y) 
Out[2]: 
Pipeline(memory='./data/',
         steps=[('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('sampler',
                 RandomOverSampler(random_state=None,
                                   sampling_strategy='auto')),
                ('clf',
                 SGDClassifier(alpha=0.0001, average=False, class_weight=None,
                               early_stopping=False, epsilon=0.1, eta0=0.0,
                               fit_intercept=True, l1_ratio=0.15,
                               learning_rate='optimal', loss='hinge',
                               max_iter=1000, n_iter_no_change=5, n_jobs=None,
                               penalty='l2', power_t=0.5, random_state=None,
                               shuffle=True, tol=0.001, validation_fraction=0.1,
                               verbose=0, warm_start=False))],
         verbose=False)

In [3]:

chkoar commented 4 years ago

@zoj613 great. @glemaitre the problem is here. When an old joblib version is installed, we check only whether the cachedir is None. Indenting this else will probably solve the problem.
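
The failure mode described above can be sketched in isolation. This is a hypothetical reconstruction based on the traceback and the description in this comment, not the actual imblearn source: OldStyleMemory, pick_transformer_buggy and pick_transformer_fixed are invented names, and copy.deepcopy stands in for sklearn's clone.

```python
import copy


class OldStyleMemory:
    """Hypothetical stand-in for a cache object that only has `cachedir`."""

    def __init__(self, cachedir):
        self.cachedir = cachedir


def pick_transformer_buggy(memory, transformer):
    # The last `else` belongs to the outer if/elif chain, so when `cachedir`
    # exists and is not None, no branch assigns `cloned`.
    if hasattr(memory, "location"):
        if memory.location is None:
            cloned = transformer
        else:
            cloned = copy.deepcopy(transformer)
    elif hasattr(memory, "cachedir"):
        if memory.cachedir is None:
            cloned = transformer
    else:
        cloned = copy.deepcopy(transformer)
    return cloned  # raises UnboundLocalError in the unassigned case


def pick_transformer_fixed(memory, transformer):
    # One possible fix: give the inner `cachedir` check its own `else`,
    # and keep a fallback for objects with neither attribute.
    if hasattr(memory, "location"):
        if memory.location is None:
            cloned = transformer
        else:
            cloned = copy.deepcopy(transformer)
    elif hasattr(memory, "cachedir"):
        if memory.cachedir is None:
            cloned = transformer
        else:
            cloned = copy.deepcopy(transformer)
    else:
        cloned = copy.deepcopy(transformer)
    return cloned


if __name__ == "__main__":
    step = {"name": "scaler"}  # any object works as a placeholder transformer
    try:
        pick_transformer_buggy(OldStyleMemory("./data/"), step)
    except UnboundLocalError as exc:
        print("buggy version:", exc)
    print("fixed version:", pick_transformer_fixed(OldStyleMemory("./data/"), step))
```

In this sketch, every path through the fixed function assigns the variable, which is why the error only appeared when caching was enabled together with an object exposing cachedir rather than location.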

glemaitre commented 4 years ago

Oh yes, this should be the bug