ray-project / tune-sklearn

A drop-in replacement for Scikit-Learn’s GridSearchCV / RandomizedSearchCV -- but with cutting edge hyperparameter tuning techniques.
https://docs.ray.io/en/master/tune/api_docs/sklearn.html
Apache License 2.0

[bug] TuneSearchCV with hyperopt raises an error with sklearn.RandomForestClassifier #186

Closed mariesosa closed 3 years ago

mariesosa commented 3 years ago

Describe the bug

When TuneSearchCV is run with an unfitted sklearn.ensemble.RandomForestClassifier and search_optimization="hyperopt", it raises the following error: AttributeError: 'RandomForestClassifier' object has no attribute 'estimators_'.

Steps/Code to Reproduce

from tune_sklearn import TuneSearchCV
from ray import tune
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_blobs

# Create a test dataset
X, y = make_blobs(n_samples=1000, centers=2, n_features=5)
# Instantiate a RandomForestClassifier
clf = RandomForestClassifier()
# Define a parameter to optimize
params = {'max_depth': tune.randint(5, 100)}

# Perform hyperparameter optimization
tune_search = TuneSearchCV(
    clf, params, cv=5, scoring='roc_auc', random_state=0,
    search_optimization="hyperopt"
)
tune_search = tune_search.fit(X, y)

Expected Results

No error is thrown.

Actual Results

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-9-ecf3059b07c2> in <module>
      4 )
      5 
----> 6 tune_search.fit(X, y)

~/virtualenvs/tf-gpu-2.2/lib/python3.6/site-packages/tune_sklearn/tune_basesearch.py in fit(self, X, y, groups, **fit_params)
    662                                     "To show process output, set verbose=2.")
    663 
--> 664             result = self._fit(X, y, groups, **fit_params)
    665 
    666             if not ray_init and ray.is_initialized():

~/virtualenvs/tf-gpu-2.2/lib/python3.6/site-packages/tune_sklearn/tune_basesearch.py in _fit(self, X, y, groups, **fit_params)
    563 
    564         self._fill_config_hyperparam(config)
--> 565         analysis = self._tune_run(config, resources_per_trial)
    566 
    567         self.cv_results_ = self._format_results(self.n_splits, analysis)

~/virtualenvs/tf-gpu-2.2/lib/python3.6/site-packages/tune_sklearn/tune_search.py in _tune_run(self, config, resources_per_trial)
    713                 "ignore", message="fail_fast='raise' "
    714                 "detected.")
--> 715             analysis = tune.run(trainable, **run_args)
    716         return analysis

~/virtualenvs/tf-gpu-2.2/lib/python3.6/site-packages/ray/tune/tune.py in run(run_or_experiment, name, metric, mode, stop, time_budget_s, config, resources_per_trial, num_samples, local_dir, search_alg, scheduler, keep_checkpoints_num, checkpoint_score_attr, checkpoint_freq, checkpoint_at_end, verbose, progress_reporter, log_to_file, trial_name_creator, trial_dirname_creator, sync_config, export_formats, max_failures, fail_fast, restore, server_port, resume, queue_trials, reuse_actors, trial_executor, raise_on_failed_trial, callbacks, loggers, ray_auto_init, run_errored_only, global_checkpoint_period, with_server, upload_dir, sync_to_cloud, sync_to_driver, sync_on_checkpoint)
    343         search_alg = BasicVariantGenerator()
    344 
--> 345     if config and not search_alg.set_search_properties(metric, mode, config):
    346         if has_unresolved_values(config):
    347             raise ValueError(

~/virtualenvs/tf-gpu-2.2/lib/python3.6/site-packages/ray/tune/suggest/search_generator.py in set_search_properties(self, metric, mode, config)
     51     def set_search_properties(self, metric: Optional[str], mode: Optional[str],
     52                               config: Dict) -> bool:
---> 53         return self.searcher.set_search_properties(metric, mode, config)
     54 
     55     @property

~/virtualenvs/tf-gpu-2.2/lib/python3.6/site-packages/ray/tune/suggest/hyperopt.py in set_search_properties(self, metric, mode, config)
    256             self.metric_op = 1.
    257 
--> 258         self._setup_hyperopt()
    259         return True
    260 

~/virtualenvs/tf-gpu-2.2/lib/python3.6/site-packages/ray/tune/suggest/hyperopt.py in _setup_hyperopt(self)
    198             self._points_to_evaluate = len(self._points_to_evaluate)
    199 
--> 200         self.domain = hpo.Domain(lambda spc: spc, self._space)
    201 
    202     def _convert_categories_to_indices(self, config):

~/virtualenvs/tf-gpu-2.2/lib/python3.6/site-packages/hyperopt/base.py in __init__(self, fn, expr, workdir, pass_expr_memo_ctrl, name, loss_target)
    833             self.pass_expr_memo_ctrl = pass_expr_memo_ctrl
    834 
--> 835         self.expr = pyll.as_apply(expr)
    836 
    837         self.params = {}

~/virtualenvs/tf-gpu-2.2/lib/python3.6/site-packages/hyperopt/pyll/base.py in as_apply(obj)
    218         items.sort()
    219         if all(isinstance(k, six.string_types) for k in obj):
--> 220             named_args = [(k, as_apply(v)) for (k, v) in items]
    221             rval = Apply("dict", [], named_args, len(named_args))
    222         else:

~/virtualenvs/tf-gpu-2.2/lib/python3.6/site-packages/hyperopt/pyll/base.py in <listcomp>(.0)
    218         items.sort()
    219         if all(isinstance(k, six.string_types) for k in obj):
--> 220             named_args = [(k, as_apply(v)) for (k, v) in items]
    221             rval = Apply("dict", [], named_args, len(named_args))
    222         else:

~/virtualenvs/tf-gpu-2.2/lib/python3.6/site-packages/hyperopt/pyll/base.py in as_apply(obj)
    210         rval = Apply("pos_args", [as_apply(a) for a in obj], {}, len(obj))
    211     elif isinstance(obj, list):
--> 212         rval = Apply("pos_args", [as_apply(a) for a in obj], {}, None)
    213     elif isinstance(obj, dict):
    214         items = list(obj.items())

~/virtualenvs/tf-gpu-2.2/lib/python3.6/site-packages/hyperopt/pyll/base.py in <listcomp>(.0)
    210         rval = Apply("pos_args", [as_apply(a) for a in obj], {}, len(obj))
    211     elif isinstance(obj, list):
--> 212         rval = Apply("pos_args", [as_apply(a) for a in obj], {}, None)
    213     elif isinstance(obj, dict):
    214         items = list(obj.items())

~/virtualenvs/tf-gpu-2.2/lib/python3.6/site-packages/hyperopt/pyll/base.py in as_apply(obj)
    224             rval = Apply("dict", [as_apply(new_items)], {}, o_len=None)
    225     else:
--> 226         rval = Literal(obj)
    227     assert isinstance(rval, Apply)
    228     return rval

~/virtualenvs/tf-gpu-2.2/lib/python3.6/site-packages/hyperopt/pyll/base.py in __init__(self, obj)
    541     def __init__(self, obj=None):
    542         try:
--> 543             o_len = len(obj)
    544         except TypeError:
    545             o_len = None

~/virtualenvs/tf-gpu-2.2/lib/python3.6/site-packages/sklearn/ensemble/_base.py in __len__(self)
    162     def __len__(self):
    163         """Return the number of estimators in the ensemble."""
--> 164         return len(self.estimators_)
    165 
    166     def __getitem__(self, index):

Versions

hyperopt==0.2.5 numpy==1.18.4 ray==1.2.0 scikit-learn==0.24.1 tune_sklearn==1.2.0

A bit of a clue

The line config["estimator_list"] = [self.estimator] in https://github.com/ray-project/tune-sklearn/blob/master/tune_sklearn/tune_search.py#L627 may be involved. The estimator appears to end up in the config that is handed to hyperopt, which then calls len() on it while setting up the search space.
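
To illustrate the suspicion above, here is a minimal sketch of a recursive walk over a config dict that probes len() on every leaf, loosely mirroring what hyperopt's as_apply and Literal do in the traceback. UnfittedEnsemble is a hypothetical stand-in for an unfitted RandomForestClassifier, and find_bad_leaves is an illustrative helper, not part of hyperopt or tune-sklearn.

```python
class UnfittedEnsemble:
    """Hypothetical stand-in for an unfitted sklearn ensemble.

    As in the traceback, __len__ reads `estimators_`, which only
    exists after fit() has been called.
    """

    def __len__(self):
        return len(self.estimators_)


def find_bad_leaves(obj, path="config"):
    """Return (path, exception name) for every leaf whose len() raises
    something other than TypeError, the only error Literal catches."""
    bad = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            bad += find_bad_leaves(value, f"{path}[{key!r}]")
    elif isinstance(obj, (list, tuple)):
        for index, value in enumerate(obj):
            bad += find_bad_leaves(value, f"{path}[{index}]")
    else:
        try:
            len(obj)
        except TypeError:
            pass  # objects without a length are harmless here
        except Exception as exc:
            bad.append((path, type(exc).__name__))
    return bad


# A toy config; the real one contains many more entries.
config = {
    "early_stopping": False,
    "estimator_list": [UnfittedEnsemble()],
    "max_depth": 5,
}
bad_leaves = find_bad_leaves(config)
print(bad_leaves)
```

Under these assumptions, only the estimator_list entry is reported, which matches the last frames of the traceback.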

richardliaw commented 3 years ago

@mariesosa thanks a bunch for opening this issue!

@krfricke this looks like an interesting bug where we pass too many items to the Search Algorithm to convert the search space:

(base) ➜  tune-sklearn git:(master) ✗ BETTER_EXCEPTIONS=1 python _test.py
Traceback (most recent call last):
  File "_test.py", line 18, in <module>
    tune_search = tune_search.fit(X, y)
    │             │               │  └ array([0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, ...
    │             │               └ array([[ -4.79205113,  -6.04797091,   1.80047773, -10.84377091,
          5.00779976],
       [ -5.20333143,  -5.19684753,   0.9...
    │             └ TuneSearchCV(cv=5, estimator=RandomForestClassifier(),
             loggers=[<class 'ray.tune.logger.CSVLogger'>,
              ...
    └ TuneSearchCV(cv=5, estimator=RandomForestClassifier(),
             loggers=[<class 'ray.tune.logger.CSVLogger'>,
              ...
  File "/Users/rliaw/dev/tune-sklearn/tune_sklearn/tune_basesearch.py", line 663, in fit
    result = self._fit(X, y, groups, **fit_params)
             │         │  │  │         └ {}
             │         │  │  └ None
             │         │  └ array([0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, ...
             │         └ array([[ -4.79205113,  -6.04797091,   1.80047773, -10.84377091,
          5.00779976],
       [ -5.20333143,  -5.19684753,   0.9...
             └ TuneSearchCV(cv=5, estimator=RandomForestClassifier(),
             loggers=[<class 'ray.tune.logger.CSVLogger'>,
              ...
  File "/Users/rliaw/dev/tune-sklearn/tune_sklearn/tune_basesearch.py", line 564, in _fit
    analysis = self._tune_run(config, resources_per_trial)
               │              │       └ {'cpu': 1, 'gpu': 0}
               │              └ {'early_stopping': False, 'early_stop_type': <EarlyStopping.NO_EARLY_STOP: 7>, 'X_id': ObjectRef(fffffffffffffffffffffffffffffff...
               └ TuneSearchCV(cv=5, estimator=RandomForestClassifier(),
             loggers=[<class 'ray.tune.logger.CSVLogger'>,
              ...
  File "/Users/rliaw/dev/tune-sklearn/tune_sklearn/tune_search.py", line 715, in _tune_run
    analysis = tune.run(trainable, **run_args)
               │        │            └ {'scheduler': None, 'reuse_actors': True, 'verbose': 0, 'stop': <ray.tune.stopper.MaximumIterationStopper object at 0x7faf5083d1...
               │        └ <class 'tune_sklearn._trainable._Trainable'>
               └ <module 'ray.tune' from '/Users/rliaw/miniconda3/lib/python3.7/site-packages/ray/tune/__init__.py'>
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 428, in run
    if config and not search_alg.set_search_properties(metric, mode, config):
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/ray/tune/suggest/search_generator.py", line 53, in set_search_properties
    return self.searcher.set_search_properties(metric, mode, config)
           │                                   │       │     └ {'early_stopping': False, 'early_stop_type': <EarlyStopping.NO_EARLY_STOP: 7>, 'X_id': ObjectRef(fffffffffffffffffffffffffffffff...
           │                                   │       └ None
           │                                   └ None
           └ <ray.tune.suggest.search_generator.SearchGenerator object at 0x7faf30824150>
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/ray/tune/suggest/hyperopt.py", line 258, in set_search_properties
    self._setup_hyperopt()
    └ <ray.tune.suggest.hyperopt.HyperOptSearch object at 0x7faf30824810>
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/ray/tune/suggest/hyperopt.py", line 200, in _setup_hyperopt
    self.domain = hpo.Domain(lambda spc: spc, self._space)
    │             │                           └ <ray.tune.suggest.hyperopt.HyperOptSearch object at 0x7faf30824810>
    │             └ <module 'hyperopt' from '/Users/rliaw/miniconda3/lib/python3.7/site-packages/hyperopt/__init__.py'>
    └ <ray.tune.suggest.hyperopt.HyperOptSearch object at 0x7faf30824810>
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/hyperopt/base.py", line 822, in __init__
    self.expr = pyll.as_apply(expr)
    │           │             └ {'early_stopping': False, 'early_stop_type': <EarlyStopping.NO_EARLY_STOP: 7>, 'X_id': ObjectRef(fffffffffffffffffffffffffffffff...
    │           └ <module 'hyperopt.pyll' from '/Users/rliaw/miniconda3/lib/python3.7/site-packages/hyperopt/pyll/__init__.py'>
    └ <hyperopt.base.Domain object at 0x7faf30ac5d10>
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/hyperopt/pyll/base.py", line 220, in as_apply
    named_args = [(k, as_apply(v)) for (k, v) in items]
                      │                          └ [('X_id', ObjectRef(ffffffffffffffffffffffffffffffffffffffff0100000001000000)), ('cv', StratifiedKFold(n_splits=5, random_state=...
                      └ <function as_apply at 0x7faf105dd0e0>
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/hyperopt/pyll/base.py", line 220, in <listcomp>
    named_args = [(k, as_apply(v)) for (k, v) in items]
                   │  │        │        │  └ [RandomForestClassifier()]
                   │  │        │        └ 'estimator_list'
                   │  │        └ [RandomForestClassifier()]
                   │  └ <function as_apply at 0x7faf105dd0e0>
                   └ 'estimator_list'
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/hyperopt/pyll/base.py", line 212, in as_apply
    rval = Apply('pos_args', [as_apply(a) for a in obj], {}, None)
           │                  │                    └ [RandomForestClassifier()]
           │                  └ <function as_apply at 0x7faf105dd0e0>
           └ <class 'hyperopt.pyll.base.Apply'>
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/hyperopt/pyll/base.py", line 212, in <listcomp>
    rval = Apply('pos_args', [as_apply(a) for a in obj], {}, None)
           │                  │        │      └ RandomForestClassifier()
           │                  │        └ RandomForestClassifier()
           │                  └ <function as_apply at 0x7faf105dd0e0>
           └ <class 'hyperopt.pyll.base.Apply'>
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/hyperopt/pyll/base.py", line 226, in as_apply
    rval = Literal(obj)
           │       └ RandomForestClassifier()
           └ <class 'hyperopt.pyll.base.Literal'>
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/hyperopt/pyll/base.py", line 543, in __init__
    o_len = len(obj)
                └ RandomForestClassifier()
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/ensemble/_base.py", line 164, in __len__
    return len(self.estimators_)
               └ RandomForestClassifier()
AttributeError: 'RandomForestClassifier' object has no attribute 'estimators_'

krfricke commented 3 years ago

Oh interesting! I'll look into this more closely tomorrow!

krfricke commented 3 years ago

So this issue comes up because the estimators_ attribute of RandomForestClassifier() only gets set when fit() is called.

However, during search space initialization, HyperOpt calls len(param):


class Literal(Apply):
    def __init__(self, obj=None):
        try:
            o_len = len(obj)
        except TypeError:
            o_len = None
        Apply.__init__(self, "literal", [], {}, o_len, pure=True)
        self._obj = obj

and this raises an AttributeError rather than a TypeError, because len(RandomForestClassifier()) tries to return the number of estimators (decision trees), which are not initialized yet:

    def __len__(self):
        """Return the number of estimators in the ensemble."""
        return len(self.estimators_)

While I think this is actually a bug in Hyperopt (it should probably catch a broader exception instead), we might be able to circumvent it by passing the estimators via object store references rather than as objects. I'll work on a fix today.
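
As a rough sketch of the two ideas above (not the actual patch): safe_len shows a length probe that tolerates any exception, and OpaqueRef is a hypothetical wrapper standing in for an object store reference, so the search algorithm never calls len() on the estimator itself. UnfittedEnsemble, strict_len, safe_len, and OpaqueRef are all illustrative names and do not exist in hyperopt, Ray, or tune-sklearn.

```python
class UnfittedEnsemble:
    """Hypothetical stand-in for an unfitted RandomForestClassifier."""

    def __len__(self):
        # `estimators_` is only set by fit(), so this raises AttributeError.
        return len(self.estimators_)


def strict_len(obj):
    """Length probe that, like the Literal code above, catches only TypeError."""
    try:
        return len(obj)
    except TypeError:
        return None


def safe_len(obj):
    """Length probe that treats any failure of len() as 'no length'."""
    try:
        return len(obj)
    except Exception:
        return None


class OpaqueRef:
    """Hypothetical handle that hides the wrapped object from len().

    It defines no __len__, so len(OpaqueRef(...)) raises TypeError,
    which the strict probe already handles.
    """

    def __init__(self, obj):
        self._obj = obj

    def get(self):
        return self._obj


estimator = UnfittedEnsemble()

# The strict probe lets the AttributeError escape, as in the traceback.
try:
    strict_len(estimator)
    strict_failed = False
except AttributeError:
    strict_failed = True

print(strict_failed)
print(safe_len(estimator))               # broad catch swallows the error
print(strict_len(OpaqueRef(estimator)))  # wrapper sidesteps __len__ entirely
```

The wrapper approach has the advantage of not depending on a change in Hyperopt; the trainable would unwrap the handle (here via get()) before using the estimator.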