ray-project / tune-sklearn

A drop-in replacement for Scikit-Learn's GridSearchCV / RandomizedSearchCV -- but with cutting-edge hyperparameter tuning techniques.
https://docs.ray.io/en/master/tune/api_docs/sklearn.html
Apache License 2.0

Exception Handling: seeing a `_queue.Empty` error #135

Closed richardliaw closed 3 years ago

richardliaw commented 3 years ago

If you run:

(base) ➜  tune-sklearn git:(raise-tuneerror) ✗ cat _test.py
import numpy as np
from tune_sklearn import TuneGridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge, SGDClassifier
parameter_grid = {"alpha": [1e-4], "epsilon": [0.01]}

X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1], [2, 1]])
y = np.array([1, 1, 2, 2, 2])

tune_search = TuneGridSearchCV(
    SGDClassifier(), parameter_grid, scoring="f1_micro")# , max_iters=20)
tune_search.fit(X, y)
print("done")

You get this exception:

(base) ➜  tune-sklearn git:(raise-tuneerror) ✗ python _test.py
File descriptor limit 256 is too low for production servers and may result in connection errors. At least 8192 is recommended. --- Fix with 'ulimit -n 8192'
Trial _Trainable_9426a_00000: Error processing event.
Traceback (most recent call last):
  File "_test.py", line 14, in <module>
    tune_search.fit(X, y)
  File "/Users/rliaw/dev/tune-sklearn/tune_sklearn/tune_basesearch.py", line 650, in fit
    result = self._fit(X, y, groups, **fit_params)
  File "/Users/rliaw/dev/tune-sklearn/tune_sklearn/tune_basesearch.py", line 551, in _fit
    analysis = self._tune_run(config, resources_per_trial)
  File "/Users/rliaw/dev/tune-sklearn/tune_sklearn/tune_gridsearch.py", line 259, in _tune_run
    loggers=self.loggers)
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 416, in run
    runner.step()
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 389, in step
    self._process_events()  # blocking
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 532, in _process_events
    self._process_trial(trial)
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 547, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 484, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/ray/worker.py", line 1473, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::_Trainable.train() (pid=29663, ip=192.168.1.115)
  File "/Users/rliaw/miniconda3/lib/python3.7/queue.py", line 167, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

ray::_Trainable.train() (pid=29663, ip=192.168.1.115)
  File "python/ray/_raylet.pyx", line 472, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 477, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 431, in ray._raylet.execute_task.function_executor
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 183, in train
    result = self.step()
  File "/Users/rliaw/dev/tune-sklearn/tune_sklearn/_trainable.py", line 106, in step
    return self._train()
  File "/Users/rliaw/dev/tune-sklearn/tune_sklearn/_trainable.py", line 247, in _train
    return_train_score=self.return_train_score,
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 248, in cross_validate
    for train, test in cv.split(X, y, groups))
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/joblib/parallel.py", line 1029, in __call__
    if self.dispatch_one_batch(iterator):
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/joblib/parallel.py", line 819, in dispatch_one_batch
    islice = list(itertools.islice(iterator, big_batch_size))
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 243, in <genexpr>
    delayed(_fit_and_score)(
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 336, in split
    for train, test in super().split(X, y, groups):
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 80, in split
    for test_index in self._iter_test_masks(X, y, groups):
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 697, in _iter_test_masks
    test_folds = self._make_test_folds(X, y)
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 668, in _make_test_folds
    % (self.n_splits))
ValueError: n_splits=5 cannot be greater than the number of members in each class.

The fix is probably to set `n_jobs` in `cross_validate` to 1. cc @amogkam
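For reference, the underlying `ValueError` can be reproduced directly, without Ray in the loop, by calling sklearn's `cross_validate` on the same five-sample dataset: class 1 has only two members, so the default `StratifiedKFold` with `n_splits=5` cannot split it. A minimal sketch:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_validate

X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1], [2, 1]])
y = np.array([1, 1, 2, 2, 2])  # class 1 has only 2 members, fewer than n_splits=5

try:
    # n_jobs=1 keeps everything in-process, so the real ValueError surfaces
    # directly instead of being wrapped by Ray's task-error handling
    cross_validate(SGDClassifier(), X, y, scoring="f1_micro", n_jobs=1)
except ValueError as err:
    print(err)  # the "n_splits=5 cannot be greater than ..." error above
```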

richardliaw commented 3 years ago

However, after digging into the exception handling, the chain of exceptions should actually be:

Traceback (most recent call last):
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/joblib/parallel.py", line 808, in dispatch_one_batch
    tasks = self._ready_batches.get(block=False)
  File "/Users/rliaw/miniconda3/lib/python3.7/queue.py", line 167, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 477, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 431, in ray._raylet.execute_task.function_executor
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/ray/function_manager.py", line 553, in actor_method_executor
    return method(actor, *args, **kwargs)
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 183, in train
    result = self.step()
  File "/Users/rliaw/dev/tune-sklearn/tune_sklearn/_trainable.py", line 107, in step
    result = self._train()
  File "/Users/rliaw/dev/tune-sklearn/tune_sklearn/_trainable.py", line 251, in _train
    return_train_score=self.return_train_score,
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 248, in cross_validate
    for train, test in cv.split(X, y, groups))
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/joblib/parallel.py", line 1029, in __call__
    if self.dispatch_one_batch(iterator):
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/joblib/parallel.py", line 819, in dispatch_one_batch
    islice = list(itertools.islice(iterator, big_batch_size))
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 243, in <genexpr>
    delayed(_fit_and_score)(
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 336, in split
    for train, test in super().split(X, y, groups):
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 80, in split
    for test_index in self._iter_test_masks(X, y, groups):
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 697, in _iter_test_masks
    test_folds = self._make_test_folds(X, y)
  File "/Users/rliaw/miniconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 668, in _make_test_folds
    % (self.n_splits))
ValueError: n_splits=5 cannot be greater than the number of members in each class.
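The `_queue.Empty` at the top of that chain is just joblib's internal non-blocking `get` on its ready-batch queue; Python then records it as the `__context__` of the real `ValueError` raised while handling it. A minimal sketch of that implicit chaining (the queue and error message here only mirror the traceback above, they are not joblib's actual code):

```python
import queue

ready_batches = queue.Queue()

def dispatch_one_batch():
    try:
        # joblib first tries a non-blocking get on its ready-batch queue
        ready_batches.get(block=False)  # empty queue -> raises queue.Empty
    except queue.Empty:
        # while handling Empty, consuming the task generator hits the real
        # error; Python implicitly chains Empty onto it as __context__
        raise ValueError(
            "n_splits=5 cannot be greater than the number of members "
            "in each class.")

try:
    dispatch_one_batch()
except ValueError as err:
    # a full traceback would print both, joined by "During handling of the
    # above exception, another exception occurred:"
    assert type(err.__context__) is queue.Empty
```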