ray-project / tune-sklearn

A drop-in replacement for Scikit-Learn’s GridSearchCV / RandomizedSearchCV -- but with cutting-edge hyperparameter tuning techniques.
https://docs.ray.io/en/master/tune/api_docs/sklearn.html
Apache License 2.0

Raise Errors on trial failures instead of logging them #130

Closed · amogkam closed this 3 years ago

amogkam commented 3 years ago

This PR raises the underlying errors on trial failure instead of simply logging them and swallowing the exception.

Closes #108.

Depends on https://github.com/ray-project/ray/pull/11842
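
For orientation, here is a minimal sketch of the idea only (not the actual diff in this PR, which builds on the Ray change linked above): after the Tune run finishes, errored trials are detected and an exception is raised rather than only logged. Names like _run_and_raise are illustrative.

# Hypothetical sketch only -- names and the exact mechanism differ from the
# real implementation in this PR.
from ray import tune

def _run_and_raise(trainable, config, **tune_kwargs):
    # `trainable` stands in for tune-sklearn's internal _Trainable class and
    # `config` for the hyperparameter configuration it receives.
    analysis = tune.run(trainable, config=config, **tune_kwargs)
    errored = [t for t in analysis.trials if t.status == "ERROR"]
    if errored:
        # Instead of letting fit() continue silently, surface the failure.
        raise RuntimeError(
            f"{len(errored)} trial(s) errored; see the trial logs above for "
            "the underlying exception.")
    return analysis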

When running the following code:

import numpy as np
from tune_sklearn import TuneGridSearchCV
from sklearn.linear_model import SGDClassifier

parameter_grid = {"alpha": [1e-4, 1e-1, 1], "epsilon": [0.01, 0.1]}

# Deliberately tiny dataset: class 1 has only two samples, so the default
# 5-fold stratified CV split cannot be constructed and each trial errors out.
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1], [2, 1]])
y = np.array([1, 1, 2, 2, 2])

tune_search = TuneGridSearchCV(
    SGDClassifier(), parameter_grid, scoring="f1_micro", max_iters=20)
tune_search.fit(X, y)
print("Success!")

this is the stacktrace without this PR:

/Users/amog/dev/tune-sklearn/tune_sklearn/tune_basesearch.py:391: UserWarning: max_iters is set > 1 but incremental/partial training is not enabled. To enable partial training, ensure the estimator has `partial_fit` or `warm_start` and set `early_stopping=True`. Automatically setting max_iters=1.
  category=UserWarning)
Trial _Trainable_7be32_00001: Error processing event.
Traceback (most recent call last):
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 547, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 484, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/ray/worker.py", line 1449, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::_Trainable.train() (pid=73967, ip=192.168.2.228)
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/queue.py", line 167, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

ray::_Trainable.train() (pid=73967, ip=192.168.2.228)
  File "python/ray/_raylet.pyx", line 477, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 481, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 482, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 436, in ray._raylet.execute_task.function_executor
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/ray/tune/trainable.py", line 183, in train
    result = self.step()
  File "/Users/amog/dev/tune-sklearn/tune_sklearn/_trainable.py", line 106, in step
    return self._train()
  File "/Users/amog/dev/tune-sklearn/tune_sklearn/_trainable.py", line 247, in _train
    return_train_score=self.return_train_score,
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 248, in cross_validate
    for train, test in cv.split(X, y, groups))
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/joblib/parallel.py", line 1029, in __call__
    if self.dispatch_one_batch(iterator):
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/joblib/parallel.py", line 819, in dispatch_one_batch
    islice = list(itertools.islice(iterator, big_batch_size))
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 243, in <genexpr>
    delayed(_fit_and_score)(
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 336, in split
    for train, test in super().split(X, y, groups):
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 80, in split
    for test_index in self._iter_test_masks(X, y, groups):
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 697, in _iter_test_masks
    test_folds = self._make_test_folds(X, y)
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 668, in _make_test_folds
    % (self.n_splits))
ValueError: n_splits=5 cannot be greater than the number of members in each class.
Success!

Notice how the errors are logged but not raised, so program execution still continues.

With this PR:

/Users/amog/dev/tune-sklearn/tune_sklearn/tune_basesearch.py:391: UserWarning: max_iters is set > 1 but incremental/partial training is not enabled. To enable partial training, ensure the estimator has `partial_fit` or `warm_start` and set `early_stopping=True`. Automatically setting max_iters=1.
  category=UserWarning)
Trial _Trainable_568d9_00002: Error processing event.
Traceback (most recent call last):
  File "test.py", line 14, in <module>
    tune_search.fit(X, y)
  File "/Users/amog/dev/tune-sklearn/tune_sklearn/tune_basesearch.py", line 650, in fit
    result = self._fit(X, y, groups, **fit_params)
  File "/Users/amog/dev/tune-sklearn/tune_sklearn/tune_basesearch.py", line 551, in _fit
    analysis = self._tune_run(config, resources_per_trial)
  File "/Users/amog/dev/tune-sklearn/tune_sklearn/tune_gridsearch.py", line 259, in _tune_run
    loggers=self.loggers)
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/ray/tune/tune.py", line 416, in run
    runner.step()
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 389, in step
    self._process_events()  # blocking
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 532, in _process_events
    self._process_trial(trial)
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 547, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 484, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/ray/worker.py", line 1449, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::_Trainable.train() (pid=73844, ip=192.168.2.228)
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/queue.py", line 167, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

ray::_Trainable.train() (pid=73844, ip=192.168.2.228)
  File "python/ray/_raylet.pyx", line 477, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 481, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 482, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 436, in ray._raylet.execute_task.function_executor
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/ray/tune/trainable.py", line 183, in train
    result = self.step()
  File "/Users/amog/dev/tune-sklearn/tune_sklearn/_trainable.py", line 106, in step
    return self._train()
  File "/Users/amog/dev/tune-sklearn/tune_sklearn/_trainable.py", line 247, in _train
    return_train_score=self.return_train_score,
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 248, in cross_validate
    for train, test in cv.split(X, y, groups))
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/joblib/parallel.py", line 1029, in __call__
    if self.dispatch_one_batch(iterator):
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/joblib/parallel.py", line 819, in dispatch_one_batch
    islice = list(itertools.islice(iterator, big_batch_size))
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 243, in <genexpr>
    delayed(_fit_and_score)(
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 336, in split
    for train, test in super().split(X, y, groups):
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 80, in split
    for test_index in self._iter_test_masks(X, y, groups):
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 697, in _iter_test_masks
    test_folds = self._make_test_folds(X, y)
  File "/Users/amog/dev/ray/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 668, in _make_test_folds
    % (self.n_splits))
ValueError: n_splits=5 cannot be greater than the number of members in each class.
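
(As an aside, the underlying ValueError itself is easy to avoid in the reproduction script: stratified CV needs at least n_splits samples per class, so requesting fewer folds makes the script pass. The snippet below reuses the names from the script above and assumes cv is forwarded like scikit-learn's GridSearchCV argument, which tune-sklearn mirrors.)

# Builds on the reproduction script above (X, y, parameter_grid, SGDClassifier).
# Class 1 has only two samples, so 5 folds cannot be built, but 2 folds can.
tune_search = TuneGridSearchCV(
    SGDClassifier(), parameter_grid, scoring="f1_micro", max_iters=20, cv=2)
tune_search.fit(X, y)
print("Success!")
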
amogkam commented 3 years ago

cc @richardliaw @inventormc is this enough to fix https://github.com/ray-project/tune-sklearn/issues/108? If so, I can add a test.
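
A possible shape for such a test, sketched here as an illustration rather than the test that was eventually committed: fit on data that cannot be split into the default number of stratified folds and assert that fit() now raises.

import numpy as np
import pytest
from sklearn.linear_model import SGDClassifier
from tune_sklearn import TuneGridSearchCV

def test_trial_error_is_raised():
    # Class 1 has only two samples, so the default 5-fold stratified split
    # fails inside the trial; with this PR the error should propagate.
    X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1], [2, 1]])
    y = np.array([1, 1, 2, 2, 2])
    search = TuneGridSearchCV(
        SGDClassifier(), {"alpha": [1e-4, 1e-1, 1]}, scoring="f1_micro")
    with pytest.raises(Exception):
        search.fit(X, y)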

richardliaw commented 3 years ago

Hi @amogkam, for this PR, can you be sure to post an example script and example output of what we expect to see?

richardliaw commented 3 years ago

I went ahead and ran this with:

In [1]: # from sklearn.model_selection import GridSearchCV
   ...: from tune_sklearn import TuneGridSearchCV
   ...: from sklearn.ensemble import RandomForestClassifier
   ...: # Other imports
   ...: import time
   ...: import numpy as np
   ...: from sklearn.datasets import make_classification
   ...: from sklearn.model_selection import train_test_split
   ...: from sklearn.linear_model import SGDClassifier
   ...:
   ...: # Set training and validation sets
   ...: X, y = make_classification(n_samples=11000, n_features=1000, n_informative=50, n_redundant=0, n_classes=10, class_sep=2.5)
   ...: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000)
   ...:
   ...: # Example parameters to tune for RandomForestClassifier
   ...: param_grid = {
   ...:         'n_estimators': [100,200,300],
   ...:         'max_depth':range(5,30,5),
   ...:         'min_samples_split':[1,2,5,10,15],
   ...:         'min_samples_leaf':[1,2,5,10],
   ...:         'max_features': ['log2','sqrt']
   ...:     }
   ...: forest_clf = RandomForestClassifier(random_state=42,warm_start=True)
   ...:
   ...: grid_search = TuneGridSearchCV(forest_clf, param_grid, cv=5,scoring='accuracy',use_gpu=True)
   ...:
   ...: start=time.time()
   ...: grid_search.fit(X_train, y_train)
   ...: end=time.time()
   ...: print('Tune time: ',end-start)
   ...: score=grid_search.score(X_test,y_test)
   ...: print("Tune Score:", score)

And this is the stacktrace that I got:

---------------------------------------------------------------------------
TuneError                                 Traceback (most recent call last)
<ipython-input-1-f83e200c8718> in <module>
     26
     27 start=time.time()
---> 28 grid_search.fit(X_train, y_train)
     29 end=time.time()
     30 print('Tune time: ',end-start)

~/dev/tune-sklearn/tune_sklearn/tune_basesearch.py in fit(self, X, y, groups, **fit_params)
    658             if not ray_init and ray.is_initialized():
    659                 ray.shutdown()
--> 660             raise e
    661
    662     def score(self, X, y=None):

~/dev/tune-sklearn/tune_sklearn/tune_basesearch.py in fit(self, X, y, groups, **fit_params)
    648                                     "To show process output, set verbose=2.")
    649
--> 650             result = self._fit(X, y, groups, **fit_params)
    651
    652             if not ray_init and ray.is_initialized():

~/dev/tune-sklearn/tune_sklearn/tune_basesearch.py in _fit(self, X, y, groups, **fit_params)
    549
    550         self._fill_config_hyperparam(config)
--> 551         analysis = self._tune_run(config, resources_per_trial)
    552
    553         self.cv_results_ = self._format_results(self.n_splits, analysis)

~/dev/tune-sklearn/tune_sklearn/tune_gridsearch.py in _tune_run(self, config, resources_per_trial)
    254                 resources_per_trial=resources_per_trial,
    255                 local_dir=os.path.expanduser(self.local_dir),
--> 256                 loggers=self.loggers)
    257
    258         return analysis

~/miniconda3/lib/python3.7/site-packages/ray/tune/tune.py in run(run_or_experiment, name, metric, mode, stop, time_budget_s, config, resources_per_trial, num_samples, local_dir, search_alg, scheduler, keep_checkpoints_num, checkpoint_score_attr, checkpoint_freq, checkpoint_at_end, verbose, progress_reporter, loggers, log_to_file, trial_name_creator, trial_dirname_creator, sync_config, export_formats, max_failures, fail_fast, restore, server_port, resume, queue_trials, reuse_actors, trial_executor, raise_on_failed_trial, callbacks, ray_auto_init, run_errored_only, global_checkpoint_period, with_server, upload_dir, sync_to_cloud, sync_to_driver, sync_on_checkpoint)
    409     tune_start = time.time()
    410     while not runner.is_finished():
--> 411         runner.step()
    412         if verbose:
    413             _report_progress(runner, progress_reporter)

~/miniconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py in step(self)
    570             self._process_events()  # blocking
    571         else:
--> 572             self.trial_executor.on_no_available_trials(self)
    573
    574         self._stop_experiment_if_needed()

~/miniconda3/lib/python3.7/site-packages/ray/tune/trial_executor.py in on_no_available_trials(self, trial_runner)
    175                         )
    176                     raise TuneError(
--> 177                         "Insufficient cluster resources to launch trial: "
    178                         f"trial requested {resource_string}, but the cluster "
    179                         f"has only {self.resource_string()}. "

TuneError: Insufficient cluster resources to launch trial: trial requested 1 CPUs, 1 GPUs, but the cluster has only 16 CPUs, 0 GPUs, 25.34 GiB heap, 8.74 GiB objects (1.0 node:192.168.1.115).

I agree that this stacktrace is a better error message than before. However, there's a lot of context in it that makes it hard to parse. A scikit-learn user won't know about Ray's resource management protocols, so we'll want to raise an error like "We detected use_gpu=True but no GPUs were found."
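
A pre-flight check along those lines could look roughly like the following sketch (illustrative only, not code from this PR or from tune-sklearn; the function name is made up):

import ray

def _check_gpu_available(use_gpu: bool):
    # ray.cluster_resources() reports what Ray detected in the cluster,
    # e.g. {"CPU": 16.0, "GPU": 2.0, ...}; a missing "GPU" key means no GPUs.
    if use_gpu and ray.cluster_resources().get("GPU", 0) == 0:
        raise RuntimeError(
            "use_gpu=True was passed, but no GPUs were detected in the Ray "
            "cluster. Start Ray on a machine with GPUs or set use_gpu=False.")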

amogkam commented 3 years ago

@richardliaw thanks, this makes sense for this specific case, but we would have to write custom error messages for all TuneErrors, right? Is there a more general solution?

richardliaw commented 3 years ago

The long-term solution would be to update Tune to raise more specific errors (TuneResourceError, etc.) and to catch and handle each of them on the tune-sklearn side.
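
For reference, a rough sketch of what catching and repackaging on the tune-sklearn side could look like (TuneResourceError is hypothetical; only the generic TuneError exists at the time of this PR, and the function name here is illustrative):

from ray import tune
from ray.tune.error import TuneError

def _tune_run_with_friendly_errors(trainable, config):
    # `trainable` and `config` are placeholders for tune-sklearn's internal
    # _Trainable class and hyperparameter configuration.
    try:
        return tune.run(trainable, config=config)
    except TuneError as exc:
        # Once Tune exposes finer-grained subclasses (e.g. a hypothetical
        # TuneResourceError), each one can be translated into a
        # scikit-learn-friendly message here instead of this generic one.
        raise RuntimeError(
            "Hyperparameter search failed; see the Tune error above for "
            "details.") from exc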

However, this solution/problem isn't actually related to the issue originally posted (which is about a different type of error).

amogkam commented 3 years ago

@richardliaw Got it, that makes sense. So just to recap: currently, when a trial fails, the underlying error (for example a scikit-learn exception raised during cross-validation) is only logged and execution continues.

This PR makes the following change: those underlying errors are now raised from fit() instead of being swallowed.

However, as you pointed out in this PR, there are also explicit TuneErrors that don't have an underlying sklearn error (such as insufficient cluster resources). These are still silently caught, allowing program execution to continue.

I propose that these TuneErrors should be raised as well (only a small change). And as you said, we want to raise more specific types of TuneErrors down the line.

richardliaw commented 3 years ago

@amogkam raising TuneErrors sounds good to me; we should create a follow-up issue to actually catch/repackage those TuneErrors, though.

richardliaw commented 3 years ago

Can you ping me when the issues are addressed and the tests are passing?