Closed · amogkam closed this pull request 3 years ago
cc @richardliaw @inventormc is this enough to fix https://github.com/ray-project/tune-sklearn/issues/108? If so, I can add a test.
Hi @amogkam, for this PR, can you be sure to post an example script and example output of what we expect to see?
I went ahead and ran this with:
```python
# from sklearn.model_selection import GridSearchCV
from tune_sklearn import TuneGridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Other imports
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier

# Set training and validation sets
X, y = make_classification(n_samples=11000, n_features=1000, n_informative=50,
                           n_redundant=0, n_classes=10, class_sep=2.5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000)

# Example parameters to tune for RandomForestClassifier
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': range(5, 30, 5),
    'min_samples_split': [1, 2, 5, 10, 15],
    'min_samples_leaf': [1, 2, 5, 10],
    'max_features': ['log2', 'sqrt'],
}
forest_clf = RandomForestClassifier(random_state=42, warm_start=True)

grid_search = TuneGridSearchCV(forest_clf, param_grid, cv=5,
                               scoring='accuracy', use_gpu=True)

start = time.time()
grid_search.fit(X_train, y_train)
end = time.time()
print('Tune time: ', end - start)
score = grid_search.score(X_test, y_test)
print("Tune Score:", score)
```
And this is the stacktrace that I got:
---------------------------------------------------------------------------
TuneError Traceback (most recent call last)
<ipython-input-1-f83e200c8718> in <module>
26
27 start=time.time()
---> 28 grid_search.fit(X_train, y_train)
29 end=time.time()
30 print('Tune time: ',end-start)
~/dev/tune-sklearn/tune_sklearn/tune_basesearch.py in fit(self, X, y, groups, **fit_params)
658 if not ray_init and ray.is_initialized():
659 ray.shutdown()
--> 660 raise e
661
662 def score(self, X, y=None):
~/dev/tune-sklearn/tune_sklearn/tune_basesearch.py in fit(self, X, y, groups, **fit_params)
648 "To show process output, set verbose=2.")
649
--> 650 result = self._fit(X, y, groups, **fit_params)
651
652 if not ray_init and ray.is_initialized():
~/dev/tune-sklearn/tune_sklearn/tune_basesearch.py in _fit(self, X, y, groups, **fit_params)
549
550 self._fill_config_hyperparam(config)
--> 551 analysis = self._tune_run(config, resources_per_trial)
552
553 self.cv_results_ = self._format_results(self.n_splits, analysis)
~/dev/tune-sklearn/tune_sklearn/tune_gridsearch.py in _tune_run(self, config, resources_per_trial)
254 resources_per_trial=resources_per_trial,
255 local_dir=os.path.expanduser(self.local_dir),
--> 256 loggers=self.loggers)
257
258 return analysis
~/miniconda3/lib/python3.7/site-packages/ray/tune/tune.py in run(run_or_experiment, name, metric, mode, stop, time_budget_s, config, resources_per_trial, num_samples, local_dir, search_alg, scheduler, keep_checkpoints_num, checkpoint_score_attr, checkpoint_freq, checkpoint_at_end, verbose, progress_reporter, loggers, log_to_file, trial_name_creator, trial_dirname_creator, sync_config, export_formats, max_failures, fail_fast, restore, server_port, resume, queue_trials, reuse_actors, trial_executor, raise_on_failed_trial, callbacks, ray_auto_init, run_errored_only, global_checkpoint_period, with_server, upload_dir, sync_to_cloud, sync_to_driver, sync_on_checkpoint)
409 tune_start = time.time()
410 while not runner.is_finished():
--> 411 runner.step()
412 if verbose:
413 _report_progress(runner, progress_reporter)
~/miniconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py in step(self)
570 self._process_events() # blocking
571 else:
--> 572 self.trial_executor.on_no_available_trials(self)
573
574 self._stop_experiment_if_needed()
~/miniconda3/lib/python3.7/site-packages/ray/tune/trial_executor.py in on_no_available_trials(self, trial_runner)
175 )
176 raise TuneError(
--> 177 "Insufficient cluster resources to launch trial: "
178 f"trial requested {resource_string}, but the cluster "
179 f"has only {self.resource_string()}. "
TuneError: Insufficient cluster resources to launch trial: trial requested 1 CPUs, 1 GPUs, but the cluster has only 16 CPUs, 0 GPUs, 25.34 GiB heap, 8.74 GiB objects (1.0 node:192.168.1.115).
I agree that this stacktrace is a better error message than before. However, there's a lot of context in the error message that makes it hard to parse. A scikit-learn user will not know about Ray's resource management protocols. Hence, we'll want to raise an error like "We detected use_gpu but no GPUs found."
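A pre-flight check along those lines could look like the following sketch. The helper name `check_gpu_available` and its wiring are my assumption, not the actual tune-sklearn code; in the real library the resource dict would come from `ray.cluster_resources()`, which is stubbed out here with a plain dict so the example is self-contained.

```python
# Hypothetical sketch of an up-front GPU check for tune-sklearn.
# In practice this would consult ray.cluster_resources(); here we pass
# the resource dict in directly.

def check_gpu_available(cluster_resources, use_gpu):
    """Raise a clear, sklearn-user-friendly error if use_gpu=True but
    the Ray cluster reports no GPUs."""
    if use_gpu and cluster_resources.get("GPU", 0) < 1:
        raise ValueError(
            "We detected use_gpu=True, but no GPUs were found in the "
            "Ray cluster. Pass use_gpu=False or run on a GPU machine.")

# A CPU-only cluster, like the one in the stacktrace above:
cpu_only = {"CPU": 16.0}
check_gpu_available(cpu_only, use_gpu=False)  # no error on the CPU path
```

Running the check before handing work to Tune would surface the problem immediately, instead of after Tune's scheduling loop gives up.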
@richardliaw thanks, this makes sense for this specific case, but we would have to write custom error messages for all TuneErrors right? Is there a more general solution?
The long-term solution would be to update Tune to raise more specific errors (TuneResourceError, etc.) and catch/handle each of them on the tune-sklearn side.
However, this solution/problem isn't actually related to the issue originally posted (which is about a different type of error).
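That long-term shape might look like the sketch below. The subclass names extend the `TuneResourceError` suggestion above and are illustrative only, not Ray's actual exception hierarchy or API.

```python
# Illustrative only: specific TuneError subclasses that tune-sklearn
# could catch and translate. These class names are assumptions, not
# the real ray.tune API.

class TuneError(Exception):
    """Base class for all Tune failures."""

class TuneResourceError(TuneError):
    """The trial requested more resources than the cluster has."""

class TuneTrialError(TuneError):
    """A trial failed because the wrapped estimator raised."""

def run_with_friendly_errors(run_trials):
    """Run Tune and repackage known failure modes for sklearn users."""
    try:
        run_trials()
    except TuneResourceError as e:
        # Translate into something a scikit-learn user can act on.
        raise RuntimeError(f"Cluster resources are insufficient: {e}") from e
    except TuneTrialError:
        # The underlying sklearn error is the interesting part: re-raise.
        raise
```

With distinct subclasses, tune-sklearn would no longer need to pattern-match error strings to decide how to present a failure.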
@richardliaw Got it, that makes sense. So just to recap:

- Currently, when a trial fails, a `TuneError` is raised and silently caught, allowing program execution to continue.
- This PR makes the following change: for trials that fail because of an underlying sklearn error, the resulting `TuneError`s are not silently caught. This fixes the issue in #108.
- However, as you pointed out in this PR, there are also explicit `TuneError`s that don't have an underlying sklearn error (such as insufficient cluster resources). These are still silently caught, allowing program execution to continue.

I propose that these `TuneError`s should be raised as well (only a small change). And as you said, we want to raise more specific types of `TuneError`s down the line.
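The behavioral change being recapped can be sketched with a toy stand-in (this is not the actual `tune_basesearch.py` code, just an illustration of the before/after contract):

```python
import logging

logger = logging.getLogger("tune_sklearn_sketch")

class TuneError(Exception):
    pass

def fit_before_pr(run_trial):
    # Old behavior: the TuneError is logged and swallowed, so fit()
    # appears to succeed even when the trial failed.
    try:
        run_trial()
    except TuneError as e:
        logger.warning("Trial failed: %s", e)
    return "fit completed"

def fit_after_pr(run_trial):
    # Proposed behavior: the error is logged and then re-raised, so
    # the caller sees the failure immediately.
    try:
        run_trial()
    except TuneError as e:
        logger.warning("Trial failed: %s", e)
        raise
    return "fit completed"
```

Under the old behavior a user only discovers the failure later (e.g. when `score()` hits an unfitted estimator); under the new behavior `fit()` itself raises.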
@amogkam raising TuneErrors sounds good to me; we should create a follow-up issue to actually catch/repackage those TuneErrors, though.
Can you ping when issues are addressed/tests are passing?
This PR raises the underlying errors on trial failure instead of just logging them and swallowing the exception.
Closes #108.
Depends on https://github.com/ray-project/ray/pull/11842
When running the following code:
This is the stacktrace without this PR:
Notice how the errors are logged but not raised, so program execution still continues.
With this PR: