ray-project / tune-sklearn

A drop-in replacement for Scikit-Learn's GridSearchCV / RandomizedSearchCV -- but with cutting-edge hyperparameter tuning techniques.
https://docs.ray.io/en/master/tune/api_docs/sklearn.html
Apache License 2.0

TuneError: Insufficient cluster resources to launch trial: trial requested #51

Closed. MislavSag closed this issue 4 years ago.

MislavSag commented 4 years ago

I have just tried the TuneSearchCV function with the search_optimization='bayesian' option:

    from sklearn.ensemble import RandomForestClassifier
    from tune_sklearn import TuneSearchCV

    # rand_state, cv (a PurgedKFold instance), X_train, y_train and
    # sample_weights are defined earlier in the script
    param_bayes = {
        'n_estimators': (50, 1000),
        'max_depth': (2, 7),
        'max_features': (1, 30)
        # 'min_weight_fraction_leaf': (0.03, 0.1, 'uniform')
    }

    # clf = joblib.load("rf_model.pkl")
    rf = RandomForestClassifier(criterion='entropy',
                                class_weight='balanced_subsample',
                                random_state=rand_state)

    # tune search
    tune_search = TuneSearchCV(
        rf,
        param_bayes,
        search_optimization='bayesian',
        max_iters=10,
        scoring='f1',
        n_jobs=16,
        cv=cv,
        verbose=1
    )

    tune_search.fit(X_train, y_train, sample_weight=sample_weights)

I get the following output:

== Status ==
Memory usage on this node: 23.8/31.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 16/32 CPUs, 0/0 GPUs, 0.0/7.28 GiB heap, 0.0/2.49 GiB objects
Result logdir: C:\Users\Mislav\ray_results\_Trainable
Number of trials: 10 (1 ERROR, 9 PENDING)

Trial name | status | loc | max_depth | max_features | n_estimators
-- | -- | -- | -- | -- | --
_Trainable_f5813982 | ERROR |   | 4 | 14 | 54
_Trainable_f5824ae6 | PENDING |   | 6 | 26 | 317
_Trainable_f5835c58 | PENDING |   | 7 | 7 | 863
_Trainable_f5846dcc | PENDING |   | 5 | 8 | 667
_Trainable_f5855824 | PENDING |   | 3 | 20 | 82
_Trainable_f586699c | PENDING |   | 5 | 6 | 516
_Trainable_f5877b08 | PENDING |   | 4 | 29 | 435
_Trainable_f5888c7a | PENDING |   | 4 | 26 | 823
_Trainable_f5899df8 | PENDING |   | 6 | 14 | 68
_Trainable_f58aaf58 | PENDING |   | 6 | 12 | 292

ERROR:ray.tune.ray_trial_executor:Trial _Trainable_f5824ae6: Unexpected error starting runner.
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\ray\tune\ray_trial_executor.py", line 294, in start_trial
    self._start_trial(trial, checkpoint, train=train)
  File "C:\ProgramData\Anaconda3\lib\site-packages\ray\tune\ray_trial_executor.py", line 233, in _start_trial
    runner = self._setup_remote_runner(trial, reuse_allowed)
  File "C:\ProgramData\Anaconda3\lib\site-packages\ray\tune\ray_trial_executor.py", line 129, in _setup_remote_runner
    trial.init_logger()
  File "C:\ProgramData\Anaconda3\lib\site-packages\ray\tune\trial.py", line 318, in init_logger
    self.local_dir)
  File "C:\ProgramData\Anaconda3\lib\site-packages\ray\tune\trial.py", line 310, in create_logdir
    dir=local_dir)
  File "C:\ProgramData\Anaconda3\lib\tempfile.py", line 366, in mkdtemp
    _os.mkdir(file, 0o700)
OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'C:\\Users\\Mislav\\ray_results\\_Trainable\\_Trainable_2_X_id=ObjectID(ffffffffffffffffffffffff0100008013000000),cv=PurgedKFold(n_splits=4, pct_embargo=0.0,\n      samples_inf_2020-07-21_08-32-35vlc0o8vf'
WARNING:ray.tune.utils.util:The `start_trial` operation took 2.002230405807495 seconds to complete, which may be a performance bottleneck.

---------------------------------------------------------------------------
TuneError                                 Traceback (most recent call last)
c:\Users\Mislav\Documents\GitHub\trademl\trademl\modeling\train_rf_sklearnopt.py in <module>
----> 229 tune_search.fit(X_train, y_train, sample_weight=sample_weights)

C:\ProgramData\Anaconda3\lib\site-packages\tune_sklearn\tune_basesearch.py in fit(self, X, y, groups, **fit_params)
    366                 ray.init(ignore_reinit_error=True, configure_logging=False)
    367 
--> 368             result = self._fit(X, y, groups, **fit_params)
    369 
    370             if not ray_init and ray.is_initialized():

C:\ProgramData\Anaconda3\lib\site-packages\tune_sklearn\tune_basesearch.py in _fit(self, X, y, groups, **fit_params)
    320 
    321         self._fill_config_hyperparam(config)
--> 322         analysis = self._tune_run(config, resources_per_trial)
    323 
    324         self.cv_results_ = self._format_results(self.n_splits, analysis)

C:\ProgramData\Anaconda3\lib\site-packages\tune_sklearn\tune_search.py in _tune_run(self, config, resources_per_trial)
    337                 fail_fast=True,
    338                 checkpoint_at_end=True,
--> 339                 resources_per_trial=resources_per_trial)
    340 
    341         return analysis

C:\ProgramData\Anaconda3\lib\site-packages\ray\tune\tune.py in run(run_or_experiment, name, stop, config, resources_per_trial, num_samples, local_dir, upload_dir, trial_name_creator, loggers, sync_to_cloud, sync_to_driver, checkpoint_freq, checkpoint_at_end, sync_on_checkpoint, keep_checkpoints_num, checkpoint_score_attr, global_checkpoint_period, export_formats, max_failures, fail_fast, restore, search_alg, scheduler, with_server, server_port, verbose, progress_reporter, resume, queue_trials, reuse_actors, trial_executor, raise_on_failed_trial, return_trials, ray_auto_init)
    325 
    326     while not runner.is_finished():
--> 327         runner.step()
    328         if verbose:
    329             _report_progress(runner, progress_reporter)

C:\ProgramData\Anaconda3\lib\site-packages\ray\tune\trial_runner.py in step(self)
    340             self._process_events()  # blocking
    341         else:
--> 342             self.trial_executor.on_no_available_trials(self)
    343 
    344         self._stop_experiment_if_needed()

C:\ProgramData\Anaconda3\lib\site-packages\ray\tune\trial_executor.py in on_no_available_trials(self, trial_runner)
    173                              self.resource_string(),
    174                              trial.get_trainable_cls().resource_help(
--> 175                                  trial.config)))
    176             elif trial.status == Trial.PAUSED:
    177                 raise TuneError("There are paused trials, but no more pending "

TuneError: Insufficient cluster resources to launch trial: trial requested 16 CPUs, 0 GPUs but the cluster has only 32 CPUs, 0 GPUs, 7.28 GiB heap, 2.49 GiB objects (1.0 node:192.168.1.4). Pass `queue_trials=True` in ray.tune.run() or on the command line to queue trials until the cluster scales up or resources become available. 
richardliaw commented 4 years ago

Thanks for opening this issue! This is a bug on our side and we will fix this ASAP.

I think the fix right now is just to pass n_jobs=None:
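A minimal sketch of that workaround, assuming the same `rf`, `param_bayes`, `cv`, and training arrays as in the snippet above:

    # identical to the original call, except that n_jobs=None sidesteps
    # the oversized per-trial CPU request (the suggested workaround)
    tune_search = TuneSearchCV(
        rf,
        param_bayes,
        search_optimization='bayesian',
        max_iters=10,
        scoring='f1',
        n_jobs=None,
        cv=cv,
        verbose=1
    )
    tune_search.fit(X_train, y_train, sample_weight=sample_weights)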

richardliaw commented 4 years ago

BTW @MislavSag, is it possible to reproduce this on Colab?

MislavSag commented 4 years ago

I don't have time right now; I am going on vacation tomorrow. Maybe in 2 weeks, if that's not too late.

richardliaw commented 4 years ago

No problem @MislavSag; enjoy your vacation and stay safe!

nasrin136 commented 4 years ago

@richardliaw I get this error with tune.run. I have a machine with 4 CPUs and 1 GPU. I initialize Ray with 3 CPUs and 1 GPU, and within tune.run I set resources_per_trial={"cpu": 1, "gpu": 0.5} (sketched below). At first there is no error: 2 trials run in parallel and around 20 trials complete. However, after a while I get this error:

"ray.tune.error.TuneError: Insufficient cluster resources to launch trial: trial requested 1 CPUs, 0.5 GPUs but the cluster has only 0 CPUs, 0 GPUs, 0.0 GiB heap, 0.0 GiB objects. Pass queue_trials=True in ray.tune.run() or on the command line to queue trials until the cluster scales up or resources become available."

Is this because of the bug you have pointed out earlier in the thread?
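For reference, a minimal sketch of that setup; the trainable and num_samples are placeholders, since the actual training code is not shown in this thread:

    import ray
    from ray import tune

    # make 3 CPUs and 1 GPU visible to Ray, as described above
    ray.init(num_cpus=3, num_gpus=1)

    # placeholder trainable standing in for the real training function
    def train_fn(config):
        tune.report(score=0.0)

    # each trial requests 1 CPU and half a GPU, so two trials fit at a time
    tune.run(
        train_fn,
        num_samples=40,  # arbitrary; around 20 trials completed before the error
        resources_per_trial={"cpu": 1, "gpu": 0.5},
    )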

richardliaw commented 4 years ago

@nasrin136 thanks for reaching out!

No, that sounds like another problem (actually more concerning). Could you post this issue on ray-project/ray and ping me when you do?

nasrin136 commented 4 years ago

Thanks @richardliaw for the prompt response. Done, and here is the link: https://github.com/ray-project/ray/issues/10756

richardliaw commented 4 years ago

Closing because this is stale/addressed.