Closed jmakov closed 2 years ago
It doesn't crash if the line os.environ["TUNE_RESULT_BUFFER_LENGTH"] = "0" is commented out.
Do you have a script we can try ourselves? Otherwise it's hard to fix this.
@krfricke here you go:
import os
import random

os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"  # https://github.com/ray-project/ray/issues/18903
os.environ["TUNE_DISABLE_AUTO_CALLBACK_SYNCER"] = "1"  # https://github.com/ray-project/ray/issues/18903
os.environ["TUNE_RESULT_BUFFER_LENGTH"] = "0"

import numpy as np
import ray
from ray import tune
from ray.tune.suggest import optuna


def evaluation_fn():
    return random.randint(1, 10_000)


def easy_objective(config, data):
    intermediate_score = evaluation_fn()
    tune.report(mean_loss=intermediate_score)


if __name__ == "__main__":
    ray.init(address='auto', _redis_password='xxx')
    df = np.zeros(10_000_000)
    search_optuna = optuna.OptunaSearch()
    analysis = tune.run(
        tune.with_parameters(easy_objective, data=df),
        name="test",
        metric="mean_loss",
        mode="max",
        search_alg=search_optuna,
        num_samples=-1,
        config={
            "width": tune.uniform(0, 20),
            "height": tune.uniform(-100, 100)
        },
        reuse_actors=True,
        fail_fast=True,
        verbose=1
    )
On my 3 node cluster it crashes with:
== Status ==
Memory usage on this node: 15.1/31.3 GiB
Using FIFO scheduling algorithm.
Resources requested: 51.0/52 CPUs, 0/2 GPUs, 0.0/102.14 GiB heap, 0.0/47.77 GiB objects (0.0/1.0 accelerator_type:GT, 0.0/1.0 accelerator_type:G)
Current best trial: a485397c with mean_loss=10000 and parameters={'width': 7.240691732613056, 'height': -58.746246442990405}
Result logdir: /home/toaster/ray_results/test
Number of trials: 27650/infinite (1 PENDING, 51 RUNNING, 27598 TERMINATED)
2021-10-14 20:42:07,566 ERROR trial_runner.py:846 -- Trial easy_objective_57d58564: Error processing event.
Traceback (most recent call last):
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 820, in _process_trial
    decision = self._process_trial_result(trial, result)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 873, in _process_trial_result
    trial.trial_id, result=flat_result)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/search_generator.py", line 132, in on_trial_complete
    trial_id=trial_id, result=result, error=error)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/optuna.py", line 385, in on_trial_complete
    self._ot_study.tell(ot_trial, val, state=ot_trial_state)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/study/study.py", line 662, in tell
    self._storage.set_trial_values(trial_id, values)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_in_memory.py", line 330, in set_trial_values
    self.check_trial_is_updatable(trial_id, trial.state)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_base.py", line 723, in check_trial_is_updatable
    "Trial#{} has already finished and can not be updated.".format(trial.number)
RuntimeError: Trial#5562 has already finished and can not be updated.
Traceback (most recent call last):
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 820, in _process_trial
    decision = self._process_trial_result(trial, result)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 873, in _process_trial_result
    trial.trial_id, result=flat_result)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/search_generator.py", line 132, in on_trial_complete
    trial_id=trial_id, result=result, error=error)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/optuna.py", line 385, in on_trial_complete
    self._ot_study.tell(ot_trial, val, state=ot_trial_state)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/study/study.py", line 662, in tell
    self._storage.set_trial_values(trial_id, values)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_in_memory.py", line 330, in set_trial_values
    self.check_trial_is_updatable(trial_id, trial.state)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_base.py", line 723, in check_trial_is_updatable
    "Trial#{} has already finished and can not be updated.".format(trial.number)
RuntimeError: Trial#5562 has already finished and can not be updated.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "test.py", line 44, in <module>
    verbose=1
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/tune.py", line 588, in run
    runner.step()
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 627, in step
    self._process_events(timeout=timeout)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 785, in _process_events
    self._process_trial(trial)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 847, in _process_trial
    self._process_trial_failure(trial, traceback.format_exc())
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 1058, in _process_trial_failure
    self._search_alg.on_trial_complete(trial.trial_id, error=True)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/search_generator.py", line 132, in on_trial_complete
    trial_id=trial_id, result=result, error=error)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/optuna.py", line 385, in on_trial_complete
    self._ot_study.tell(ot_trial, val, state=ot_trial_state)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/study/study.py", line 664, in tell
    self._storage.set_trial_state(trial_id, state)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_in_memory.py", line 223, in set_trial_state
    self.check_trial_is_updatable(trial_id, trial.state)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_base.py", line 723, in check_trial_is_updatable
    "Trial#{} has already finished and can not be updated.".format(trial.number)
RuntimeError: Trial#5562 has already finished and can not be updated.
This is still present in Ray 1.7.1 and nightly. This is a complete blocker: not only is Tune already slow (see https://github.com/ray-project/ray/issues/18903#issuecomment-951504211, 10x slower after a few hours), it also crashes after a few hours on the cluster.
Sorry for the late reply.
I couldn't reproduce the error with various settings for buffered results. However, we've cleaned up Ray Tune's buffering in the past months, so it might be that the error is resolved by some of those changes.
The core problem here lies in how Optuna handles duplicate results. Of course, trial completion shouldn't be called twice anyway.
https://github.com/ray-project/ray/pull/23495 addresses these problems so that Optuna wouldn't crash after receiving these results.
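To illustrate the failure mode described above, here is a minimal sketch of a searcher that ignores a second completion for the same trial. It is not the code from the linked PR, and it does not use Ray Tune or Optuna: StrictStudy, FinishedTrialError, and IdempotentSearcher are hypothetical stand-ins, and the "mean_loss" key is simply borrowed from the reproduction script.

```python
class FinishedTrialError(RuntimeError):
    """Hypothetical stand-in for the error a strict backend raises."""


class StrictStudy:
    """Hypothetical backend that rejects updates to finished trials."""

    def __init__(self):
        self.values = {}

    def tell(self, trial_id, value):
        if trial_id in self.values:
            raise FinishedTrialError(
                f"Trial {trial_id} has already finished and can not be updated."
            )
        self.values[trial_id] = value


class IdempotentSearcher:
    """Hypothetical wrapper that forwards each completion at most once."""

    def __init__(self, study):
        self._study = study
        self._completed = set()

    def on_trial_complete(self, trial_id, result=None, error=False):
        # A repeated completion is dropped instead of reaching the backend.
        if trial_id in self._completed:
            return False
        self._completed.add(trial_id)
        value = None if error or result is None else result["mean_loss"]
        self._study.tell(trial_id, value)
        return True


if __name__ == "__main__":
    study = StrictStudy()
    searcher = IdempotentSearcher(study)
    print(searcher.on_trial_complete("t1", {"mean_loss": 42}))  # forwarded
    print(searcher.on_trial_complete("t1", error=True))  # duplicate, ignored
```

In the traceback above, the second call arrives through the failure path (on_trial_complete with error=True) after the first call has already raised, which is why a guard of this kind would have to sit in front of the backend call rather than behind it.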
Ray Component
Ray Tune
What happened + What you expected to happen
After upgrading to Ray 1.7.0 (from 1.6.0), my script exits with an exception (previously there were only warnings).
Reproduction script
Script is using:
Warnings and exception from the script:
Anything else
Conda's env.yaml: