Open spyroot opened 2 years ago
I've been facing this issue for two weeks now; it occasionally interrupts my Tune experiments, which is quite painful. I'm using a functional trainable.
Trial trainable_2da1b3a7 completed.
2023-03-10 08:06:09,847 WARNING util.py:244 -- The `reset` operation took 2.008 s, which may be a performance bottleneck.
2023-03-10 08:06:09,848 ERROR ray_trial_executor.py:682 -- Trial trainable_18a9fc4f: Error starting runner, aborting!
Traceback (most recent call last):
File "/home/xxxxx/miniconda3/envs/dev/lib/python3.10/site-packages/ray/tune/execution/ray_trial_executor.py", line 680, in start_trial
return self._start_trial(trial)
File "/home/xxxxx/miniconda3/envs/dev/lib/python3.10/site-packages/ray/tune/execution/ray_trial_executor.py", line 521, in _start_trial
runner = self._setup_remote_runner(trial)
File "/home/xxxxx/miniconda3/envs/dev/lib/python3.10/site-packages/ray/tune/execution/ray_trial_executor.py", line 394, in _setup_remote_runner
existing_runner = self._maybe_use_cached_actor(trial, logger_creator)
File "/home/xxxxx/miniconda3/envs/dev/lib/python3.10/site-packages/ray/tune/execution/ray_trial_executor.py", line 370, in _maybe_use_cached_actor
raise _AbortTrialExecution(
ray.tune.error._AbortTrialExecution: Trainable runner reuse requires reset_config() to be implemented and return True.
2023-03-10 08:06:11,860 WARNING util.py:244 -- The `start_trial` operation took 4.023 s, which may be a performance bottleneck.
Error executing job with overrides: []
Traceback (most recent call last):
File "/home/xxxxx/miniconda3/envs/dev/lib/python3.10/site-packages/ray/tune/execution/ray_trial_executor.py", line 680, in start_trial
return self._start_trial(trial)
File "/home/xxxxx/miniconda3/envs/dev/lib/python3.10/site-packages/ray/tune/execution/ray_trial_executor.py", line 521, in _start_trial
runner = self._setup_remote_runner(trial)
File "/home/xxxxx/miniconda3/envs/dev/lib/python3.10/site-packages/ray/tune/execution/ray_trial_executor.py", line 394, in _setup_remote_runner
existing_runner = self._maybe_use_cached_actor(trial, logger_creator)
File "/home/xxxxx/miniconda3/envs/dev/lib/python3.10/site-packages/ray/tune/execution/ray_trial_executor.py", line 370, in _maybe_use_cached_actor
raise _AbortTrialExecution(
ray.tune.error._AbortTrialExecution: Trainable runner reuse requires reset_config() to be implemented and return True.
Is there any update on this problem?
I am also facing this issue today. It must be related to the reuse_actors=True setting. What I can provide for now is the combination of library versions I used.
TuneError: Traceback (most recent call last):
File "/opt/mlx_deploy/miniconda3/envs/mlx/lib/python3.7/site-packages/ray/tune/execution/ray_trial_executor.py", line 627, in start_trial
return self._start_trial(trial)
File "/opt/mlx_deploy/miniconda3/envs/mlx/lib/python3.7/site-packages/ray/tune/execution/ray_trial_executor.py", line 479, in _start_trial
runner = self._setup_remote_runner(trial)
File "/opt/mlx_deploy/miniconda3/envs/mlx/lib/python3.7/site-packages/ray/tune/execution/ray_trial_executor.py", line 390, in _setup_remote_runner
existing_runner = self._maybe_use_cached_actor(trial, logger_creator)
File "/opt/mlx_deploy/miniconda3/envs/mlx/lib/python3.7/site-packages/ray/tune/execution/ray_trial_executor.py", line 366, in _maybe_use_cached_actor
"Trainable runner reuse requires reset_config() to be "
ray.tune.error._AbortTrialExecution: Trainable runner reuse requires reset_config() to be implemented and return True.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/mlx_deploy/miniconda3/envs/mlx/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 1373, in _wait_and_handle_event
self._on_pg_ready(next_trial)
File "/opt/mlx_deploy/miniconda3/envs/mlx/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 1427, in _on_pg_ready
trial_started = _start_trial(next_trial)
File "/opt/mlx_deploy/miniconda3/envs/mlx/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 1417, in _start_trial
if self.trial_executor.start_trial(trial):
File "/opt/mlx_deploy/miniconda3/envs/mlx/lib/python3.7/site-packages/ray/tune/execution/ray_trial_executor.py", line 631, in start_trial
self._stop_trial(trial, exc=e)
File "/opt/mlx_deploy/miniconda3/envs/mlx/lib/python3.7/site-packages/ray/tune/execution/ray_trial_executor.py", line 591, in _stop_trial
if not error and self._maybe_cache_trial_actor(trial):
File "/opt/mlx_deploy/miniconda3/envs/mlx/lib/python3.7/site-packages/ray/tune/execution/ray_trial_executor.py", line 545, in _maybe_cache_trial_actor
acquired_resources = self._trial_to_acquired_resources[trial]
KeyError: train_model_9f259_00099
Same issue here; it occurs occasionally. Any update?
Any update on this?
I'm running into this same issue as well, and I only notice it on longer training sessions. I tried reproducing it by decreasing the number of epochs and training samples to something more trivial, but then the issue doesn't occur.
These errors seem to be coming from Tune's old execution engine in ray<2.5 -- are you also running into this with the latest version of Ray (2.7.1 as of now)? @jakemdaly @grizzlybearg @llkongs
Hi! I am using Ray version 2.8 and I also get this problem from time to time during my training runs.
Also hitting this issue on a batch of runs with version 2.8.0. It causes about 1 in 5 runs to error, which is a lot!
For now, you should work around this with tune_config=tune.TuneConfig(reuse_actors=False), as in the sketch below. It would help a lot if anyone could provide a minimal reproduction here!
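A minimal sketch of that workaround (train_fn and the search space are hypothetical placeholders, not from this thread), assuming the Ray 2.x ray.air.session reporting API: with actor reuse disabled, Tune never tries to reset a cached trainable between trials, so the reset_config() requirement never comes up.

```python
from ray import tune
from ray.air import session


def train_fn(config):
    # Stand-in functional trainable that reports a dummy metric.
    session.report({"score": config["x"] ** 2})


tuner = tune.Tuner(
    train_fn,
    param_space={"x": tune.uniform(0.0, 1.0)},
    tune_config=tune.TuneConfig(
        num_samples=10,
        reuse_actors=False,  # the suggested workaround: fresh actor per trial
    ),
)
results = tuner.fit()
```

This trades some per-trial startup overhead for correctness, since each trial gets its own actor instead of reusing a cached one.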
Thanks justin
I tried to reproduce by running num_samples=3 with the hyperparameters of the failing run, but they all succeeded.
@m-walters Could you provide the code that was used to cause 1/5 runs to error?
@justinvyu Many people have run into trouble with this, for example: https://discuss.ray.io/t/correct-implementation-for-ppo-reset-config/14307
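For class-based Trainables like the PPO case in that thread, the error message is asking for something along these lines (MyTrainable and its config keys are hypothetical, a minimal sketch rather than the linked thread's solution): reset_config() must re-apply the new trial config and return True so Tune knows the cached actor can be reused; the base implementation returns False, which triggers the abort above.

```python
from ray import tune


class MyTrainable(tune.Trainable):
    def setup(self, config):
        self.lr = config["lr"]
        self.steps = 0

    def step(self):
        self.steps += 1
        return {"score": self.lr * self.steps}

    def reset_config(self, new_config):
        # Re-apply whatever varies between trials instead of rebuilding the actor.
        self.lr = new_config["lr"]
        self.steps = 0
        return True  # returning False (the default) aborts the trial


tuner = tune.Tuner(
    MyTrainable,
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(num_samples=4, reuse_actors=True),
)
```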
Hi Folks,
I'm using tune.Tuner with a callable declared as a function and I'm getting this error:
Trainable runner reuse requires reset_config() to be implemented and return True.
My understanding was that you only need to implement reset_config() if your callable is a class, and that when you call tuner.fit() with a function trainable there is a default implementation.
Could you please clarify? It is very unclear how a callable that is a function is supposed to implement reset_config().
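For reference, a hedged sketch of the setup being described (objective() is a hypothetical stand-in, not my actual script): a plain function passed to tune.Tuner with actor reuse on. The function itself has nowhere to define reset_config(); as noted above, the wrapper Tune builds around function trainables is supposed to provide it, which is why hitting this error with a function trainable looks like a bug rather than a missing override.

```python
from ray import tune
from ray.air import session


def objective(config):  # functional trainable: no class, no reset_config()
    session.report({"score": config["x"]})


tuner = tune.Tuner(
    objective,
    param_space={"x": tune.uniform(0.0, 1.0)},
    tune_config=tune.TuneConfig(num_samples=20, reuse_actors=True),
)
tuner.fit()
```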
Versions / Dependencies
ray, version 2.0.0; Python 3.10
Reproduction script
Issue Severity
No response