ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[tune] reset_config questions #29268

Open spyroot opened 2 years ago

spyroot commented 2 years ago

Hi Folks,

I'm using tune.Tuner with a callable that is declared as a function, and I'm getting the following error:

Trainable runner reuse requires reset_config() to be implemented and return True.

    tuner = tune.Tuner(
        tune.with_resources(
            main,
            resources={"cpu": 4, "gpu": 1}
        ),
        param_space=search_space,
        tune_config=tune.TuneConfig(
            reuse_actors=True,
            num_samples=10,
            scheduler=ASHAScheduler(
                metric="mean_accuracy",
                mode="max",
                grace_period=1,
                reduction_factor=2
            ),
        ),
    )

My understanding was that you only need to implement reset_config() if your trainable is a class, and that tuner.fit() provides a default implementation.

Could you please clarify? It is very unclear how a trainable that is a plain function is supposed to implement reset_config().
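For reference, here is a minimal sketch (not from this issue) of how a class-based Trainable opts into actor reuse by overriding reset_config(); the class name and the "lr" config key are illustrative only. A function trainable has no equivalent hook, which is what the question above is about.

    from ray import tune


    class MyTrainable(tune.Trainable):
        def setup(self, config):
            # Build model/optimizer state from the trial's config.
            self.lr = config["lr"]

        def step(self):
            # One training iteration; return the metrics Tune should track.
            return {"mean_accuracy": 0.0}

        def reset_config(self, new_config):
            # Re-initialize state for the next trial so the cached actor
            # can be reused instead of being torn down and restarted.
            self.lr = new_config["lr"]
            return True  # True signals that the in-place reset succeeded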

Versions / Dependencies

Ray 2.0.0, Python 3.10

Reproduction script

import torch
from ray import tune
from ray.tune.schedulers import ASHAScheduler

def main(config):
    # SomeModel, device, lr, train_loop, and accuracy are placeholders
    # from the original report.
    model = SomeModel()
    model.to(device)

    # Create optimizer
    optim = torch.optim.Adam(model.parameters(), lr=lr)

    train_loop()
    # The original snippet ends with a truncated `report.`; in the functional
    # API this would report the metric that the scheduler optimizes, e.g.:
    tune.report(mean_accuracy=accuracy)

def foo():
    # search_space is a placeholder from the original report.
    tuner = tune.Tuner(
        tune.with_resources(
            main,
            resources={"cpu": 4, "gpu": 1}
        ),
        param_space=search_space,
        tune_config=tune.TuneConfig(
            reuse_actors=True,
            num_samples=10,
            scheduler=ASHAScheduler(
                metric="mean_accuracy",
                mode="max",
                grace_period=1,
                reduction_factor=2
            ),
        ),
    )
    tuner.fit()  # running the trials is what triggers the error

Issue Severity

No response

anhnami commented 1 year ago

I've been facing this issue for 2 weeks; it interrupts my Tune experiments occasionally. This is so painful. I'm using a functional trainable.

Trial trainable_2da1b3a7 completed.
2023-03-10 08:06:09,847 WARNING util.py:244 -- The `reset` operation took 2.008 s, which may be a performance bottleneck.
2023-03-10 08:06:09,848 ERROR ray_trial_executor.py:682 -- Trial trainable_18a9fc4f: Error starting runner, aborting!
Traceback (most recent call last):
  File "/home/xxxxx/miniconda3/envs/dev/lib/python3.10/site-packages/ray/tune/execution/ray_trial_executor.py", line 680, in start_trial
    return self._start_trial(trial)
  File "/home/xxxxx/miniconda3/envs/dev/lib/python3.10/site-packages/ray/tune/execution/ray_trial_executor.py", line 521, in _start_trial
    runner = self._setup_remote_runner(trial)
  File "/home/xxxxx/miniconda3/envs/dev/lib/python3.10/site-packages/ray/tune/execution/ray_trial_executor.py", line 394, in _setup_remote_runner
    existing_runner = self._maybe_use_cached_actor(trial, logger_creator)
  File "/home/xxxxx/miniconda3/envs/dev/lib/python3.10/site-packages/ray/tune/execution/ray_trial_executor.py", line 370, in _maybe_use_cached_actor
    raise _AbortTrialExecution(
ray.tune.error._AbortTrialExecution: Trainable runner reuse requires reset_config() to be implemented and return True.
2023-03-10 08:06:11,860 WARNING util.py:244 -- The `start_trial` operation took 4.023 s, which may be a performance bottleneck.
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/xxxxx/miniconda3/envs/dev/lib/python3.10/site-packages/ray/tune/execution/ray_trial_executor.py", line 680, in start_trial
    return self._start_trial(trial)
  File "/home/xxxxx/miniconda3/envs/dev/lib/python3.10/site-packages/ray/tune/execution/ray_trial_executor.py", line 521, in _start_trial
    runner = self._setup_remote_runner(trial)
  File "/home/xxxxx/miniconda3/envs/dev/lib/python3.10/site-packages/ray/tune/execution/ray_trial_executor.py", line 394, in _setup_remote_runner
    existing_runner = self._maybe_use_cached_actor(trial, logger_creator)
  File "/home/xxxxx/miniconda3/envs/dev/lib/python3.10/site-packages/ray/tune/execution/ray_trial_executor.py", line 370, in _maybe_use_cached_actor
    raise _AbortTrialExecution(
ray.tune.error._AbortTrialExecution: Trainable runner reuse requires reset_config() to be implemented and return True. 
anhnami commented 1 year ago

Is there any update on this problem?

PhilippWillms commented 1 year ago

I am also facing this issue today. It must be related to the reuse_actors=True setting. What I can provide today is the combination of library versions I used.

llkongs commented 1 year ago

TuneError: Traceback (most recent call last):
  File "/opt/mlx_deploy/miniconda3/envs/mlx/lib/python3.7/site-packages/ray/tune/execution/ray_trial_executor.py", line 627, in start_trial
    return self._start_trial(trial)
  File "/opt/mlx_deploy/miniconda3/envs/mlx/lib/python3.7/site-packages/ray/tune/execution/ray_trial_executor.py", line 479, in _start_trial
    runner = self._setup_remote_runner(trial)
  File "/opt/mlx_deploy/miniconda3/envs/mlx/lib/python3.7/site-packages/ray/tune/execution/ray_trial_executor.py", line 390, in _setup_remote_runner
    existing_runner = self._maybe_use_cached_actor(trial, logger_creator)
  File "/opt/mlx_deploy/miniconda3/envs/mlx/lib/python3.7/site-packages/ray/tune/execution/ray_trial_executor.py", line 366, in _maybe_use_cached_actor
    "Trainable runner reuse requires reset_config() to be "
ray.tune.error._AbortTrialExecution: Trainable runner reuse requires reset_config() to be implemented and return True.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/mlx_deploy/miniconda3/envs/mlx/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 1373, in _wait_and_handle_event
    self._on_pg_ready(next_trial)
  File "/opt/mlx_deploy/miniconda3/envs/mlx/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 1427, in _on_pg_ready
    trial_started = _start_trial(next_trial)
  File "/opt/mlx_deploy/miniconda3/envs/mlx/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 1417, in _start_trial
    if self.trial_executor.start_trial(trial):
  File "/opt/mlx_deploy/miniconda3/envs/mlx/lib/python3.7/site-packages/ray/tune/execution/ray_trial_executor.py", line 631, in start_trial
    self._stop_trial(trial, exc=e)
  File "/opt/mlx_deploy/miniconda3/envs/mlx/lib/python3.7/site-packages/ray/tune/execution/ray_trial_executor.py", line 591, in _stop_trial
    if not error and self._maybe_cache_trial_actor(trial):
  File "/opt/mlx_deploy/miniconda3/envs/mlx/lib/python3.7/site-packages/ray/tune/execution/ray_trial_executor.py", line 545, in _maybe_cache_trial_actor
    acquired_resources = self._trial_to_acquired_resources[trial]
KeyError: train_model_9f259_00099

Same issue here, it occurs occasionally. Any update?

grizzlybearg commented 1 year ago

Any update on this?

jakemdaly commented 1 year ago

I'm running into this same issue as well. I only notice it on longer training sessions. I tried reproducing it by reducing the number of epochs and training samples to something more trivial, and I don't see the issue occur then.

justinvyu commented 11 months ago

These errors seem to be coming from Tune's old execution engine in ray<2.5 -- are you also running into this with the latest version of Ray (2.7.1 as of now)? @jakemdaly @grizzlybearg @llkongs

Laiaborrell commented 11 months ago

These errors seem to be coming from Tune's old execution engine in ray<2.5 -- are you also running into this with the latest version of Ray (2.7.1 as of now)? @jakemdaly @grizzlybearg @llkongs

Hi! I am using Ray version 2.8 and I also get this problem from time to time during my training runs.

m-walters commented 11 months ago

Also seeing this issue on a batch of runs with version 2.8.0. It causes about 1/5 of the runs to error, which is a lot!

justinvyu commented 11 months ago

For now, you should work around this with tune_config=tune.TuneConfig(reuse_actors=False). It would help a lot if anyone could provide a minimal reproduction here!
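A minimal sketch of that workaround, reusing the setup from the original report (main and search_space stand in for your own trainable and parameter space):

    from ray import tune
    from ray.tune.schedulers import ASHAScheduler

    tuner = tune.Tuner(
        tune.with_resources(main, resources={"cpu": 4, "gpu": 1}),
        param_space=search_space,
        tune_config=tune.TuneConfig(
            reuse_actors=False,  # disable actor reuse so reset_config() is never required
            num_samples=10,
            scheduler=ASHAScheduler(metric="mean_accuracy", mode="max"),
        ),
    )
    results = tuner.fit()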

m-walters commented 11 months ago

Thanks Justin. I tried to reproduce by running num_samples=3 with the hyperparameters of the failing run, but those trials succeeded.

justinvyu commented 10 months ago

@m-walters Could you provide the code that was used to cause 1/5 runs to error?

a416297338 commented 3 weeks ago

@justinvyu Many people have run into this problem, for example: https://discuss.ray.io/t/correct-implementation-for-ppo-reset-config/14307