ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Tune] Exceptions when setting `max_concurrent_trials` > 1. #47868

Open elistevens opened 2 months ago

elistevens commented 2 months ago

What happened + What you expected to happen

There seems to be some sort of race condition in Ray Tune. When I run a Ray Tune search on my local system with `tune_config.max_concurrent_trials > 1`, I intermittently get the following two errors. The problem does not occur when `max_concurrent_trials = 1`.

Here are the two stack traces produced:

ray.exceptions.RayTaskError(FileNotFoundError): ray::ImplicitFunc.train() (pid=217773, ip=10.67.49.102, actor_id=c467ba7db070489d6ce6c1aa01000000, repr=func)
  File "/out_dir/python3_deps_pypi__ray/site-packages/ray/tune/trainable/trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "/out_dir/python3_deps_pypi__ray/site-packages/ray/air/_internal/util.py", line 104, in run
    self._ret = self._target(*self._args, **self._kwargs)
  File "/out_dir/python3_deps_pypi__ray/site-packages/ray/tune/trainable/function_trainable.py", line 45, in <lambda>
    training_func=lambda: self._trainable_func(self.config),
  File "/out_dir/python3_deps_pypi__ray/site-packages/ray/tune/trainable/function_trainable.py", line 250, in _trainable_func
    output = fn()
  File "/out_dir/zoox/ml/platform/ztrain/components/ray_entrypoint.py", line 223, in __call__
    return ray_trainer.fit().metrics
  File "/out_dir/python3_deps_pypi__ray/site-packages/ray/train/base_trainer.py", line 623, in fit
    result_grid = tuner.fit()
  File "/out_dir/python3_deps_pypi__ray/site-packages/ray/tune/tuner.py", line 377, in fit
    return self._local_tuner.fit()
  File "/out_dir/python3_deps_pypi__ray/site-packages/ray/tune/impl/tuner_internal.py", line 476, in fit
    analysis = self._fit_internal(trainable, param_space)
  File "/out_dir/python3_deps_pypi__ray/site-packages/ray/tune/impl/tuner_internal.py", line 592, in _fit_internal
    analysis = run(
  File "/out_dir/python3_deps_pypi__ray/site-packages/ray/tune/tune.py", line 994, in run
    runner.step()
  File "/out_dir/python3_deps_pypi__ray/site-packages/ray/tune/execution/tune_controller.py", line 701, in step
    raise e
  File "/out_dir/python3_deps_pypi__ray/site-packages/ray/tune/execution/tune_controller.py", line 698, in step
    self.checkpoint()
  File "/out_dir/python3_deps_pypi__ray/site-packages/ray/tune/execution/tune_controller.py", line 352, in checkpoint
    self._checkpoint_manager.sync_up_experiment_state(
  File "/out_dir/python3_deps_pypi__ray/site-packages/ray/tune/execution/experiment_state.py", line 167, in sync_up_experiment_state
    save_fn()
  File "/out_dir/python3_deps_pypi__ray/site-packages/ray/tune/execution/tune_controller.py", line 348, in save_to_dir
    self._search_alg.save_to_dir(driver_staging_path, session_str=self._session_str)
  File "/out_dir/python3_deps_pypi__ray/site-packages/ray/tune/search/basic_variant.py", line 405, in save_to_dir
    _atomic_save(
  File "/out_dir/python3_deps_pypi__ray/site-packages/ray/tune/utils/util.py", line 415, in _atomic_save
    os.replace(tmp_search_ckpt_path, os.path.join(checkpoint_dir, file_name))
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ray/session_2024-09-26_12-08-20_923491_216883/artifacts/2024-09-26_12-08-45/foo/bar:baz/driver_artifacts/.tmp_generator' -> '/tmp/ray/session_2024-09-26_12-08-20_923491_216883/artifacts/2024-09-26_12-08-45/foo/bar:baz/driver_artifacts/basic-variant-state-2024-09-26_12-08-45.json'

I tried overriding the `_atomic_save` function to use a more specific file name than `.tmp_generator`, but that did not meaningfully change the behavior. My next guess would be that the directory itself is being removed out from under the save, but I have no evidence of that.
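For reference, the override was a monkey-patch along these lines (a rough sketch, not my exact code; the uuid-based temp name is just illustrative):

    import os
    import uuid
    from ray import cloudpickle
    import ray.tune.utils.util as tune_util
    import ray.tune.search.basic_variant as basic_variant

    def _patched_atomic_save(state, checkpoint_dir, file_name, tmp_file_name):
        # Give every call its own temp file instead of the shared ".tmp_generator".
        unique_tmp = os.path.join(checkpoint_dir, f".tmp_{uuid.uuid4().hex}")
        with open(unique_tmp, "wb") as f:
            cloudpickle.dump(state, f)
        os.replace(unique_tmp, os.path.join(checkpoint_dir, file_name))

    # Patch both the defining module and the caller that imported the name.
    tune_util._atomic_save = _patched_atomic_save
    basic_variant._atomic_save = _patched_atomic_save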

2024-09-26 12:11:31,153 ERROR tune_controller.py:1331 -- Trial task failed for trial DEFAULT_b4049_00002
Traceback (most recent call last):
  File "/out_dir/python3_deps_pypi__ray/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/out_dir/python3_deps_pypi__ray/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/out_dir/python3_deps_pypi__ray/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/out_dir/python3_deps_pypi__ray/site-packages/ray/_private/worker.py", line 2691, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/out_dir/python3_deps_pypi__ray/site-packages/ray/_private/worker.py", line 871, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TrainingFailedError): ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::TrainTrainable.__init__() (pid=220138, ip=10.67.49.102, actor_id=e0fa49e5e254661a535b166601000000, repr=TorchTrainer)
  File "/out_dir/python3_deps_pypi__ray/site-packages/ray/tune/trainable/util.py", line 119, in setup
    setup_kwargs[k] = parameter_registry.get(prefix + k)
  File "/out_dir/python3_deps_pypi__ray/site-packages/ray/tune/registry.py", line 306, in get
    return ray.get(self.references[k])
ray.exceptions.OwnerDiedError: Failed to retrieve object 00590e2d6bd6f843ae0846a446f5d0aa0d8bfd160100000003e1f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs/*e004e098d85d4419019918bca0851d84a39ec15e75ec5f759532a141*` at IP address 10.67.49.102) for more information about the Python worker failure.

The cluster logs mentioned above don't contain additional relevant information, nor did setting `RAY_record_ref_creation_sites=1` offer more detail.

Versions / Dependencies

Python 3.10.14; Ray 2.10 and 2.37

Reproduction script

I don't have a minimal repro script.

It's possible that part of the issue is that my Ray jobs are either under-specified or cheap enough that Ray tries to do "too much" on the local node at once. I see multiple Tune trials running at once, even though only one training job fits on the GPU.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

justinvyu commented 1 month ago

@elistevens Are you running multiple Tune jobs concurrently, and are they launched at the same time? Could you try specifying a unique RunConfig(name) for each new job?

elistevens commented 1 month ago

Are you running multiple Tune jobs concurrently

To avoid potential miscommunication, I'll be specific: I'm only invoking the program once at a time, and the program only calls `ray.tune.Tuner` once. The thing being tuned is a GPU-based job (training MNIST, requesting 4 cores and 1 GPU) that only runs one at a time on my local system (8 cores, 1 GPU).
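To sketch the shape of the setup (simplified and paraphrased; `train_func` and the sample count are placeholders, not my actual code):

    from ray import tune
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    trainer = TorchTrainer(
        train_loop_per_worker=train_func,  # placeholder MNIST training loop
        scaling_config=ScalingConfig(
            num_workers=1,
            use_gpu=True,                     # 1 GPU per worker
            resources_per_worker={"CPU": 4},  # 4 cores per worker
        ),
    )
    tuner = tune.Tuner(
        trainer,
        tune_config=tune.TuneConfig(num_samples=8, max_concurrent_trials=2),
    )
    results = tuner.fit()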

I get the impression that some lightweight tuner processes get kicked off, do some setup work, and then sit mostly idle waiting for their turn to start a training run. I think these run in parallel when `max_concurrent_trials` isn't 1, but I haven't really confirmed that.

I don't change the RunConfig(name) per separate invocation of the program. Do I need to, if the sequential invocations aren't running at the same time?

alex-dr commented 1 week ago

I have also run into this.

See also https://github.com/ray-project/ray/issues/41362

We are using Ray Tune on a cluster with max_concurrent_trials=100 and run into this occasionally.

We are working around it by disabling checkpointing, as we don't need that feature.
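I'm not certain this is exactly the mechanism we used, but one way to approximate disabling the driver's experiment-state checkpointing is to raise the checkpoint period so the periodic saves effectively never fire (assuming the `TUNE_GLOBAL_CHECKPOINT_S` environment variable still controls that period in your Ray version):

    import os

    # Assumption: with a very large period, the driver rarely (if ever) hits
    # the _atomic_save path that races. Set this before calling tuner.fit().
    os.environ["TUNE_GLOBAL_CHECKPOINT_S"] = str(10**9)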

But the root of the issue is that `_atomic_save` is neither thread-safe nor multiprocess-safe, yet it is used in what appears to be a multithreaded context (checkpointing after multiple trials complete). From `ray/tune/utils/util.py` (imports and signature added for context):

    import os
    from ray import cloudpickle

    def _atomic_save(state, checkpoint_dir, file_name, tmp_file_name):
        tmp_search_ckpt_path = os.path.join(checkpoint_dir, tmp_file_name)
        with open(tmp_search_ckpt_path, "wb") as f:
            cloudpickle.dump(state, f)
        # Another thread from the same driver program jumps in and calls `os.replace` first,
        # consuming the shared temp file...
        os.replace(tmp_search_ckpt_path, os.path.join(checkpoint_dir, file_name))  # ...so this line fails with FileNotFoundError

The obvious options for a fix are lockfiles or randomized temp filenames.
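A minimal sketch of the randomized-filename variant (using `tempfile.mkstemp`; the signature mirrors the current `_atomic_save`, but this is an illustration rather than a proposed patch):

    import os
    import tempfile
    from ray import cloudpickle

    def _atomic_save(state, checkpoint_dir, file_name, tmp_file_name):
        # mkstemp gives each caller its own temp file in checkpoint_dir, so
        # concurrent savers can no longer race on a shared temp path.
        fd, tmp_path = tempfile.mkstemp(dir=checkpoint_dir, prefix=tmp_file_name)
        with os.fdopen(fd, "wb") as f:
            cloudpickle.dump(state, f)
        # os.replace is atomic on POSIX; whichever saver runs last wins file_name.
        os.replace(tmp_path, os.path.join(checkpoint_dir, file_name))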