ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

EOFError error during remote_worker_envs flags #46346

Open XavierGeerinck opened 2 months ago

XavierGeerinck commented 2 months ago

What happened + What you expected to happen

I am trying to run training with remote_worker_envs set to True, but I get an EOFError:

Traceback (most recent call last):
  File "~/.venv/lib/python3.11/site-packages/ray/rllib/env/env_runner_group.py", line 169, in __init__
    self._setup(
  File "~/.venv/lib/python3.11/site-packages/ray/rllib/env/env_runner_group.py", line 239, in _setup
    self.add_workers(
  File "~/.venv/lib/python3.11/site-packages/ray/rllib/env/env_runner_group.py", line 799, in add_workers
    raise result.get()
  File "~/.venv/lib/python3.11/site-packages/ray/rllib/utils/actor_manager.py", line 500, in _fetch_result
    result = ray.get(ready)
             ^^^^^^^^^^^^^^
  File "~/.venv/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "~/.venv/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "~/.venv/lib/python3.11/site-packages/ray/_private/worker.py", line 2639, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.venv/lib/python3.11/site-packages/ray/_private/worker.py", line 866, in get_objects
    raise value
ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::SingleAgentEnvRunner.__init__() (pid=83332, ip=127.0.0.1, actor_id=1ca45720433ca900e1057f5801000000, repr=<ray.rllib.env.single_agent_env_runner.SingleAgentEnvRunner object at 0x149cc1210>)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.venv/lib/python3.11/site-packages/ray/rllib/env/single_agent_env_runner.py", line 79, in __init__
    self.make_env()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.venv/lib/python3.11/site-packages/ray/rllib/env/single_agent_env_runner.py", line 764, in make_env
    gym.vector.make(
  File "~/.venv/lib/python3.11/site-packages/gymnasium/vector/__init__.py", line 82, in make
    return AsyncVectorEnv(env_fns) if asynchronous else SyncVectorEnv(env_fns)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.venv/lib/python3.11/site-packages/gymnasium/vector/async_vector_env.py", line 169, in __init__
    self._check_spaces()
  File "~/.venv/lib/python3.11/site-packages/gymnasium/vector/async_vector_env.py", line 504, in _check_spaces
    results, successes = zip(*[pipe.recv() for pipe in self.parent_pipes])
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.venv/lib/python3.11/site-packages/gymnasium/vector/async_vector_env.py", line 504, in <listcomp>
    results, successes = zip(*[pipe.recv() for pipe in self.parent_pipes])
                               ^^^^^^^^^^^
  File "/Users/xaviergeerinck/.pyenv/versions/3.11.8/lib/python3.11/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
          ^^^^^^^^^^^^^^^^^^
  File "/Users/xaviergeerinck/.pyenv/versions/3.11.8/lib/python3.11/multiprocessing/connection.py", line 430, in _recv_bytes
    buf = self._recv(4)
          ^^^^^^^^^^^^^
  File "/Users/xaviergeerinck/.pyenv/versions/3.11.8/lib/python3.11/multiprocessing/connection.py", line 399, in _recv
    raise EOFError
EOFError
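The last frames of the traceback end in `pipe.recv()` inside `multiprocessing/connection.py`. In the standard library, `Connection.recv()` raises `EOFError` when the other end of the pipe has been closed and nothing is left to receive. This usually means the process on the other side went away before replying. The snippet below is a minimal stdlib-only sketch of that behavior. It is not RLlib or Gymnasium code, and the function name `recv_from_closed_peer` is made up for illustration.

```python
import multiprocessing as mp


def recv_from_closed_peer() -> str:
    """Return the name of the outcome when receiving from a closed pipe end."""
    parent_conn, child_conn = mp.Pipe()
    # Assumption for illustration: closing the child end stands in for a
    # worker process that exited before sending its reply.
    child_conn.close()
    try:
        parent_conn.recv()
    except EOFError:
        return "EOFError"
    finally:
        parent_conn.close()
    return "no error"


if __name__ == "__main__":
    print(recv_from_closed_peer())
```

If this reading applies here, the root cause would be whatever stopped the environment subprocess, not the pipe itself.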

Versions / Dependencies

platform : macOS 14.2 23C64 (arm64)
memory : 48.0 GB
cpu : 16 cores
mac : 6a:3c:67:54:c1:4d
ip : 192.168.4.76
model_info : Mac15,9 (MUW73LL/A)
kernel_version : 23.2.0
git_commit_sha : Unknown
python_version : 3.11.8 (~/.venv/bin/python)
pip_version : 24.0 (~/.venv/lib/python3.11/site-packages/pip)
torch_version : 2.3.0
docker_version : 24.0.7
kubernetes_version : 1.28.2
ray_version : 2.31.0
nvidia_smi : Unknown, nvidia-smi was not found
nvidia_cuda : Unknown, nvcc was not found
is_tty : True

Reproduction script

.env_runners(
    num_env_runners=1,
    num_envs_per_env_runner=8,
    num_cpus_per_env_runner=1,
    num_gpus_per_env_runner=0,
    sample_timeout_s=60,
    remote_worker_envs=True,
    rollout_fragment_length="auto",
)

Issue Severity

High: It blocks me from completing my task.

simonsays1980 commented 1 month ago

@XavierGeerinck Thanks for raising this issue. Can you provide a reproducible example? I guess this happens on the new API stack, which does not yet support asynchronous vector environments (we are waiting for a gymnasium update).

XavierGeerinck commented 1 month ago

Awesome! We are indeed thinking the same and are awaiting the 1.0.0a2 release to start testing. Is there any ETA that you are aware of?

simonsays1980 commented 1 month ago

This should come soon, but we do not know of an ETA for it. Would it help, for the time being, to just sample with more Env Runners but a single env in each of them?
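The workaround suggested here could look roughly like the sketch below. It reuses the keyword names from the reproduction script and swaps one runner with eight envs for eight runners with one env each. The count of 8, the name `env_runner_kwargs`, and setting `remote_worker_envs` to False are assumptions made for illustration, not something confirmed in this thread.

```python
# Hypothetical keyword arguments for the `.env_runners(...)` call shown in
# the reproduction script. Only the values differ from the original report.
env_runner_kwargs = dict(
    num_env_runners=8,          # was 1: parallelism now comes from runners
    num_envs_per_env_runner=1,  # was 8: a single env per runner
    num_cpus_per_env_runner=1,
    num_gpus_per_env_runner=0,
    sample_timeout_s=60,
    remote_worker_envs=False,   # assumption: avoid the async vector env path
    rollout_fragment_length="auto",
)

# Total number of environment copies stays the same as in the report (1 x 8).
total_envs = (
    env_runner_kwargs["num_env_runners"]
    * env_runner_kwargs["num_envs_per_env_runner"]
)

# Usage, assuming `config` is the algorithm config from the original script:
# config = config.env_runners(**env_runner_kwargs)
```

With `num_cpus_per_env_runner=1`, this layout asks for eight CPUs for the env runners, which fits within the 16 cores listed in the report.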