ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] ray.exceptions.RaySystemError: System error: Broken pipe #37197

Open kostrykin opened 1 year ago

kostrykin commented 1 year ago

What happened + What you expected to happen

  1. Bug: The error ray.exceptions.RaySystemError: System error: Broken pipe is raised when using ray.put.
  2. Expected behavior: The code runs through without error.
  3. Useful information: See below.

The console output when running the reproduction script:

(raylet) [2023-07-07 16:34:25,145 E 2740582 2740628] (raylet) agent_manager.cc:135: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. Agent can fail when
(raylet) - The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version `pip freeze | grep grpcio`.
(raylet) - The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/logs/dashboard_agent.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
(raylet) - The agent is killed by the OS (e.g., out of memory).
Traceback (most recent call last):
  File "~/Documents/ray-test/test-ray.py", line 11, in <module>
    ray.put(img)
  File "~/.anaconda3/envs/ray-test/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 18, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "~/.anaconda3/envs/ray-test/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "~/.anaconda3/envs/ray-test/lib/python3.10/site-packages/ray/_private/worker.py", line 2612, in put
    object_ref = worker.put_object(value, owner_address=serialize_owner_address)
  File "~/.anaconda3/envs/ray-test/lib/python3.10/site-packages/ray/_private/worker.py", line 710, in put_object
    self.core_worker.put_serialized_object_and_increment_local_ref(
  File "python/ray/_raylet.pyx", line 2707, in ray._raylet.CoreWorker.put_serialized_object_and_increment_local_ref
  File "python/ray/_raylet.pyx", line 2596, in ray._raylet.CoreWorker._create_put_buffer
  File "python/ray/_raylet.pyx", line 456, in ray._raylet.check_status
ray.exceptions.RaySystemError: System error: Broken pipe

The reproduction script reported below is a minimal example: simplifying it any further (e.g., using range(2) instead of range(3) in the code below) eliminates the error.

I have also found two possible work-arounds:

  1. Use the dependency versions python =3.9.16 and ray-core =1.6.0 instead of the versions reported below.
  2. Use the dependency versions python =3.8.5, numpy =1.20.3, scipy =1.6.3, ray-core =2.3.0 instead of the versions reported below.

The computer which I used for testing was equipped with 32 GiB of RAM.

Versions / Dependencies

OS: Ubuntu 20.04.6

Dependencies:

python =3.10.12
numpy =1.25.0
scipy =1.11.1
ray-core =2.5.1

Reproduction script

import numpy as np
import ray
import scipy.ndimage as ndi

ray.init(num_cpus=1)
img = np.zeros((1344, 1024))

for _ in range(3):
    ndi.gaussian_laplace(img, 200)

ray.put(img)

Issue Severity

Medium: It is a significant difficulty but I can work around it.

jjyao commented 1 year ago

Are you using conda Ray package?

kostrykin commented 1 year ago

> Are you using conda Ray package?

@jjyao Yes, from conda-forge.

rynewang commented 1 year ago

I am trying to reproduce the issue but have not been successful yet. Would you mind helping by running the repro script again and posting the output of these commands?

pip freeze | grep grpcio

and

cat /tmp/ray/session_latest/logs/dashboard_agent*
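For anyone who would rather collect the same two pieces of information from Python, here is a minimal sketch. It assumes the default log location quoted in the raylet message above; the helper names `grpcio_version` and `tail_logs` are made up for illustration and are not part of Ray.

```python
import glob
from importlib import metadata
from pathlib import Path

# Log location taken from the raylet message above; adjust it if Ray
# was started with a different temp directory (assumption: default path).
DEFAULT_LOG_GLOB = "/tmp/ray/session_latest/logs/dashboard_agent*"


def grpcio_version():
    """Return the installed grpcio version, or None if grpcio is not installed."""
    try:
        return metadata.version("grpcio")
    except metadata.PackageNotFoundError:
        return None


def tail_logs(pattern=DEFAULT_LOG_GLOB, max_lines=50):
    """Map each log file matching `pattern` to its last `max_lines` lines."""
    tails = {}
    for name in sorted(glob.glob(pattern)):
        text = Path(name).read_text(errors="replace")
        tails[name] = text.splitlines()[-max_lines:]
    return tails


if __name__ == "__main__":
    print("grpcio:", grpcio_version())
    for name, lines in tail_logs().items():
        print(f"==> {name} <==")
        print("\n".join(lines))
```

If no file matches the pattern, `tail_logs` simply returns an empty dict, which is itself a hint that the agent never wrote a log.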

rynewang commented 1 year ago

My env:

Macbook M1

python==3.10.12 (conda)
numpy==1.25.0 (conda)
scipy==1.11.1 (pip, conda version is missing liblapack.3.dylib)
ray==2.5.1 (pip, conda does not have macos package)

and ran the repro script without a problem.

kostrykin commented 1 year ago

> I am trying to reproduce the issue but have not been successful yet. Would you mind helping by running the repro script again and posting the output of these commands?
>
> pip freeze | grep grpcio

grpcio @ file:///home/conda/feedstock_root/build_artifacts/grpc-split_1675287624183/work

> and
>
> cat /tmp/ray/session_latest/logs/dashboard_agent*

2023-07-19 10:46:39,997 INFO agent.py:117 -- Parent pid is 2999273
2023-07-19 10:46:39,998 INFO agent.py:143 -- Dashboard agent grpc address: 0.0.0.0:59997
2023-07-19 10:46:39,999 INFO utils.py:112 -- Get all modules by type: DashboardAgentModule
2023-07-19 10:46:40,000 INFO utils.py:123 -- Module ray.dashboard.modules.actor.actor_head cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'aiohttp'
2023-07-19 10:46:40,051 INFO utils.py:123 -- Module ray.dashboard.modules.event.event_head cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'aiohttp'
2023-07-19 10:46:40,052 INFO utils.py:123 -- Module ray.dashboard.modules.healthz.healthz_agent cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'aiohttp'
2023-07-19 10:46:40,052 INFO utils.py:123 -- Module ray.dashboard.modules.healthz.healthz_head cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'aiohttp'
2023-07-19 10:46:40,052 INFO utils.py:123 -- Module ray.dashboard.modules.job.cli cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'pydantic'
2023-07-19 10:46:40,053 INFO utils.py:123 -- Module ray.dashboard.modules.job.job_agent cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'aiohttp'
2023-07-19 10:46:40,053 INFO utils.py:123 -- Module ray.dashboard.modules.job.job_head cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'aiohttp'
2023-07-19 10:46:40,054 INFO utils.py:123 -- Module ray.dashboard.modules.job.job_manager cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'pydantic'
2023-07-19 10:46:40,054 INFO utils.py:123 -- Module ray.dashboard.modules.job.pydantic_models cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'pydantic'
2023-07-19 10:46:40,055 INFO utils.py:123 -- Module ray.dashboard.modules.log.log_agent cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'aiohttp'
2023-07-19 10:46:40,055 INFO utils.py:123 -- Module ray.dashboard.modules.log.log_head cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'aiohttp'
2023-07-19 10:46:40,067 INFO utils.py:123 -- Module ray.dashboard.modules.log.log_manager cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'pydantic'
2023-07-19 10:46:40,070 INFO utils.py:123 -- Module ray.dashboard.modules.metrics.metrics_head cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'aiohttp'
2023-07-19 10:46:40,071 INFO utils.py:123 -- Module ray.dashboard.modules.node.node_head cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'aiohttp'
2023-07-19 10:46:40,071 INFO utils.py:123 -- Module ray.dashboard.modules.reporter.reporter_agent cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'opencensus'
2023-07-19 10:46:40,072 INFO utils.py:123 -- Module ray.dashboard.modules.reporter.reporter_head cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'aiohttp'
2023-07-19 10:46:40,073 INFO utils.py:123 -- Module ray.dashboard.modules.serve.serve_agent cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'aiohttp'
2023-07-19 10:46:40,073 INFO utils.py:123 -- Module ray.dashboard.modules.serve.serve_head cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'aiohttp'
2023-07-19 10:46:40,074 INFO utils.py:123 -- Module ray.dashboard.modules.snapshot.snapshot_head cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'aiohttp'
2023-07-19 10:46:40,074 INFO utils.py:123 -- Module ray.dashboard.modules.state.state_head cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'aiohttp'
2023-07-19 10:46:40,074 INFO utils.py:123 -- Module ray.dashboard.modules.test.test_agent cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'aiohttp'
2023-07-19 10:46:40,075 INFO utils.py:123 -- Module ray.dashboard.modules.test.test_head cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'aiohttp'
2023-07-19 10:46:40,075 INFO utils.py:123 -- Module ray.dashboard.modules.test.test_utils cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'async_timeout'
2023-07-19 10:46:40,076 INFO utils.py:145 -- Available modules: [<class 'ray.dashboard.modules.runtime_env.runtime_env_agent.RuntimeEnvAgent'>]
2023-07-19 10:46:40,076 INFO agent.py:172 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.runtime_env.runtime_env_agent.RuntimeEnvAgent'>
2023-07-19 10:46:40,076 INFO agent.py:177 -- Loaded 1 modules.
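Most of the log above is the same INFO message repeated for different dashboard modules. To see at a glance which packages the log reports as missing, a small parser like the following can help. This is only a sketch: the regular expression is written against the line format shown above, and `missing_dependencies` is a hypothetical helper, not a Ray API.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   ... Module ray.dashboard.modules.job.cli cannot be loaded because ...
#   Error: No module named 'pydantic'
PATTERN = re.compile(
    r"Module (?P<module>\S+) cannot be loaded .*No module named '(?P<missing>[^']+)'"
)


def missing_dependencies(log_text):
    """Group dashboard modules that failed to load by the package they could not import."""
    grouped = defaultdict(list)
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            grouped[match.group("missing")].append(match.group("module"))
    return dict(grouped)


if __name__ == "__main__":
    sample = (
        "2023-07-19 10:46:40,000 INFO utils.py:123 -- Module "
        "ray.dashboard.modules.actor.actor_head cannot be loaded because we cannot "
        "import all dependencies. Install this module using `pip install "
        "'ray[default]'` for the full dashboard functionality. "
        "Error: No module named 'aiohttp'\n"
        "2023-07-19 10:46:40,076 INFO agent.py:177 -- Loaded 1 modules.\n"
    )
    for package, modules in missing_dependencies(sample).items():
        print(f"{package}: {len(modules)} module(s) skipped")
```

Applied to the full log, this would group the skipped modules under the four package names that appear in it (aiohttp, pydantic, opencensus, async_timeout). Whether those missing optional packages are related to the broken pipe is not established in this thread.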

> My env:
>
> Macbook M1
>
> python==3.10.12 (conda)
> numpy==1.25.0 (conda)
> scipy==1.11.1 (pip, conda version is missing liblapack.3.dylib)
> ray==2.5.1 (pip, conda does not have macos package)
>
> and ran the repro script without a problem.

Maybe it's due to the OS? As I reported in the issue, I am using Ubuntu 20.04.6. Or due to scipy and ray being installed via pip instead of Conda?
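One way to compare the two setups on the pip-versus-Conda question is to read the `INSTALLER` record that Python packaging tools leave in each distribution's metadata. The sketch below uses only the standard library. It assumes that conda-built packages record `conda` and pip-installed ones record `pip`, which is the usual convention but not guaranteed, and that the distribution is named `ray` in both cases. `installer_of` is a hypothetical helper name.

```python
from importlib import metadata


def installer_of(dist_name):
    """Return the tool recorded as the installer of a distribution.

    Returns "not installed" if the distribution is absent and "unknown"
    if it has no INSTALLER record.
    """
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return "not installed"
    recorded = dist.read_text("INSTALLER")  # None when the file is missing
    return recorded.strip() if recorded else "unknown"


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them to the environment.
    for name in ("ray", "scipy", "numpy", "grpcio"):
        print(f"{name}: {installer_of(name)}")
```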

Here is the full Conda environment, obtained by conda list, including all dependencies of the packages listed in the issue and their dependencies:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
aiosignal                 1.3.1              pyhd8ed1ab_0    conda-forge
attrs                     23.1.0             pyh71513ae_1    conda-forge
brotli-python             1.0.9           py310hd8f1fbe_9    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.19.1               hd590300_0    conda-forge
ca-certificates           2023.5.7             hbcca054_0    conda-forge
certifi                   2023.5.7           pyhd8ed1ab_0    conda-forge
charset-normalizer        3.2.0              pyhd8ed1ab_0    conda-forge
click                     8.1.6           unix_pyh707e725_0    conda-forge
colorama                  0.4.6              pyhd8ed1ab_0    conda-forge
filelock                  3.12.2             pyhd8ed1ab_0    conda-forge
frozenlist                1.4.0           py310h2372a71_0    conda-forge
grpc-cpp                  1.48.1               h4fad500_3    conda-forge
grpcio                    1.48.1          py310h4a5735c_3    conda-forge
idna                      3.4                pyhd8ed1ab_0    conda-forge
importlib_resources       6.0.0              pyhd8ed1ab_1    conda-forge
jsonschema                4.18.4             pyhd8ed1ab_0    conda-forge
jsonschema-specifications 2023.7.1           pyhd8ed1ab_0    conda-forge
ld_impl_linux-64          2.40                 h41732ed_0    conda-forge
libabseil                 20220623.0      cxx17_h05df665_6    conda-forge
libblas                   3.9.0           17_linux64_openblas    conda-forge
libcblas                  3.9.0           17_linux64_openblas    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 13.1.0               he5830b7_0    conda-forge
libgfortran-ng            13.1.0               h69a702a_0    conda-forge
libgfortran5              13.1.0               h15d22d2_0    conda-forge
libgomp                   13.1.0               he5830b7_0    conda-forge
liblapack                 3.9.0           17_linux64_openblas    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libopenblas               0.3.23          pthreads_h80387f5_0    conda-forge
libprotobuf               3.21.12              h3eb15da_0    conda-forge
libsqlite                 3.42.0               h2797004_0    conda-forge
libstdcxx-ng              13.1.0               hfd8a6a1_0    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libzlib                   1.2.13               hd590300_5    conda-forge
msgpack-python            1.0.5           py310hdf3cbec_0    conda-forge
ncurses                   6.4                  hcb278e6_0    conda-forge
numpy                     1.25.0          py310ha4c1d20_0    conda-forge
openssl                   3.1.1                hd590300_1    conda-forge
packaging                 23.1               pyhd8ed1ab_0    conda-forge
pip                       23.2               pyhd8ed1ab_0    conda-forge
pkgutil-resolve-name      1.3.10             pyhd8ed1ab_0    conda-forge
platformdirs              3.9.1              pyhd8ed1ab_0    conda-forge
pooch                     1.7.0              pyha770c72_3    conda-forge
protobuf                  4.21.12         py310heca2aa9_0    conda-forge
psutil                    5.9.5           py310h1fa729e_0    conda-forge
pysocks                   1.7.1              pyha2e5f31_6    conda-forge
python                    3.10.12         hd12c33a_0_cpython    conda-forge
python_abi                3.10                    3_cp310    conda-forge
pyyaml                    6.0             py310h5764c6d_5    conda-forge
ray-core                  2.5.1           py310h2ca9b2b_0    conda-forge
re2                       2023.02.01           hcb278e6_0    conda-forge
readline                  8.2                  h8228510_1    conda-forge
referencing               0.30.0             pyhd8ed1ab_0    conda-forge
requests                  2.31.0             pyhd8ed1ab_0    conda-forge
rpds-py                   0.9.2           py310hcb5633a_0    conda-forge
scipy                     1.11.1          py310ha4c1d20_0    conda-forge
setproctitle              1.2.2           py310h5764c6d_2    conda-forge
setuptools                68.0.0             pyhd8ed1ab_0    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
tk                        8.6.12               h27826a3_0    conda-forge
typing-extensions         4.7.1                hd8ed1ab_0    conda-forge
typing_extensions         4.7.1              pyha770c72_0    conda-forge
tzdata                    2023c                h71feb2d_0    conda-forge
urllib3                   2.0.3              pyhd8ed1ab_1    conda-forge
wheel                     0.40.0             pyhd8ed1ab_1    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
yaml                      0.2.5                h7f98852_2    conda-forge
zipp                      3.16.2             pyhd8ed1ab_0    conda-forge
zlib                      1.2.13               hd590300_5    conda-forge

dss010101 commented 1 year ago

I'm seeing this issue as well. Does anyone know whether rolling back to a previous version solves this? I am using Debian 12 (bookworm).

What's interesting is that I have two containers built from exactly the same image, just with different tags and names, both running on the same server. The first instance that I brought up works fine. The second instance exhibits the issue.

UPDATE: Upon testing this further, it seems it can happen in either container. These two containers are on the same server but serve as different environments: one represents QA and the other prod. If I kick the process off at the same time, I see the broken pipes in both:

  File "/apps/data/publish/publisher.py", line 23, in write_data
    log_ref = ray.put(self.log) if async_write else None
              ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ray/_private/worker.py", line 2597, in put
    object_ref = worker.put_object(value, owner_address=serialize_owner_address)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ray/_private/worker.py", line 704, in put_object
    self.core_worker.put_serialized_object_and_increment_local_ref(
  File "python/ray/_raylet.pyx", line 2939, in ray._raylet.CoreWorker.put_serialized_object_and_increment_local_ref
  File "python/ray/_raylet.pyx", line 2831, in ray._raylet.CoreWorker._create_put_buffer
  File "python/ray/_raylet.pyx", line 412, in ray._raylet.check_status

I was using the latest Ray, 2.6.3. I rolled back to 2.6.1 and still see the issue.

dss010101 commented 1 year ago

It feels like the two containers may be sharing some Ray resources and interfering with each other. Is there a way to start Ray in each container so that it is independent and local to that container only? At the moment I'm simply using it as a replacement for the Python multiprocessing library. I'm seeing a lot of these errors in some of the Ray logs:

[2023-09-17 11:21:52,199 I 41 347] raylet_client.cc:364: Error reporting task backlog information: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
[2023-09-17 11:21:53,199 I 41 347] raylet_client.cc:364: Error reporting task backlog information: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
[2023-09-17 11:21:54,190 E 41 347] gcs_rpc_client.h:547: Failed to connect to GCS within 60 seconds. GCS may have been killed. It's either GCS is terminated by `ray stop` or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure. The program will terminate.
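Since the last log line reports a failure to connect to GCS, one quick check from inside each container is whether the address the worker is trying to reach accepts TCP connections at all. The following is a minimal standard-library sketch, not a Ray API. The host and port in the usage line are placeholders, because the thread does not state the actual GCS address.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder address: substitute the GCS address used by the deployment.
    print(can_connect("127.0.0.1", 6379))
```

Running the same check in both containers at the moment the errors appear would show whether they are both reaching one endpoint or whether each has its own.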