ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Ray Core] Ray agent crashes: grpcio version mismatch and unexpected errors (dashboard_agent, runtime_env_agent) #45519

Open Manish-2004 opened 4 months ago

Manish-2004 commented 4 months ago

What happened + What you expected to happen

After submitting a Ray hyperparameter tuning job using JobSubmissionClient, we get the error below. We were advised to downgrade the grpcio version, but the same error persists: https://ray-distributed.slack.com/archives/CNECXMW22/p1716203022795889

```
(raylet) agent_manager.cc:84: The raylet exited immediately because one Ray agent failed, agent_name = dashboard_agent/1656919085 [repeated 2x across cluster]
(Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
The raylet fate shares with the agent. This can happen because: [repeated 2x across cluster]
- The version of grpcio doesn't follow Ray's requirement. Agent can segfault with the incorrect grpcio version. Check the grpcio version: pip freeze | grep grpcio. [repeated 2x across cluster]
- The agent failed to start because of unexpected error or port conflict. Read the log: cat /tmp/ray/session_latest/logs/{dashboard_agent|runtime_env_agent}.log. You can find the log file structure here: https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#logging-directory-structure. [repeated 2x across cluster]
- The agent is killed by the OS (e.g., out of memory). [repeated 2x across cluster]
```
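The first cause listed in that message (a grpcio build that doesn't match what Ray requires) can be checked from inside the job's environment without shelling out to pip. A minimal sketch using only the standard library — the helper name `installed_version` is ours, not part of Ray:

```python
from importlib import metadata

def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

# Compare this against the grpcio range pinned by the Ray build on the cluster;
# a mismatch between the image's grpcio and Ray's requirement is what the
# agent error message warns about.
print("grpcio:", installed_version("grpcio"))
print("ray:", installed_version("ray"))
```

Running this in both the head-node image and the job's runtime environment helps confirm whether the two actually resolve to the same grpcio version.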

Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):

```
[2024-05-22 22:21:12,222 E 89 89] (raylet) logging.cc:101: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): bind: Address already in use [system:98 at external/boost/boost/asio/detail/reactive_socket_service.hpp:161 in function 'bind']
[2024-05-22 22:21:12,307 E 89 89] (raylet) logging.cc:108: Stack trace:
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xbab668) [0x5613a3487668] ray::TerminateHandler()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x1cd1e4) [0x5613a2aa91e4] boost::throw_exception<>()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xc8dc3b) [0x5613a3569c3b] boost::asio::detail::do_throw_error()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x31a9e8) [0x5613a2bf69e8] ray::raylet::Raylet::Raylet()
*** SIGABRT received at time=1716441672 on cpu 2 ***
PC: @ 0x7f8d2128b00b (unknown) raise
[2024-05-22 22:21:12,310 E 89 89] (raylet) logging.cc:365: *** SIGABRT received at time=1716441672 on cpu 2 ***
[2024-05-22 22:21:12,310 E 89 89] (raylet) logging.cc:365: PC: @ 0x7f8d2128b00b (unknown) raise
```
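The `bind: Address already in use` failure points at the second listed cause: something is already bound to a port the raylet or an agent wants. A self-contained way to probe a port from Python — the port number below is only an example taken from the reproduction script, and the helper name is ours:

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # SO_REUSEADDR avoids false positives from sockets lingering in TIME_WAIT.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return False  # bind succeeded, so the port was free
        except OSError:
            return True   # EADDRINUSE (or similar): another process holds it

# Example: probe the dashboard port used in the reproduction below (8265).
print("8265 in use:", port_in_use(8265))
```

If the probe reports the port as taken before Ray starts, inspecting what holds it (e.g. a stale raylet from a previous run) is a reasonable next step.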

Versions / Dependencies

- KubeRay Operator v1.1.1
- Ray v2.21.0

Reproduction script

```python
import ray
from ray.job_submission import JobSubmissionClient
import time

# Ray cluster information for connection
ray_head_ip = "kuberay-head-svc.kuberay.svc.cluster.local"
ray_head_port = 8265
ray_address = f"http://{ray_head_ip}:{ray_head_port}"
client = JobSubmissionClient(ray_address)

# Submit Ray job using JobSubmissionClient
job_id = client.submit_job(
    entrypoint="python xg.py",
    runtime_env={"working_dir": "./"},
    entrypoint_num_cpus=3,
)

print(f"Ray job submitted with job_id: {job_id}")

# Wait for Ray to finish the job, then print the result
while True:
    status = client.get_job_status(job_id)
    if status in [ray.job_submission.JobStatus.RUNNING, ray.job_submission.JobStatus.PENDING]:
        time.sleep(5)
    else:
        break
print(client.get_job_logs(job_id))
```

Issue Severity

Low: It annoys or frustrates me.

jjyao commented 3 months ago

Discussing on slack.

anyscalesam commented 3 months ago

Any conclusions here, @jjyao?

meijiesky commented 1 month ago

Any updates on this issue?

anyscalesam commented 1 month ago

@meijiesky we cleaned up our dependencies, and the latest Ray release Docker images should have a safe version floor for grpcio; are you still running into issues?