Open Manish-2004 opened 5 months ago
Discussing on Slack.
Any conclusions here, @jjyao?
Any updates on this issue?
@meijiesky we cleaned up our dependencies, and the Docker images for our latest Ray releases should now have a safe floor on the grpcio version; are you running into issues?
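For reference, a quick way to confirm which grpcio actually ends up alongside Ray (a minimal sketch; assumes it is run inside the Ray head/worker container, e.g. via kubectl exec):

import grpc  # provided by the grpcio package
import ray

# Compare these against the grpcio floor expected by the Ray release in use.
print("ray:", ray.__version__)
print("grpcio:", grpc.__version__)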
What happened + What you expected to happen
After submitting a Ray hyperparameter tuning job using JobSubmissionClient, we get the error below. We were advised to downgrade the grpcio version, but it still gives the same error: https://ray-distributed.slack.com/archives/CNECXMW22/p1716203022795889
(raylet) agent_manager.cc:84: The raylet exited immediately because one Ray agent failed, agent_name = dashboard_agent/1656919085 [repeated 2x across cluster]
(Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
The raylet fate shares with the agent. This can happen because:
- The version of grpcio doesn't follow Ray's requirement. Agent can segfault with the incorrect grpcio version. Check the grpcio version: pip freeze | grep grpcio. [repeated 2x across cluster]
- The agent failed to start because of unexpected error or port conflict. Read the log: cat /tmp/ray/session_latest/logs/{dashboard_agent|runtime_env_agent}.log. You can find the log file structure here: https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#logging-directory-structure. [repeated 2x across cluster]
- The agent is killed by the OS (e.g., out of memory). [repeated 2x across cluster]

Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
[2024-05-22 22:21:12,222 E 89 89] (raylet) logging.cc:101: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): bind: Address already in use [system:98 at external/boost/boost/asio/detail/reactive_socket_service.hpp:161 in function 'bind']
[2024-05-22 22:21:12,307 E 89 89] (raylet) logging.cc:108: Stack trace:
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xbab668) [0x5613a3487668] ray::TerminateHandler()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x1cd1e4) [0x5613a2aa91e4] boost::throw_exception<>()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xc8dc3b) [0x5613a3569c3b] boost::asio::detail::do_throw_error()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x31a9e8) [0x5613a2bf69e8] ray::raylet::Raylet::Raylet()
*** SIGABRT received at time=1716441672 on cpu 2 *** PC: @ 0x7f8d2128b00b (unknown) raise
[2024-05-22 22:21:12,310 E 89 89] (raylet) logging.cc:365: *** SIGABRT received at time=1716441672 on cpu 2 ***
[2024-05-22 22:21:12,310 E 89 89] (raylet) logging.cc:365: PC: @ 0x7f8d2128b00b (unknown) raise
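As the raylet message above suggests, the agent logs usually show why the agent (and therefore the raylet) died. A small sketch for tailing them from inside the head pod (file paths taken from the error message itself):

from pathlib import Path

log_dir = Path("/tmp/ray/session_latest/logs")
for name in ("dashboard_agent.log", "runtime_env_agent.log"):
    path = log_dir / name
    if path.exists():
        # Print only the last 50 lines of each agent log.
        print(f"===== {name} =====")
        print("\n".join(path.read_text().splitlines()[-50:]))
    else:
        print(f"{name} not found under {log_dir}")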
Versions / Dependencies
KubeRay Operator v1.1.1
Ray v2.21.0
Reproduction script
import ray
from ray.job_submission import JobSubmissionClient
import time

# Ray cluster information for connection
ray_head_ip = "kuberay-head-svc.kuberay.svc.cluster.local"
ray_head_port = 8265
ray_address = f"http://{ray_head_ip}:{ray_head_port}"
client = JobSubmissionClient(ray_address)

# Submit Ray job using JobSubmissionClient
job_id = client.submit_job(
    entrypoint="python xg.py",
    runtime_env={
        "working_dir": "./",
    },
    entrypoint_num_cpus=3,
)

print(f"Ray job submitted with job_id: {job_id}")

# Wait for Ray to finish the job and print the result
while True:
    status = client.get_job_status(job_id)
    if status in [ray.job_submission.JobStatus.RUNNING, ray.job_submission.JobStatus.PENDING]:
        time.sleep(5)
    else:
        break

print(client.get_job_logs(job_id))
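If the mismatch is in the job's own environment rather than in the base image, one thing worth experimenting with is pinning grpcio for the job through runtime_env. This is only a sketch: "grpcio==1.60.0" is an illustrative pin, not a known-good version, and it does not change the grpcio used by the raylet's dashboard/runtime-env agents, which come from the container image.

# Sketch: pin grpcio in the submitted job's environment via runtime_env.
# The pinned version below is illustrative only.
job_id = client.submit_job(
    entrypoint="python xg.py",
    runtime_env={
        "working_dir": "./",
        "pip": ["grpcio==1.60.0"],
    },
    entrypoint_num_cpus=3,
)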
Issue Severity
Low: It annoys or frustrates me.