ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] Failed to register worker on a machine with limited CPU affinity. #34457

Open myron0330 opened 1 year ago

myron0330 commented 1 year ago

What happened + What you expected to happen

Script

import os
import ray

os.sched_setaffinity(os.getpid(), set(range(10, os.cpu_count())))
ray.init(include_dashboard=False)

Problem

I have a production machine that must limit each process's CPU core affinity. The process hangs after running the above script, and raises the following error after I kill the process:

core_worker.cc:149: Failed to register worker 01000000ffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory.

Expectation

Run it properly.

Versions / Dependencies

System version: CentOS Linux release 7.9 2009 (Core)
Ray version (pip freeze): 2.0.0

Reproduction script

import os
import ray

os.sched_setaffinity(os.getpid(), set(range(10, os.cpu_count())))
ray.init(include_dashboard=False)
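
For anyone trying to reproduce this, it may help to first confirm what the pinned process is actually allowed to run on. The following is only a diagnostic sketch, not part of the original report: `affinity_report` is a hypothetical helper name, and it relies solely on the standard-library `os.sched_getaffinity` (Linux-only) and `os.cpu_count`.

```python
import os


def affinity_report(pid: int = 0) -> dict:
    """Compare the CPUs a process may run on with the CPUs the machine has.

    `affinity_report` is a hypothetical helper for illustration only.
    pid=0 means "the calling process".
    """
    total = os.cpu_count() or 1
    if hasattr(os, "sched_getaffinity"):
        # Linux: the set of CPU indices this process is restricted to.
        allowed = sorted(os.sched_getaffinity(pid))
    else:
        # Assumption: on platforms without affinity support, treat all CPUs as allowed.
        allowed = list(range(total))
    return {
        "total_cpus": total,
        "allowed_cpus": allowed,
        "restricted": len(allowed) < total,
    }


if __name__ == "__main__":
    print(affinity_report())
```

Calling `affinity_report()` right after `os.sched_setaffinity(...)` and before `ray.init(...)` would show whether the restriction took effect in the driver process.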

Issue Severity

High: It blocks me from completing my task.

myron0330 commented 1 year ago

Some error logs from /tmp/ray/session_latest/logs/raylet.out that may help:

metric_exporter.cc:207: [1] Export metrics to agent failed: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: This won't affect ray, but you can lose metrices from the cluster.

(raylet) client_connection.cc:528: [worker]ProcessMessage with type RegisterClientRequest took 27561ms.
(raylet) agent_manager.cc:115: Agent process expected id 424238335 timed out before registering. ip, id 0.
(raylet) agent_manager.cc:131: Agent process with id 424238335 exited, return value 0. ip . id 0
(raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shared with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See 'dashboard_agent.log' for the root cause.

rickyyx commented 1 year ago

Hey @myron0330 - Could you also share the dashboard_agent.log and dashboard.log file?
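
One possible way to gather those files is to print the last lines of each. This is a rough sketch: `tail_logs` is a hypothetical helper, and it assumes both logs sit in the same `/tmp/ray/session_latest/logs` directory as the `raylet.out` mentioned above.

```python
from collections import deque
from pathlib import Path


def tail_logs(log_dir, names=("dashboard_agent.log", "dashboard.log"), lines=50):
    """Return the last `lines` lines of each named log file.

    `tail_logs` is a hypothetical helper for illustration only.
    Files that do not exist map to None instead of raising.
    """
    result = {}
    for name in names:
        path = Path(log_dir) / name
        if not path.is_file():
            result[name] = None
            continue
        with path.open(errors="replace") as fh:
            # deque with maxlen keeps only the trailing lines while streaming.
            result[name] = list(deque(fh, maxlen=lines))
    return result


if __name__ == "__main__":
    # Assumed location, taken from the raylet.out path quoted in this thread.
    for name, tail in tail_logs("/tmp/ray/session_latest/logs").items():
        print(f"==== {name} ====")
        print("(not found)" if tail is None else "".join(tail))
```

Missing files are reported rather than treated as errors, since the agent may have exited before writing anything.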

stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

jjyao commented 8 months ago

Someone needs to reproduce it on our side. I don't think it's related to cpu affinity.