Open myron0330 opened 1 year ago
Some error logs from /tmp/ray/session_latest/logs/raylet.out that may help:
metric_exporter.cc:207: [1] Export metrics to agent failed: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: This won't affect ray, but you can lose metrices from the cluster.
(raylet) client_connection.cc:528: [worker]ProcessMessage with type RegisterClientRequest took 27561ms.
(raylet) agent_manager.cc:115: Agent process expected id 424238335 timed out before registering. ip, id 0.
(raylet) agent_manager.cc:131: Agent process with id 424238335 exited, return value 0. ip . id 0
(raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shared with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See 'dashboard_agent.log' for the root cause.
Hey @myron0330 - Could you also share the dashboard_agent.log and dashboard.log files?
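For anyone gathering the requested files, here is a minimal sketch that prints the last lines of each log. It assumes the logs sit in the same default directory as the raylet.out path quoted above; `tail_log` and `LOG_DIR` are illustrative names, not part of Ray.

```python
from collections import deque
from pathlib import Path


def tail_log(path, max_lines=50):
    """Return up to the last max_lines lines of a log file, or [] if it is missing."""
    p = Path(path)
    if not p.is_file():
        return []
    with p.open("r", errors="replace") as fh:
        # deque with maxlen keeps only the final max_lines lines while streaming the file
        return [line.rstrip("\n") for line in deque(fh, maxlen=max_lines)]


# Assumption: same directory as the raylet.out path mentioned in this issue.
LOG_DIR = Path("/tmp/ray/session_latest/logs")

for name in ("dashboard_agent.log", "dashboard.log"):
    print(f"===== {name} =====")
    for line in tail_log(LOG_DIR / name):
        print(line)
```

If a file is missing, the helper prints only the header for it, which is itself worth mentioning when reporting back.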
Hi, I'm a bot from the Ray team :)
To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.
If there is no further activity in the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public slack channel.
Someone needs to reproduce it on our side. I don't think it's related to cpu affinity.
What happened + What you expected to happen
Script
Problem
I have a production machine that must limit each process's CPU core affinity. The process hangs after running the above script, and raises an error after I kill the process, as follows:
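To make the affinity restriction concrete when reporting, a small standard-library sketch like the following shows which cores the current process is allowed to use. `allowed_cpus` is a hypothetical helper name, and the fallback branch is an assumption for platforms that lack `os.sched_getaffinity`.

```python
import os


def allowed_cpus():
    """Return the sorted list of CPU ids this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: 0 means "the calling process"
        return sorted(os.sched_getaffinity(0))
    # Assumption: where sched_getaffinity is unavailable, treat all CPUs as allowed.
    return list(range(os.cpu_count() or 1))


cpus = allowed_cpus()
print(f"machine reports {os.cpu_count()} CPUs; this process may use {len(cpus)}: {cpus}")
```

One thing that could be tried is passing `len(cpus)` as `num_cpus` to `ray.init`, so Ray does not assume it has every core on the machine. Whether that has any effect on the hang reported here is not established in this thread.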
Expectation
The script should run to completion without hanging.
Versions / Dependencies
System version: CentOS Linux release 7.9.2009 (Core)
Ray version (pip freeze): 2.0.0
Reproduction script
Issue Severity
High: It blocks me from completing my task.