DmitriGekhtman opened this issue 8 months ago
Can you share the contents of the log file dashboard_agent.log?
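For reference, here is a minimal shell sketch for pulling that log out of a default Ray installation. It assumes the standard /tmp/ray temp dir; adjust ray_log_dir if the cluster was started with a custom --temp-dir.

```shell
# Assumes the default Ray temp dir; adjust if you pass --temp-dir to ray start.
ray_log_dir="${RAY_LOG_DIR:-/tmp/ray/session_latest/logs}"
if [ -f "$ray_log_dir/dashboard_agent.log" ]; then
  tail -n 100 "$ray_log_dir/dashboard_agent.log"   # last lines of the agent log
else
  echo "dashboard_agent.log not found under $ray_log_dir"
fi
```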
Also, you are using a really old version of Ray. Is it possible to try the latest version?
The dashboard agent logs are shown in one of the screenshots in the issue description, under the heading "Dashboard agent logs have no useful info." The last log line is "Get all modules by type: DashboardAgentModule".
We're not easily able to upgrade the Ray version (it requires a thorough cross-team effort to test all functionality that depends on Ray.) I do see that the log message "Agent process expected id xxxxx timed out before registering" no longer exists in upstream master, which suggests that the error would at least manifest differently in a recent Ray version. Based on your understanding of Ray core development history, do you think the issue would be resolved with a version upgrade? Have there been major architectural changes to the dashboard agent and/or Raylet?
We're going to try to start coordinating a version upgrade in the not-distant future, but it'll take a while.
Have there been major architectural changes to the dashboard agent and/or Raylet?
There has been a major update to how the agent process is discovered, prompted by a bug. But I don't know whether that bug exists in your environment.
This typically means the dashboard agent was not ready to start within the timeout. There are a couple of things you can try.
One thing you can also try is using a different gRPC version. Search for "grpc" and "agent" in the repo and see the working versions suggested there.
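As a starting point, a small sketch to report which grpcio version is currently installed, so it can be compared against the versions reported as working in those issues:

```python
# Print the installed grpcio version, if any; useful when comparing against
# versions reported as working in the Ray repo's grpc/agent discussions.
try:
    import grpc
    report = "grpcio version: " + grpc.__version__
except ImportError:
    report = "grpcio is not installed in this environment"
print(report)
```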
P2 until the author takes action and it is confirmed to be a bug
Thanks for the info! Sounds like it's indeed worthwhile for us to update at this point. Updating the Ray version in our monorepo sounds less frightening than fiddling with gRPC versions :)
@DmitriGekhtman have you had a chance to upgrade to the latest Ray?
Not yet, I'll keep you posted.
This happens to me when running in Docker targeting linux/amd64 on an M2 host machine (linux/arm64). Also see this.
@rkooo567 can you share the bug you referred to and the PR that fixed it? Do you mean upgrading the gRPC version is the fix for that bug?
What happened + What you expected to happen
We are occasionally seeing failure of the Ray head node shortly after startup time. For our applications, the issue manifests in the following way.
The application calls ray start --head in a subprocess, then health-checks the GCS by calling ray health-check in a subprocess until success, then calls ray.init and retries ray.init until success. ray.init raises the errors:

"This node has an IP address of x.x.x.x, and Ray expects this IP address to be either the GCS address or one of the Raylet addresses. Connected to GCS at x.x.x.x and found raylets at y.y.y.y but none of these match this node's IP x.x.x.x. Are any of these actually a different IP address for the same node? You might need to provide --node-ip-address to specify the IP address that the head should use when sending to this node."

and

global_state_accessor.cc:390: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
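The startup sequence above can be sketched roughly as follows; the retry counts and delays here are illustrative, not the values our application actually uses:

```python
# Rough sketch of the startup flow: start the head node, poll the GCS with
# `ray health-check`, then retry ray.init until it succeeds.
import subprocess
import time

def wait_until(check, attempts=30, delay=2.0):
    """Retry `check` until it returns True or the attempts run out."""
    for _ in range(attempts):
        if check():
            return True
        time.sleep(delay)
    return False

def gcs_is_healthy():
    # `ray health-check` exits 0 once the GCS is reachable.
    return subprocess.run(["ray", "health-check"]).returncode == 0

def try_init():
    try:
        import ray
        ray.init(address="auto")
        return True
    except Exception:
        # e.g. "This node has an IP address of x.x.x.x, and Ray expects..."
        return False

def start_head_and_connect():
    subprocess.run(["ray", "start", "--head"], check=True)
    if not wait_until(gcs_is_healthy):
        raise RuntimeError("GCS never became healthy")
    if not wait_until(try_init):
        raise RuntimeError("ray.init kept failing")
```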
Digging into the logs appears to indicate that the Raylet died due to death of the dashboard agent:
Dashboard agent logs have no useful info:
The following lines in the Raylet logs are the most interesting:
Some follow-up questions:
Versions / Dependencies
Ray 2.2.0 (Upgrading to latest version is not straightforward for us.)
Reproduction script
The issue is transient -- we don't know how to reproduce it.
Issue Severity
Medium: It is a significant difficulty but I can work around it.