ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] Transient Ray head start failure due to dead dashboard agent #42514

Open DmitriGekhtman opened 8 months ago

DmitriGekhtman commented 8 months ago

What happened + What you expected to happen

We occasionally see the Ray head node fail shortly after startup. For our applications, the issue manifests as follows.

The application calls ray start --head in a subprocess, then health-checks the GCS by calling ray health-check in a subprocess until it succeeds, then calls ray.init and retries it until it succeeds. ray.init raises errors like:

"This node has an IP address of x.x.x.x, and Ray expects this IP address to be either the GCS address or one of the Raylet addresses. Connected to GCS at x.x.x.x and found raylets at y.y.y.y but none of these match this node's IP x.x.x.x. Are any of these actually a different IP address for the same node? You might need to provide --node-ip-address to specify the IP address that the head should use when sending to this node."

and

"global_state_accessor.cc:390: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?"
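For reference, here is a minimal sketch of that flow. The retry counts, delays, and address argument are illustrative, not our exact code:

```python
# Sketch of the startup flow described above (illustrative, not our exact code).
import subprocess
import time

import ray


def start_head_and_connect(max_attempts: int = 30, delay_s: float = 2.0):
    # Start the head node in a subprocess.
    subprocess.run(["ray", "start", "--head"], check=True)

    # Health-check the GCS until it reports healthy.
    for _ in range(max_attempts):
        if subprocess.run(["ray", "health-check"]).returncode == 0:
            break
        time.sleep(delay_s)
    else:
        raise RuntimeError("GCS never became healthy")

    # Retry ray.init until the driver registers successfully.
    last_err = None
    for _ in range(max_attempts):
        try:
            return ray.init(address="auto")
        except Exception as err:  # e.g. the node-IP / GCS-registration errors quoted above
            last_err = err
            time.sleep(delay_s)
    raise RuntimeError("ray.init kept failing") from last_err
```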

Digging into the logs indicates that the Raylet died due to the death of the dashboard agent:

(screenshot of Raylet logs)

Dashboard agent logs have no useful info:

(screenshot of dashboard agent logs)

The following lines in the Raylet logs are the most interesting:

[2024-01-17 23:38:48,160 W 190 190] (raylet) agent_manager.cc:115: Agent process expected id 424238335 timed out before registering. ip , id 0
[2024-01-17 23:38:48,168 I 190 237] (raylet) agent_manager.cc:131: Agent process with id 424238335 exited, exit code 0. ip . id 0

Some follow-up questions:

Versions / Dependencies

Ray 2.2.0 (Upgrading to latest version is not straightforward for us.)

Reproduction script

The issue is transient -- we don't know how to reproduce it.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

rkooo567 commented 8 months ago

Can you share the contents of the dashboard_agent.log log file?

Also, you are using a really old version of Ray. Is it possible to try the latest version?

DmitriGekhtman commented 8 months ago

Dashboard agent logs are shown in one of the screenshots in the issue description, under the heading "Dashboard agent logs have no useful info." The last log line is "Get all modules by type: DashboardAgentModule".

We're not easily able to upgrade the Ray version (it requires a thorough cross-team effort to test all functionality that depends on Ray). I do see that the log message "Agent process expected id xxxxx timed out before registering" no longer exists in upstream master, which suggests that the error would at least manifest differently in a recent Ray version. Based on your understanding of Ray core development history, do you think the issue would be resolved by a version upgrade? Have there been major architectural changes to the dashboard agent and/or Raylet?

DmitriGekhtman commented 7 months ago

We're going to try to start coordinating a version upgrade in the not-distant future, but it'll take a while.

rkooo567 commented 7 months ago

Have there been major architectural changes to the dashboard agent and/or Raylet?

There has been a major update to how the agent process is discovered, prompted by a bug. But I don't know if that bug exists in your environment.

This typically means the dashboard agent did not start and register within the timeout. I think there are a couple of things you can try (one is sketched below).

Another thing you can try is using a different gRPC version. Search for grpc and agent in the repo and check the working versions suggested there.
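For the timeout angle, here is a minimal sketch of raising the agent registration timeout when starting the head. It assumes the internal system-config key agent_register_timeout_ms (the timeout behind the "timed out before registering" raylet message) exists under that name in your Ray version and that your ray start accepts --system-config; verify both against your installed version (e.g. src/ray/common/ray_config_def.h) before relying on it:

```python
# Sketch only: start the Ray head with a longer dashboard-agent registration
# timeout. The config key `agent_register_timeout_ms` and the `--system-config`
# flag are assumptions to verify against your Ray version; 120 s is illustrative.
import json
import subprocess

system_config = {"agent_register_timeout_ms": 120 * 1000}
subprocess.run(
    ["ray", "start", "--head", f"--system-config={json.dumps(system_config)}"],
    check=True,
)
```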

rkooo567 commented 7 months ago

P2 until the author takes action and it is confirmed to be a bug.

DmitriGekhtman commented 7 months ago

Thanks for the info! Sounds like it's indeed worthwhile for us to update at this point. Updating the Ray version in our monorepo sounds less frightening than fiddling with gRPC versions :)

anyscalesam commented 7 months ago

@DmitriGekhtman have you had a chance to upgrade to the latest Ray?

DmitriGekhtman commented 7 months ago

Not yet, I'll keep you posted.

teocns commented 4 months ago

This happens to me when running in Docker targeting linux/amd64 on an M2 host machine (linux/arm64). Also see this.

annyan09023 commented 4 months ago

@rkooo567 can you share the bug you referred to and the PR that fixes it? Do you mean upgrading the gRPC version is the fix for the bug?