Open JadenFiotto-Kaufman opened 4 months ago
Maybe you can refer to https://github.com/ray-project/ray/issues/45179
cc @rkooo567
Besides tuning the heartbeat configs
/// The following are configs for the health check. They are borrowed
/// from k8s health probe (shorturl.at/jmTY3)
/// The delay to send the first health check.
RAY_CONFIG(int64_t, health_check_initial_delay_ms, 5000)
/// The interval between two health check.
RAY_CONFIG(int64_t, health_check_period_ms, 3000)
/// The timeout for a health check.
RAY_CONFIG(int64_t, health_check_timeout_ms, 10000)
/// The threshold to consider a node dead.
RAY_CONFIG(int64_t, health_check_failure_threshold, 5)
can you also check whether the head node can reach the worker node, to rule out a network connectivity problem? You should test this port: [2024-07-10 19:44:49,273 I 173 173] (raylet) grpc_server.cc:134: NodeManager server started, listening on port 44485.
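One way to run the check described above is a plain TCP connect from inside the head container to the worker's NodeManager port. This is only a minimal sketch: the worker address `10.0.0.2` is a hypothetical placeholder, and the port is the one from the raylet log line quoted above, which may differ on each start unless it is pinned. A successful connect only shows the port is reachable; it does not prove the gRPC health check itself would pass.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


if __name__ == "__main__":
    # Hypothetical worker address; replace with the IP the head node should use.
    worker_host = "10.0.0.2"
    # Port taken from the raylet log line quoted in this thread.
    node_manager_port = 44485
    ok = can_connect(worker_host, node_manager_port)
    print(f"{worker_host}:{node_manager_port} reachable: {ok}")
```

If this prints `False` when run on the head node but `True` when run inside the worker container against `127.0.0.1`, the port is most likely not published or is blocked between the two hosts.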
I've got the same error here.
Trying to start the second worker node however, does not. It initially comes up HEALTHY, and then almost immediately is killed. Looking at the Ray dashboard, the state is DEAD and the state message is Unexpected termination: health check failed due to missing too many heartbeats
Were you able to see it from the dashboard? The Ray head node sends heartbeats to the worker node periodically, and if the worker dies immediately, it usually means the network connection from head to worker is not established. As @jjyao said, a likely cause that I've seen before is that the worker node port is not properly configured. See https://docs.ray.io/en/master/ray-core/configure.html#ports-configurations
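For a worker running in Docker, one way to act on this is to pin the worker's ports and publish exactly those ports from the container. The sketch below only builds the two argument lists so they stay in sync; it is not the reporter's actual start script. The head address and all port numbers are hypothetical, and which ports actually need to be reachable in a given setup is described on the linked ports page.

```python
import shlex
from typing import List, Tuple


def fixed_port_commands(
    head_address: str,
    node_manager_port: int,
    object_manager_port: int,
    min_worker_port: int,
    max_worker_port: int,
) -> Tuple[List[str], List[str]]:
    """Return (docker publish flags, ray start command) that use the same ports."""
    if min_worker_port > max_worker_port:
        raise ValueError("min_worker_port must not exceed max_worker_port")
    worker_range = f"{min_worker_port}-{max_worker_port}"
    # Publish each pinned port 1:1 so the head can reach it on the host.
    docker_flags = [
        "-p", f"{node_manager_port}:{node_manager_port}",
        "-p", f"{object_manager_port}:{object_manager_port}",
        "-p", f"{worker_range}:{worker_range}",
    ]
    # Pin the same ports on the raylet instead of letting Ray pick random ones.
    ray_cmd = [
        "ray", "start",
        f"--address={head_address}",
        f"--node-manager-port={node_manager_port}",
        f"--object-manager-port={object_manager_port}",
        f"--min-worker-port={min_worker_port}",
        f"--max-worker-port={max_worker_port}",
    ]
    return docker_flags, ray_cmd


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    flags, cmd = fixed_port_commands("head-host:6379", 6380, 6381, 10002, 10020)
    print("docker run", shlex.join(flags), "...")
    print(shlex.join(cmd))
```

Running the containers with host networking instead would avoid the port mapping entirely, at the cost of less isolation.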
What happened + What you expected to happen
I am trying to run a two node cluster using Ray, both in Docker containers. The head node works great, I can see the dashboard as well as all of my deployments as healthy.
Starting the second worker node, however, does not work. It initially comes up HEALTHY, and then is killed almost immediately. Looking at the Ray dashboard, the state is DEAD and the state message is: Unexpected termination: health check failed due to missing too many heartbeats
Under the Overview Event log I see:
From what I can tell, memory isn't an issue and these machines have a fast network connection to each other. I have also tried another machine for the worker node and got the same result.
I see some old issues involving setting a higher number of heartbeat timeouts like:
--system-config "{\"num_heartbeats_timeout\":300000000000}"
However, it seems num_heartbeats_timeout is no longer an option? Is there an updated version of this I can use?
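A possible updated form, using the health_check_* settings quoted elsewhere in this thread in place of num_heartbeats_timeout, is sketched below. It only builds the `--system-config` argument as valid JSON. That Ray 2.31 accepts these keys through `--system-config` on the head node is an assumption here, not something confirmed in this thread, and the helper name `system_config_args` is made up for illustration.

```python
import json
import shlex
from typing import List

# Keys and default values copied from the RAY_CONFIG entries quoted in this thread.
DEFAULTS = {
    "health_check_initial_delay_ms": 5000,
    "health_check_period_ms": 3000,
    "health_check_timeout_ms": 10000,
    "health_check_failure_threshold": 5,
}


def system_config_args(**overrides: int) -> List[str]:
    """Build ['--system-config', '<json>'] with the given health check overrides."""
    unknown = sorted(set(overrides) - set(DEFAULTS))
    if unknown:
        raise KeyError(f"not a known health check setting: {unknown}")
    config = {**DEFAULTS, **overrides}
    return ["--system-config", json.dumps(config)]


if __name__ == "__main__":
    # Example: tolerate more failed checks before the node is marked DEAD.
    args = system_config_args(health_check_failure_threshold=20)
    # shlex.join quotes the JSON so it can be pasted into a shell script.
    print(shlex.join(["ray", "start", "--head", *args]))
```

Raising these values only hides the symptom if the head cannot reach the worker at all, so it is worth ruling out a connectivity problem first.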
Let me know if there are any other logs I can provide that would be helpful.
/tmp/ray/session_latest/logs/raylet.out
Versions / Dependencies
Ray: 2.31.0
Python: 3.10.14
Docker: 26.1.2
VM OS: ubuntu:22.04
Reproduction script
Head start.sh:
Worker start.sh:
Issue Severity
High: It blocks me from completing my task.