ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Ray component: Serve] Worker node is killed after starting with reason of missing too many heartbeat checks #46548

Open JadenFiotto-Kaufman opened 4 months ago

JadenFiotto-Kaufman commented 4 months ago

What happened + What you expected to happen

I am trying to run a two-node Ray cluster, with both nodes in Docker containers. The head node works great: I can see the dashboard, and all of my deployments show as healthy.

Starting the second worker node, however, does not work. It initially comes up HEALTHY and then is killed almost immediately. On the Ray dashboard its state is DEAD with the state message: Unexpected termination: health check failed due to missing too many heartbeats

Under the Overview Event log I see:

The node with node id: 510c08bfbe1181a04dfcf3cc7c256a189ea984a70293efbe2257c20a and address: <address> and node name: <address> has been marked dead because the detector has missed too many heartbeats from it. This can happen when a
    (1) raylet crashes unexpectedly (OOM, etc.)
    (2) raylet has lagging heartbeats due to slow network or busy workload.

From what I can tell, memory isn't an issue and the machines have a fast network connection to each other. I have also tried another machine for the worker node and got the same result.

I see some old issues that suggest raising the heartbeat timeout, e.g.: --system-config "{\"num_heartbeats_timeout\":300000000000}"

However, it seems num_heartbeats_timeout is no longer an option. Is there an updated setting I can use instead?

Let me know if there are any other logs I can provide that would be helpful.

/tmp/ray/session_latest/logs/raylet.out

[2024-07-10 19:44:49,243 I 173 173] (raylet) main.cc:180: Setting cluster ID to: 14ddaf66adbb0c20018a80760985ed1097bc09e405ac8c8feecf767b
[2024-07-10 19:44:49,255 I 173 173] (raylet) main.cc:285: Raylet is not set to kill unknown children.
[2024-07-10 19:44:49,256 I 173 173] (raylet) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2024-07-10 19:44:49,256 I 173 173] (raylet) main.cc:414: Setting node ID to: 510c08bfbe1181a04dfcf3cc7c256a189ea984a70293efbe2257c20a
[2024-07-10 19:44:49,257 I 173 173] (raylet) store_runner.cc:32: Allowing the Plasma store to use up to 20GB of memory.
[2024-07-10 19:44:49,257 I 173 173] (raylet) store_runner.cc:48: Starting object store with directory /dev/shm, fallback /tmp/ray, and huge page support disabled
[2024-07-10 19:44:49,258 I 173 206] (raylet) dlmalloc.cc:154: create_and_mmap_buffer(20000014344, /dev/shm/plasmaXXXXXX)
[2024-07-10 19:44:49,261 I 173 206] (raylet) store.cc:564: ========== Plasma store: =================
Current usage: 0 / 20 GB
- num bytes created total: 0
0 pending objects of total size 0MB
- objects spillable: 0
- bytes spillable: 0
- objects unsealed: 0
- bytes unsealed: 0
- objects in use: 0
- bytes in use: 0
- objects evictable: 0
- bytes evictable: 0

- objects created by worker: 0
- bytes created by worker: 0
- objects restored: 0
- bytes restored: 0
- objects received: 0
- bytes received: 0
- objects errored: 0
- bytes errored: 0

[2024-07-10 19:44:49,267 I 173 173] (raylet) grpc_server.cc:134: ObjectManager server started, listening on port 46303.
[2024-07-10 19:44:49,272 I 173 173] (raylet) worker_killing_policy.cc:101: Running GroupByOwner policy.
[2024-07-10 19:44:49,272 I 173 173] (raylet) memory_monitor.cc:47: MemoryMonitor initialized with usage threshold at 2055270039552 bytes (0.95 system memory), total system memory bytes: 2163442143232
[2024-07-10 19:44:49,272 I 173 173] (raylet) node_manager.cc:287: Initializing NodeManager node_id=510c08bfbe1181a04dfcf3cc7c256a189ea984a70293efbe2257c20a
[2024-07-10 19:44:49,273 I 173 173] (raylet) grpc_server.cc:134: NodeManager server started, listening on port 44485.
[2024-07-10 19:44:49,281 I 173 252] (raylet) agent_manager.cc:77: Monitor agent process with name dashboard_agent/1075256796
[2024-07-10 19:44:49,281 I 173 173] (raylet) event.cc:234: Set ray event level to warning
[2024-07-10 19:44:49,282 I 173 173] (raylet) event.cc:342: Ray Event initialized for RAYLET
[2024-07-10 19:44:49,282 I 173 254] (raylet) agent_manager.cc:77: Monitor agent process with name runtime_env_agent
[2024-07-10 19:44:49,285 I 173 173] (raylet) raylet.cc:134: Raylet of id, 510c08bfbe1181a04dfcf3cc7c256a189ea984a70293efbe2257c20a started. Raylet consists of node_manager and object_manager. node_manager address: 192.168.16.2:44485 object_manager address: 192.168.16.2:46303 hostname: 0f9e9d796fc4
[2024-07-10 19:44:49,286 I 173 173] (raylet) node_manager.cc:525: [state-dump] NodeManager:
[state-dump] Node ID: 510c08bfbe1181a04dfcf3cc7c256a189ea984a70293efbe2257c20a
[state-dump] Node name: <address>
[state-dump] InitialConfigResources: {CPU: 1920000, cuda_memory_MB: 6797930000, accelerator_type:A100: 10000, object_store_memory: 200000000000000, node:192.168.16.2: 10000, memory: 21431864872960000, GPU: 80000}
[state-dump] ClusterTaskManager:
[state-dump] ========== Node: 510c08bfbe1181a04dfcf3cc7c256a189ea984a70293efbe2257c20a =================
[state-dump] Infeasible queue length: 0
[state-dump] Schedule queue length: 0
[state-dump] Dispatch queue length: 0
[state-dump] num_waiting_for_resource: 0
[state-dump] num_waiting_for_plasma_memory: 0
[state-dump] num_waiting_for_remote_node_resources: 0
[state-dump] num_worker_not_started_by_job_config_not_exist: 0
[state-dump] num_worker_not_started_by_registration_timeout: 0
[state-dump] num_tasks_waiting_for_workers: 0
[state-dump] num_cancelled_tasks: 0
[state-dump] cluster_resource_scheduler state: 
[state-dump] Local id: -788914548151358435 Local resources: {"total":{CPU: [1920000], GPU: [10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000], memory: [21431864872960000], node:192.168.16.2: [10000], object_store_memory: [200000000000000], accelerator_type:A100: [10000], cuda_memory_MB: [6797930000]}}, "available": {CPU: [1920000], GPU: [10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000], memory: [21431864872960000], node:192.168.16.2: [10000], object_store_memory: [200000000000000], accelerator_type:A100: [10000], cuda_memory_MB: [6797930000]}}, "labels":{"ray.io/node_id":"510c08bfbe1181a04dfcf3cc7c256a189ea984a70293efbe2257c20a",} is_draining: 0 is_idle: 1 Cluster resources: node id: -788914548151358435{"total":{CPU: 1920000, GPU: 80000, memory: 21431864872960000, node:192.168.16.2: 10000, object_store_memory: 200000000000000, accelerator_type:A100: 10000, cuda_memory_MB: 6797930000}}, "available": {CPU: 1920000, GPU: 80000, memory: 21431864872960000, node:192.168.16.2: 10000, object_store_memory: 200000000000000, accelerator_type:A100: 10000, cuda_memory_MB: 6797930000}}, "labels":{"ray.io/node_id":"510c08bfbe1181a04dfcf3cc7c256a189ea984a70293efbe2257c20a",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1} { "placment group locations": [], "node to bundles": []}
[state-dump] Waiting tasks size: 0
[state-dump] Number of executing tasks: 0
[state-dump] Number of pinned task arguments: 0
[state-dump] Number of total spilled tasks: 0
[state-dump] Number of spilled waiting tasks: 0
[state-dump] Number of spilled unschedulable tasks: 0
[state-dump] Resource usage {
[state-dump] }
[state-dump] Running tasks by scheduling class:
[state-dump] ==================================================
[state-dump] 
[state-dump] ClusterResources:
[state-dump] LocalObjectManager:
[state-dump] - num pinned objects: 0
[state-dump] - pinned objects size: 0
[state-dump] - num objects pending restore: 0
[state-dump] - num objects pending spill: 0
[state-dump] - num bytes pending spill: 0
[state-dump] - num bytes currently spilled: 0
[state-dump] - cumulative spill requests: 0
[state-dump] - cumulative restore requests: 0
[state-dump] - spilled objects pending delete: 0
[state-dump] 
[state-dump] ObjectManager:
[state-dump] - num local objects: 0
[state-dump] - num unfulfilled push requests: 0
[state-dump] - num object pull requests: 0
[state-dump] - num chunks received total: 0
[state-dump] - num chunks received failed (all): 0
[state-dump] - num chunks received failed / cancelled: 0
[state-dump] - num chunks received failed / plasma error: 0
[state-dump] Event stats:
[state-dump] Global stats: 0 total (0 active)
[state-dump] Queueing time: mean = -nan s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] Execution time:  mean = -nan s, total = 0.000 s
[state-dump] Event stats:
[state-dump] PushManager:
[state-dump] - num pushes in flight: 0
[state-dump] - num chunks in flight: 0
[state-dump] - num chunks remaining: 0
[state-dump] - max chunks allowed: 409
[state-dump] OwnershipBasedObjectDirectory:
[state-dump] - num listeners: 0
[state-dump] - cumulative location updates: 0
[state-dump] - num location updates per second: 0.000
[state-dump] - num location lookups per second: 0.000
[state-dump] - num locations added per second: 0.000
[state-dump] - num locations removed per second: 0.000
[state-dump] BufferPool:
[state-dump] - create buffer state map size: 0
[state-dump] PullManager:
[state-dump] - num bytes available for pulled objects: 20000000000
[state-dump] - num bytes being pulled (all): 0
[state-dump] - num bytes being pulled / pinned: 0
[state-dump] - get request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - wait request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - task request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - first get request bundle: N/A
[state-dump] - first wait request bundle: N/A
[state-dump] - first task request bundle: N/A
[state-dump] - num objects queued: 0
[state-dump] - num objects actively pulled (all): 0
[state-dump] - num objects actively pulled / pinned: 0
[state-dump] - num bundles being pulled: 0
[state-dump] - num pull retries: 0
[state-dump] - max timeout seconds: 0
[state-dump] - max timeout request is already processed. No entry.
[state-dump] 
[state-dump] WorkerPool:
[state-dump] - registered jobs: 0
[state-dump] - process_failed_job_config_missing: 0
[state-dump] - process_failed_rate_limited: 0
[state-dump] - process_failed_pending_registration: 0
[state-dump] - process_failed_runtime_env_setup_failed: 0
[state-dump] - num PYTHON workers: 0
[state-dump] - num PYTHON drivers: 0
[state-dump] - num object spill callbacks queued: 0
[state-dump] - num object restore queued: 0
[state-dump] - num util functions queued: 0
[state-dump] - num idle workers: 0
[state-dump] TaskDependencyManager:
[state-dump] - task deps map size: 0
[state-dump] - get req map size: 0
[state-dump] - wait req map size: 0
[state-dump] - local objects map size: 0
[state-dump] WaitManager:
[state-dump] - num active wait requests: 0
[state-dump] Subscriber:
[state-dump] Channel WORKER_REF_REMOVED_CHANNEL
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] Channel WORKER_OBJECT_EVICTION
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] Channel WORKER_OBJECT_LOCATIONS_CHANNEL
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] num async plasma notifications: 0
[state-dump] Remote node managers: 
[state-dump] Event stats:
[state-dump] Global stats: 27 total (13 active)
[state-dump] Queueing time: mean = 1.283 ms, max = 9.755 ms, min = 10.647 us, total = 34.634 ms
[state-dump] Execution time:  mean = 1.204 ms, total = 32.495 ms
[state-dump] Event stats:
[state-dump]    PeriodicalRunner.RunFnPeriodically - 11 total (2 active, 1 running), Execution time: mean = 137.231 us, total = 1.510 ms, Queueing time: mean = 3.146 ms, max = 9.755 ms, min = 24.802 us, total = 34.609 ms
[state-dump]    NodeManager.deadline_timer.spill_objects_when_over_threshold - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump]    MemoryMonitor.CheckIsMemoryUsageAboveThreshold - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump]    NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump]    NodeManager.GCTaskFailureReason - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump]    ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump]    ray::rpc::NodeInfoGcsService.grpc_client.RegisterNode.OnReplyReceived - 1 total (0 active), Execution time: mean = 210.221 us, total = 210.221 us, Queueing time: mean = 10.647 us, max = 10.647 us, min = 10.647 us, total = 10.647 us
[state-dump]    ray::rpc::NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), Execution time: mean = 3.032 ms, total = 3.032 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump]    NodeManager.ScheduleAndDispatchTasks - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump]    NodeManager.deadline_timer.record_metrics - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump]    ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), Execution time: mean = 495.271 us, total = 495.271 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump]    ClusterResourceManager.ResetRemoteNodeView - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump]    ray::rpc::NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), Execution time: mean = 700.981 us, total = 700.981 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump]    ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch.OnReplyReceived - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump]    ray::rpc::NodeInfoGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 26.548 ms, total = 26.548 ms, Queueing time: mean = 14.421 us, max = 14.421 us, min = 14.421 us, total = 14.421 us
[state-dump]    RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump]    NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] DebugString() time ms: 1
[state-dump] 
[state-dump] 
[2024-07-10 19:44:49,288 I 173 173] (raylet) accessor.cc:668: Received notification for node id = 510c08bfbe1181a04dfcf3cc7c256a189ea984a70293efbe2257c20a, IsAlive = 1
[2024-07-10 19:44:49,288 I 173 173] (raylet) accessor.cc:668: Received notification for node id = 23495ee763d81514c8d7107770bd9d31c2470b500e1d4e95e4afd7bc, IsAlive = 1
[2024-07-10 19:44:49,290 I 173 173] (raylet) node_manager.cc:610: New job has started. Job id 07000000 Driver pid 677 is dead: 0 driver address: 172.21.0.3
[2024-07-10 19:44:49,290 I 173 173] (raylet) node_manager.cc:610: New job has started. Job id 04000000 Driver pid 537 is dead: 0 driver address: 172.21.0.3
[2024-07-10 19:44:49,290 I 173 173] (raylet) node_manager.cc:610: New job has started. Job id 05000000 Driver pid 607 is dead: 0 driver address: 172.21.0.3
[2024-07-10 19:44:49,290 I 173 173] (raylet) node_manager.cc:610: New job has started. Job id 0c000000 Driver pid 1246 is dead: 0 driver address: 172.21.0.3
[2024-07-10 19:44:49,290 I 173 173] (raylet) node_manager.cc:610: New job has started. Job id 02000000 Driver pid 500 is dead: 0 driver address: 172.21.0.3
[2024-07-10 19:44:49,290 I 173 173] (raylet) node_manager.cc:610: New job has started. Job id 03000000 Driver pid 572 is dead: 0 driver address: 172.21.0.3
[2024-07-10 19:44:49,290 I 173 173] (raylet) node_manager.cc:610: New job has started. Job id 0a000000 Driver pid 1079 is dead: 0 driver address: 172.21.0.3
[2024-07-10 19:44:49,290 I 173 173] (raylet) node_manager.cc:610: New job has started. Job id 0b000000 Driver pid 1158 is dead: 0 driver address: 172.21.0.3
[2024-07-10 19:44:49,290 I 173 173] (raylet) node_manager.cc:610: New job has started. Job id 09000000 Driver pid 1040 is dead: 0 driver address: 172.21.0.3
[2024-07-10 19:44:49,290 I 173 173] (raylet) node_manager.cc:610: New job has started. Job id 01000000 Driver pid 223 is dead: 0 driver address: 172.21.0.3
[2024-07-10 19:44:49,290 I 173 173] (raylet) node_manager.cc:610: New job has started. Job id 06000000 Driver pid 642 is dead: 0 driver address: 172.21.0.3
[2024-07-10 19:44:49,290 I 173 173] (raylet) node_manager.cc:610: New job has started. Job id 0d000000 Driver pid 1281 is dead: 0 driver address: 172.21.0.3
[2024-07-10 19:44:49,290 I 173 173] (raylet) node_manager.cc:610: New job has started. Job id 08000000 Driver pid 793 is dead: 0 driver address: 172.21.0.3
[2024-07-10 19:45:06,287 I 173 173] (raylet) accessor.cc:668: Received notification for node id = 510c08bfbe1181a04dfcf3cc7c256a189ea984a70293efbe2257c20a, IsAlive = 0
[2024-07-10 19:45:06,303 C 173 173] (raylet) node_manager.cc:1041: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS failed to check the health of this node for 5 times. This is likely because the machine or raylet has become overloaded.
*** StackTrace Information ***
/opt/conda/envs/service/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0xbde15a) [0x55868f34315a] ray::operator<<()
/opt/conda/envs/service/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0xbe06d1) [0x55868f3456d1] ray::RayLog::~RayLog()
/opt/conda/envs/service/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0x300ebd) [0x55868ea65ebd] ray::raylet::NodeManager::NodeRemoved()
/opt/conda/envs/service/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0x4dd744) [0x55868ec42744] ray::gcs::NodeInfoAccessor::HandleNotification()
/opt/conda/envs/service/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0x60936c) [0x55868ed6e36c] EventTracker::RecordExecution()
/opt/conda/envs/service/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0x604f4e) [0x55868ed69f4e] std::_Function_handler<>::_M_invoke()
/opt/conda/envs/service/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0x6053c6) [0x55868ed6a3c6] boost::asio::detail::completion_handler<>::do_complete()
/opt/conda/envs/service/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0xcc056b) [0x55868f42556b] boost::asio::detail::scheduler::do_run_one()
/opt/conda/envs/service/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0xcc2af9) [0x55868f427af9] boost::asio::detail::scheduler::run()
/opt/conda/envs/service/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0xcc3012) [0x55868f428012] boost::asio::io_context::run()
/opt/conda/envs/service/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0x1d9c11) [0x55868e93ec11] main
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f45994f9d90]
/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f45994f9e40] __libc_start_main
/opt/conda/envs/service/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0x230de7) [0x55868e995de7]

Versions / Dependencies

Ray: 2.31.0
Python: 3.10.14
Docker: 26.1.2
VM OS: ubuntu:22.04

Reproduction script

Head start.sh:

#!/bin/bash

ray start --head \
    --resources="$resources" \
    --port=6379 \
    --object-manager-port=8076 \
    --include-dashboard=true \
    --dashboard-host=0.0.0.0 \
    --dashboard-port=8265 

serve deploy src/ray/config/ray_config.yml

tail -f /dev/null

Worker start.sh:

#!/bin/bash

ray start --resources="$resources" --address="$RAY_ADDRESS" --block

tail -f /dev/null

Issue Severity

High: It blocks me from completing my task.

yx367563 commented 4 months ago

Maybe you can refer to https://github.com/ray-project/ray/issues/45179

anyscalesam commented 2 months ago

cc @rkooo567

jjyao commented 2 months ago

Besides tuning the heartbeat configs (one way to override them is sketched after the list below):

/// The following are configs for the health check. They are borrowed
/// from k8s health probe (shorturl.at/jmTY3)
/// The delay to send the first health check.
RAY_CONFIG(int64_t, health_check_initial_delay_ms, 5000)
/// The interval between two health check.
RAY_CONFIG(int64_t, health_check_period_ms, 3000)
/// The timeout for a health check.
RAY_CONFIG(int64_t, health_check_timeout_ms, 10000)
/// The threshold to consider a node dead.
RAY_CONFIG(int64_t, health_check_failure_threshold, 5)
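For reference, a minimal sketch of how these could be raised, assuming your Ray version still honors RAY_<config_name> environment-variable overrides (the variables must be set before ray start, on every node; the values below are illustrative only):

#!/bin/bash
# Sketch only: override the GCS health-check settings via environment variables.
# Assumes RAY_<config_name> overrides are honored by this Ray version.
export RAY_health_check_timeout_ms=30000        # allow slower health-check replies
export RAY_health_check_period_ms=3000          # keep the default check interval
export RAY_health_check_failure_threshold=20    # tolerate more consecutive misses

# Then start Ray as usual (head shown; export the same variables on worker nodes too).
ray start --head --port=6379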

Can you also check whether the head node can reach the worker node, to make sure the network connection is not the issue? You should be able to reach this port: [2024-07-10 19:44:49,273 I 173 173] (raylet) grpc_server.cc:134: NodeManager server started, listening on port 44485.
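A quick reachability check from the head container, as a sketch (it assumes the worker's NodeManager is listening on 44485 as in the log line above, and that nc is available in the image):

#!/bin/bash
# Run from the head node / head container.
WORKER_IP=192.168.16.2    # replace with the worker node's address
NODE_MANAGER_PORT=44485   # from the raylet.out line above

# -z: scan only, -v: verbose, -w 3: 3-second timeout
nc -zv -w 3 "$WORKER_IP" "$NODE_MANAGER_PORT" \
  && echo "NodeManager port reachable" \
  || echo "NodeManager port NOT reachable (check Docker networking / firewall rules)"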

theshadow76 commented 2 months ago

I've got the same error here.

rkooo567 commented 1 month ago

Trying to start the second worker node however, does not. It initially comes up HEALTHY , and then almost immediately is killed. Looking at the Ray dashboard, the state is DEAD and the state message is Unexpected termination: health check failed due to missing too many heartbeats

Were you able to see it from the dashboard? The Ray head node sends heartbeats to worker nodes periodically, and if you see an immediate death, it usually means the network path from head to worker is not established. As @jjyao said, the most likely cause I've seen before is that the worker node's ports are not properly configured. See https://docs.ray.io/en/master/ray-core/configure.html#ports-configurations
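For a Docker setup, one possible approach (a sketch, not a confirmed fix for this issue) is to pin the raylet ports explicitly when starting the worker and make sure the containers can reach each other on those ports, e.g. by using host networking or publishing the same ports; the flags below are standard ray start options, but the specific port numbers are arbitrary:

#!/bin/bash
# Worker start with fixed ports so they can be opened/exposed deterministically.
# Port numbers are illustrative; pick a range that is free on your hosts.
ray start \
  --address="$RAY_ADDRESS" \
  --resources="$resources" \
  --node-manager-port=6380 \
  --object-manager-port=8076 \
  --min-worker-port=10002 \
  --max-worker-port=10999 \
  --block

# Run the container with host networking (or publish the same ports with -p), e.g.:
#   docker run --network=host <image>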