ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Ray cluster] Worker node is disappearing after some seconds #45179

Open RahulMR42 opened 5 months ago

RahulMR42 commented 5 months ago

### What happened + What you expected to happen

Head-node `ray status` output before the worker joined:

    Resources
    Usage:
     0.0/128.0 CPU
     0.0/4.0 GPU
     0B/809.12GiB memory
     0B/186.26GiB object_store_memory
    Demands:
     (no resource demands)

- Started the worker node:

    (env) [opc@ol8-ray-node1-a10-2 logs]$ ray start --address='10.0.0.200:6379'
    Local node IP: 10.0.0.181

    Ray runtime started.

    To terminate the Ray runtime, run
      ray stop

- The worker's 2 GPUs were added to the head's pool and the cluster showed 6 GPUs for a few seconds, then the worker node disappeared. Autoscaler log from the head node:

    ======== Autoscaler status: 2024-05-07 14:23:21.809749 ========
    Node status
    Active:
     1 node_7106733461c1761bf2b223541053883fcb1a9bbff7a3aadc6d61e3f3
     1 node_2cab848021d72d5fd7c4572e850ac394fb38597797ddd393f2c33c86
    Pending:
     (no pending nodes)
    Recent failures:
     (no failures)

    Resources
    Usage:
     0.0/188.0 CPU
     0.0/6.0 GPU
     0B/1.11TiB memory
     0B/327.57GiB object_store_memory
    Demands:
     (no resource demands)

    2024-05-07 14:23:21,810 INFO autoscaler.py:470 -- The autoscaler took 0.001 seconds to complete the update iteration.
    2024-05-07 14:23:26,812 INFO autoscaler.py:147 -- The autoscaler took 0.0 seconds to fetch the list of non-terminated nodes.
    2024-05-07 14:23:26,812 INFO load_metrics.py:158 -- LoadMetrics: Removed ip: 2cab848021d72d5fd7c4572e850ac394fb38597797ddd393f2c33c86.
    2024-05-07 14:23:26,812 INFO load_metrics.py:161 -- LoadMetrics: Removed 1 stale ip mappings: {'2cab848021d72d5fd7c4572e850ac394fb38597797ddd393f2c33c86'} not in {'7106733461c1761bf2b223541053883fcb1a9bbff7a3aadc6d61e3f3'}
    2024-05-07 14:23:26,812 INFO autoscaler.py:427 --
    ======== Autoscaler status: 2024-05-07 14:23:26.812551 ========
    Node status
    Active:
     1 node_7106733461c1761bf2b223541053883fcb1a9bbff7a3aadc6d61e3f3
    Pending:
     (no pending nodes)
    Recent failures:
     (no failures)

    Resources
    Usage:
     0.0/128.0 CPU
     0.0/4.0 GPU
     0B/809.12GiB memory
     0B/186.26GiB object_store_memory
    Demands:
     (no resource demands)

    2024-05-07 14:23:26,813 INFO autoscaler.py:470 -- The autoscaler took 0.001 seconds to complete the update iteration.
    2024-05-07 14:23:31,814 INFO autoscaler.py:147 -- The autoscaler took 0.0 seconds to fetch the list of non-terminated nodes.
    2024-05-07 14:23:31,815 INFO autoscaler.py:427 --

- Logs from worker node 1 (tail of `raylet.out`):

    (env) [opc@ol8-ray-node1-a10-2 logs]$ tail - 50 raylet.out
    ==> standard input <==
    ^C
    (env) [opc@ol8-ray-node1-a10-2 logs]$ tail -50 raylet.out
    [state-dump] - cumulative published messages: 0
    [state-dump] - cumulative processed messages: 0
    [state-dump] num async plasma notifications: 0
    [state-dump] Remote node managers:
    [state-dump] Event stats:
    [state-dump] Global stats: 28 total (13 active)
    [state-dump] Queueing time: mean = 1.349 ms, max = 10.637 ms, min = 9.481 us, total = 37.761 ms
    [state-dump] Execution time: mean = 36.839 ms, total = 1.031 s
    [state-dump] Event stats:
    [state-dump] PeriodicalRunner.RunFnPeriodically - 11 total (2 active, 1 running), Execution time: mean = 159.773 us, total = 1.758 ms, Queueing time: mean = 3.430 ms, max = 10.637 ms, min = 20.052 us, total = 37.726 ms
    [state-dump] NodeManager.deadline_timer.record_metrics - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] NodeManager.GCTaskFailureReason - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] ObjectManager.UpdateAvailableMemory - 1 total (0 active), Execution time: mean = 6.574 us, total = 6.574 us, Queueing time: mean = 9.481 us, max = 9.481 us, min = 9.481 us, total = 9.481 us
    [state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), Execution time: mean = 1.329 ms, total = 1.329 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] NodeManager.ScheduleAndDispatchTasks - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch.OnReplyReceived - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] NodeInfoGcsService.grpc_client.RegisterNode.OnReplyReceived - 1 total (0 active), Execution time: mean = 204.942 us, total = 204.942 us, Queueing time: mean = 12.859 us, max = 12.859 us, min = 12.859 us, total = 12.859 us
    [state-dump] RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), Execution time: mean = 1.523 ms, total = 1.523 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), Execution time: mean = 1.930 ms, total = 1.930 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] NodeManager.deadline_timer.spill_objects_when_over_threshold - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] MemoryMonitor.CheckIsMemoryUsageAboveThreshold - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] ClusterResourceManager.ResetRemoteNodeView - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
    [state-dump] NodeInfoGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 1.025 s, total = 1.025 s, Queueing time: mean = 13.001 us, max = 13.001 us, min = 13.001 us, total = 13.001 us
    [state-dump] DebugString() time ms: 0
    [state-dump]
    [state-dump]
    [2024-05-07 14:18:31,433 I 10726 10726] (raylet) accessor.cc:627: Received notification for node id = 7106733461c1761bf2b223541053883fcb1a9bbff7a3aadc6d61e3f3, IsAlive = 1
    [2024-05-07 14:18:31,433 I 10726 10726] (raylet) accessor.cc:627: Received notification for node id = 57c91cddaee2710e6a1aca530ae74b485e9c39fb9c250f5d84454b03, IsAlive = 1
    [2024-05-07 14:18:48,431 I 10726 10726] (raylet) accessor.cc:627: Received notification for node id = 57c91cddaee2710e6a1aca530ae74b485e9c39fb9c250f5d84454b03, IsAlive = 0
    [2024-05-07 14:18:48,448 C 10726 10726] (raylet) node_manager.cc:959: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS failed to check the health of this node for 5 times. This is likely because the machine or raylet has become overloaded.
    *** StackTrace Information ***
    /home/opc/vllm-node/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xb7c22a) [0x55bab965422a] ray::operator<<()
    /home/opc/vllm-node/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xb7d9e7) [0x55bab96559e7] ray::SpdLogMessage::Flush()
    /home/opc/vllm-node/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xb7de87) [0x55bab9655e87] ray::RayLog::~RayLog()
    /home/opc/vllm-node/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x2e6d81) [0x55bab8dbed81] ray::raylet::NodeManager::NodeRemoved()
    /home/opc/vllm-node/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x4a7c67) [0x55bab8f7fc67] ray::gcs::NodeInfoAccessor::HandleNotification()
    /home/opc/vllm-node/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x5ac23e) [0x55bab908423e] EventTracker::RecordExecution()
    /home/opc/vllm-node/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x5a562e) [0x55bab907d62e] std::_Function_handler<>::_M_invoke()
    /home/opc/vllm-node/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x5a5aa6) [0x55bab907daa6] boost::asio::detail::completion_handler<>::do_complete()
    /home/opc/vllm-node/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xc5de2b) [0x55bab9735e2b] boost::asio::detail::scheduler::do_run_one()
    /home/opc/vllm-node/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xc603b9) [0x55bab97383b9] boost::asio::detail::scheduler::run()
    /home/opc/vllm-node/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xc608d2) [0x55bab97388d2] boost::asio::io_context::run()
    /home/opc/vllm-node/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x1cd4ba) [0x55bab8ca54ba] main
    /lib64/libc.so.6(__libc_start_main+0xe5) [0x7f881fcd8e45] __libc_start_main
    /home/opc/vllm-node/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x221ac7) [0x55bab8cf9ac7]

Please let us know: is there an option to increase the heartbeat timeout, or another way to fix this issue?

### Versions / Dependencies

- Ray (same on head and worker):

    (env) [opc@ol8-ray-node1-a10-2 logs]$ ray --version
    ray, version 2.20.0

- Python (same on head and worker):

    (env) [opc@ol8-ray-node1-a10-2 logs]$ python -V
    Python 3.11.5



### Reproduction script

Use Oracle Linux bare-metal A10-4 or A10-2 machines and try adding them to each other as head and worker nodes.
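A minimal sketch of that setup, using the port and IP that appear in the logs above (taken from this report; adjust for your environment):

```bash
# On the first bare-metal machine (the head node):
ray start --head --port=6379

# On the second machine (the worker), pointing at the head's private IP:
ray start --address='10.0.0.200:6379'

# On the head, watch the cluster state; in this report the worker shows up
# and then drops out again within a few seconds:
ray status
```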

### Issue Severity

High: It blocks me from completing my task.
RahulMR42 commented 5 months ago

Any help, please?

RahulMR42 commented 5 months ago

Some more logs when trying with vLLM, within 30 seconds of the worker joining:

(env) (base) [opc@ol8ray-1 ray-vllm]$ python -m vllm.entrypoints.openai.api_server --model Phind/Phind-CodeLlama-34B-v2 --tensor-parallel-size 4
/home/opc/ray-vllm/env/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
2024-05-08 14:22:55,479 INFO worker.py:1564 -- Connecting to existing Ray cluster at address: 10.0.0.204:6379...
2024-05-08 14:22:55,484 INFO worker.py:1740 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265 
(raylet) The node with node id: 0f219779261fab027e79667e6451e061ded81c76f07d05584652d818 and address: 10.0.0.51 and node name: 10.0.0.51 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a   (1) raylet crashes unexpectedly (OOM, preempted node, etc.) 
    (2) raylet has lagging heartbeats due to slow network or busy workload.
(raylet) Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [2024-05-08 14:22:54,732 I 82642 82642] (raylet) accessor.cc:627: Received notification for node id = 0f219779261fab027e79667e6451e061ded81c76f07d05584652d818, IsAlive = 1
    [2024-05-08 14:22:56,471 I 82642 82642] (raylet) node_manager.cc:587: New job has started. Job id 01000000 Driver pid 288832 is dead: 0 driver address: 10.0.0.204
    [2024-05-08 14:23:25,475 I 82642 82642] (raylet) accessor.cc:627: Received notification for node id = 0f219779261fab027e79667e6451e061ded81c76f07d05584652d818, IsAlive = 0
    [2024-05-08 14:23:25,492 C 82642 82642] (raylet) node_manager.cc:959: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS failed to check the health of this node for 5 times. This is likely because the machine or raylet has become overloaded.
    *** StackTrace Information ***
    /home/opc/ray-vllm/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xb7c22a) [0x558f3c8f422a] ray::operator<<()
    /home/opc/ray-vllm/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xb7d9e7) [0x558f3c8f59e7] ray::SpdLogMessage::Flush()
    /home/opc/ray-vllm/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xb7de87) [0x558f3c8f5e87] ray::RayLog::~RayLog()
    /home/opc/ray-vllm/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x2e6d81) [0x558f3c05ed81] ray::raylet::NodeManager::NodeRemoved()
    /home/opc/ray-vllm/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x4a7c67) [0x558f3c21fc67] ray::gcs::NodeInfoAccessor::HandleNotification()
    /home/opc/ray-vllm/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x5ac23e) [0x558f3c32423e] EventTracker::RecordExecution()
    /home/opc/ray-vllm/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x5a562e) [0x558f3c31d62e] std::_Function_handler<>::_M_invoke()
    /home/opc/ray-vllm/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x5a5aa6) [0x558f3c31daa6] boost::asio::detail::completion_handler<>::do_complete()
    /home/opc/ray-vllm/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xc5de2b) [0x558f3c9d5e2b] boost::asio::detail::scheduler::do_run_one()
    /home/opc/ray-vllm/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xc603b9) [0x558f3c9d83b9] boost::asio::detail::scheduler::run()
    /home/opc/ray-vllm/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xc608d2) [0x558f3c9d88d2] boost::asio::io_context::run()
    /home/opc/ray-vllm/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x1cd4ba) [0x558f3bf454ba] main
    /lib64/libc.so.6(__libc_start_main+0xe5) [0x7fd8e83ace45] __libc_start_main
    /home/opc/ray-vllm/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x221ac7) [0x558f3bf99ac7]

(raylet, ip=10.0.0.51) [2024-05-08 14:23:25,492 C 82642 82642] (raylet) node_manager.cc:959: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS failed to check the health of this node for 5 times. This is likely because the machine or raylet has become overloaded.
(raylet, ip=10.0.0.51) *** StackTrace Information ***
(raylet, ip=10.0.0.51) /home/opc/ray-vllm/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xb7c22a) [0x558f3c8f422a] ray::operator<<()
(raylet, ip=10.0.0.51) /home/opc/ray-vllm/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xb7d9e7) [0x558f3c8f59e7] ray::SpdLogMessage::Flush()
(raylet, ip=10.0.0.51) /home/opc/ray-vllm/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xb7de87) [0x558f3c8f5e87] ray::RayLog::~RayLog()
(raylet, ip=10.0.0.51) /home/opc/ray-vllm/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x2e6d81) [0x558f3c05ed81] ray::raylet::NodeManager::NodeRemoved()
(raylet, ip=10.0.0.51) /home/opc/ray-vllm/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x4a7c67) [0x558f3c21fc67] ray::gcs::NodeInfoAccessor::HandleNotification()
(raylet, ip=10.0.0.51) /home/opc/ray-vllm/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x5ac23e) [0x558f3c32423e] EventTracker::RecordExecution()
(raylet, ip=10.0.0.51) /home/opc/ray-vllm/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x5a562e) [0x558f3c31d62e] std::_Function_handler<>::_M_invoke()
(raylet, ip=10.0.0.51) /home/opc/ray-vllm/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x5a5aa6) [0x558f3c31daa6] boost::asio::detail::completion_handler<>::do_complete()
(raylet, ip=10.0.0.51) /home/opc/ray-vllm/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xc5de2b) [0x558f3c9d5e2b] boost::asio::detail::scheduler::do_run_one()
(raylet, ip=10.0.0.51) /home/opc/ray-vllm/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xc603b9) [0x558f3c9d83b9] boost::asio::detail::scheduler::run()
(raylet, ip=10.0.0.51) /home/opc/ray-vllm/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xc608d2) [0x558f3c9d88d2] boost::asio::io_context::run()
(raylet, ip=10.0.0.51) /home/opc/ray-vllm/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x1cd4ba) [0x558f3bf454ba] main
(raylet, ip=10.0.0.51) /lib64/libc.so.6(__libc_start_main+0xe5) [0x7fd8e83ace45] __libc_start_main
(raylet, ip=10.0.0.51) /home/opc/ray-vllm/env/lib64/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x221ac7) [0x558f3bf99ac7]
jjyao commented 4 months ago

Hi @RahulMR42, it seems the worker node is marked dead by the GCS due to missing heartbeats. Could you check the GCS log on the head node to see what it says? Also, do you have a good network connection between the head node and the worker nodes? You can also tune the heartbeat configs:

    /// The timeout for a health check.
    RAY_CONFIG(int64_t, health_check_timeout_ms, 10000)
    /// The threshold to consider a node dead.
    RAY_CONFIG(int64_t, health_check_failure_threshold, 5)

e.g., by setting the `RAY_health_check_timeout_ms=20000` environment variable before starting Ray.
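For anyone hitting the same symptom, a minimal sketch of that suggestion (the log path assumes Ray's default `/tmp/ray` temp directory; the values are illustrative, not a recommendation):

```bash
# On the head node: see what the GCS health checker logged around the failure.
grep -iE "health|dead" /tmp/ray/session_latest/logs/gcs_server.out | tail -n 20

# Relax the health-check settings. Export these before `ray start`, on the head
# (where the GCS runs) and, to be safe, on the workers as well:
export RAY_health_check_timeout_ms=20000       # per-check timeout; default 10000 (see snippet above)
export RAY_health_check_failure_threshold=10   # missed checks before a node is marked dead; default 5

ray start --head --port=6379            # on the head node
ray start --address='10.0.0.200:6379'   # on each worker node
```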

thiswillbeyourgithub commented 3 months ago

I think I have the same issue. More importantly, I note that `RAY_health_check_timeout_ms` does not appear anywhere in the docs, and I can't find the list of supported environment variables to try to debug this.
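For what it's worth, these knobs come from Ray's internal config rather than the public docs: each `RAY_CONFIG(type, name, default)` entry in `src/ray/common/ray_config_def.h` appears to be overridable by exporting `RAY_<name>` before starting Ray, which is the mechanism the suggestion above relies on. A rough way to list the available names from the source tree (the regex is a heuristic and may skip multi-line entries):

```bash
# Shallow-clone the Ray repo and list config names defined via the RAY_CONFIG macro.
git clone --depth 1 https://github.com/ray-project/ray.git
grep -oP 'RAY_CONFIG\(\s*[\w:]+\s*,\s*\K\w+' ray/src/ray/common/ray_config_def.h | sort
```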

thiswillbeyourgithub commented 3 months ago

Addendum: your comment led me to this page of code, so I decided to try maxing all of them out.

This is the script I run on my powerful remote machine via SSH:

    # on my powerful remote server:
    LARGE_N=99999999999
    export RAY_health_check_timeout_ms=$LARGE_N
    export RAY_grpc_keepalive_time_ms=$LARGE_N
    export RAY_grpc_client_keepalive_time_ms=$LARGE_N
    export RAY_grpc_client_keepalive_timeout_ms=$LARGE_N
    export RAY_health_check_initial_delay_ms=$LARGE_N
    export RAY_health_check_period_ms=$LARGE_N
    export RAY_health_check_timeout_ms=$LARGE_N
    export RAY_health_check_failure_threshold=10

    env | grep RAY  # to check that the env vars are indeed set

    ray start --block --head --disable-usage-stats --verbose --dashboard-host 0.0.0.0

Then, on my local machine, I ran `ray start`, and for the first time it didn't crash after a few seconds.

So I did a bisection search, and it seems that either of these environment variables fixes the issue:

    export RAY_health_check_initial_delay_ms=$LARGE_N
    export RAY_health_check_period_ms=$LARGE_N
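If someone wants a less drastic variant of this workaround: instead of effectively disabling the health checks with `$LARGE_N`, it may be enough to relax just the two variables the bisection pointed at. A sketch with illustrative values, exported on the head node (where the GCS health checker runs) before starting Ray:

```bash
export RAY_health_check_initial_delay_ms=60000   # wait 60 s before the first health check
export RAY_health_check_period_ms=30000          # then probe each node every 30 s
ray start --head --disable-usage-stats --dashboard-host 0.0.0.0
```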
lunapapa-finland commented 2 months ago

I am experiencing the same issue, and adding `export RAY_health_check_initial_delay_ms=$LARGE_N` and `export RAY_health_check_period_ms=$LARGE_N` does not completely solve the problem. Is there any follow-up on this issue? It is not related to the network, because the machines are connected within the same LAN.

All the worker nodes crash after less than 10 seconds, without running any tasks yet.
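One way to narrow this down further (a sketch, assuming Ray's default `/tmp/ray` temp directory): capture both sides of the health check at the moment a worker is dropped.

```bash
# On the head node: what the GCS saw when it declared the node dead.
grep -iE "health|dead|timed out" /tmp/ray/session_latest/logs/gcs_server.out | tail -n 50

# On the dropped worker: why its raylet exited.
tail -n 100 /tmp/ray/session_latest/logs/raylet.out
```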

anyscalesam commented 3 weeks ago

@lunapapa-finland what Ray version are you on?