ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[core] chaos_many_actors failing due to node crash, GCS unreachable #28562

Closed: stephanie-wang closed this issue 1 year ago

stephanie-wang commented 2 years ago

What happened + What you expected to happen

command_scd_W8xe3jzdaCEE1qykk7UwiMEk.log

Nodes dying:

(raylet, ip=172.31.84.216) [2022-09-15 09:58:43,756 C 3288 3288] (raylet) node_manager.cc:999: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS didn't receive heartbeats from this node for 30000 ms. This is likely because the machine or raylet has become overloaded.
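
For context, the raylet above is exiting because the GCS-side failure detector had not seen a heartbeat from it for 30 s. A simplified Python sketch of that detector (this is not the actual GCS code; the 30 000 ms threshold is taken from the log line above):

```python
import time

NUM_HEARTBEATS_TIMEOUT_MS = 30_000  # threshold quoted in the log line above

# node_id -> timestamp (ms) of the last heartbeat the GCS received
last_heartbeat_ms = {}

def on_heartbeat(node_id: str) -> None:
    last_heartbeat_ms[node_id] = time.monotonic() * 1000

def detect_dead_nodes() -> list:
    """Called periodically by the GCS; returns node ids to mark as DEAD."""
    now_ms = time.monotonic() * 1000
    dead = [n for n, ts in last_heartbeat_ms.items()
            if now_ms - ts > NUM_HEARTBEATS_TIMEOUT_MS]
    # The GCS then publishes a node-dead notification; a raylet that sees
    # itself in that notification exits with the node_manager.cc:999 fatal
    # shown above, even if it was only overloaded rather than truly dead.
    return dead
```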

And later the GCS crashing or becoming unreachable:

Chaos test: 100%|██████████| 86400/86400 [26:48<00:00, 53.70it/s] [2022-09-15 10:14:56,511 C 1168 1246] gcs_rpc_client.h:537:  Check failed: absl::ToInt64Seconds(absl::Now() - gcs_last_alive_time_) < ::RayConfig::instance().gcs_rpc_server_reconnect_timeout_s() Failed to connect to GCS within 60 seconds
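
For reference, the failing check in gcs_rpc_client.h:537 is essentially a client-side watchdog: once a process has not heard from the GCS for gcs_rpc_server_reconnect_timeout_s (60 s here), it aborts instead of retrying forever. A minimal Python sketch of that idea (field and config names mirror the log message; this is not the real implementation):

```python
import time

# Mirrors RayConfig::gcs_rpc_server_reconnect_timeout_s from the log above.
GCS_RPC_SERVER_RECONNECT_TIMEOUT_S = 60

class GcsWatchdog:
    """Sketch of the liveness check that produced the crash above."""

    def __init__(self):
        # Refreshed whenever any GCS RPC succeeds.
        self.gcs_last_alive_time = time.monotonic()

    def on_rpc_success(self) -> None:
        self.gcs_last_alive_time = time.monotonic()

    def on_rpc_failure(self) -> None:
        elapsed = time.monotonic() - self.gcs_last_alive_time
        # Equivalent of the failed check: if the GCS has been unreachable
        # for longer than the reconnect timeout, crash the process.
        assert elapsed < GCS_RPC_SERVER_RECONNECT_TIMEOUT_S, (
            "Failed to connect to GCS within "
            f"{GCS_RPC_SERVER_RECONNECT_TIMEOUT_S} seconds")
```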

Versions / Dependencies

Good commit: 7ba37885c67844d2eb18a63dcf3b7ac7f66ce89f
Bad commit: b878fd28d25f17d5bf23563f73c4ab19b6ab625e

Reproduction script

-

Issue Severity

No response

stephanie-wang commented 2 years ago

As far as I can tell, no core changes were made between these commits, but it has now failed twice in a row.

rickyyx commented 1 year ago

Rerunning it here: https://buildkite.com/ray-project/release-tests-branch/builds/1004#0183472b-36a9-4e93-bc1c-fa7c3a0ae5bb

rickyyx commented 1 year ago

The rerun failed as well. Running it manually here so that I can get access to the GCS logs: https://console.anyscale.com/o/anyscale-internal/projects/prj_FKRmeV5pA6X72aVscFALNC32/clusters/ses_hwhsTD54MJmfFjY4Q4jjW72Q?command-history-section=command_history

rickyyx commented 1 year ago

Still haven't been able to pin down the root cause yet. Essentially, the test gets stuck in a weird state with the GCS still alive:

1. Repeated reports of the raylet on the same dead node being marked dead

2022-09-16 15:06:49,327 WARNING worker.py:1828 -- The node with node id: 9773b3cedec83f2a48b6cfe683138f4a32c652534eafc06355c6cfb6 and address: 172.31.78.144 and node name: 172.31.78.144 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a
        (1) raylet crashes unexpectedly (OOM, preempted node, etc.)
        (2) raylet has lagging heartbeats due to slow network or busy workload.
[2022-09-16 13:13:46,502 C 2116 2116] (raylet) node_manager.cc:999: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS didn't receive heartbeats from this node for 30000 ms. This is likely because the machine or raylet has become overloaded.

2. Multiple raylet instances on the dead node

(Screenshot: multiple raylet processes running on the dead node.)

3. The first raylet on the dead node received ShutdownRaylet while the others did not

/----------------/ raylet.out /----------------/

(base) ray@ip-172-31-78-144:/tmp/ray/session_latest/logs$ tail raylet.out 
[state-dump]    AgentManagerService.grpc_server.RegisterAgent - 1 total (0 active), CPU time: mean = 170.201 us, total = 170.201 us
[state-dump] DebugString() time ms: 0
[state-dump] 
[state-dump] 
[2022-09-16 13:11:26,005 I 134 134] (raylet) node_manager.cc:661: Sending Python GC request to 16 local workers to clean up Python cyclic references.
[2022-09-16 13:11:36,090 I 134 134] (raylet) node_manager.cc:661: Sending Python GC request to 16 local workers to clean up Python cyclic references.
[2022-09-16 13:11:39,852 I 134 134] (raylet) accessor.cc:601: Notification for addition of a node that was already removed:f3b1fb42a3eade0d548408eb441bfae6ae0b70d7ce202706ccf81f6a
[2022-09-16 13:11:46,098 I 134 134] (raylet) node_manager.cc:661: Sending Python GC request to 16 local workers to clean up Python cyclic references.
[2022-09-16 13:11:56,177 I 134 134] (raylet) node_manager.cc:661: Sending Python GC request to 16 local workers to clean up Python cyclic references.
[2022-09-16 13:11:57,152 I 134 134] (raylet) node_manager.cc:2005: Shutdown RPC has received. Shutdown will happen after the RPC is replied.

/----------------/ raylet.20.out and others /----------------/

(base) ray@ip-172-31-78-144:/tmp/ray/session_latest/logs$ tail raylet.20.out -n 25
[2022-09-16 13:32:58,673 I 4519 4519] (raylet) accessor.cc:608: Received notification for node id = 9fb1084d915f8fc8f1051a145944b11f954016ff63b5103534e1a6c3, IsAlive = 1
[2022-09-16 13:32:58,673 I 4519 4519] (raylet) accessor.cc:608: Received notification for node id = f3b1fb42a3eade0d548408eb441bfae6ae0b70d7ce202706ccf81f6a, IsAlive = 0
[2022-09-16 13:32:58,673 I 4519 4519] (raylet) accessor.cc:608: Received notification for node id = 9773b3cedec83f2a48b6cfe683138f4a32c652534eafc06355c6cfb6, IsAlive = 0
[2022-09-16 13:32:58,716 C 4519 4519] (raylet) node_manager.cc:999: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS didn't receive heartbeats from this node for 30000 ms. This is likely because the machine or raylet has become overloaded.
*** StackTrace Information ***
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x485f6a) [0x562f9b0e3f6a] ray::operator<<()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x487a42) [0x562f9b0e5a42] ray::SpdLogMessage::Flush()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x487d57) [0x562f9b0e5d57] ray::RayLog::~RayLog()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x257943) [0x562f9aeb5943] ray::raylet::NodeManager::NodeRemoved()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x3bf77f) [0x562f9b01d77f] ray::gcs::NodeInfoAccessor::HandleNotification()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x3624ff) [0x562f9afc04ff] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x3969b5) [0x562f9aff49b5] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x3be520) [0x562f9b01c520] ray::rpc::GcsRpcClient::GetAllNodeInfo()::{lambda()#2}::operator()()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x363e85) [0x562f9afc1e85] ray::rpc::ClientCallImpl<>::OnReplyReceived()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x21f665) [0x562f9ae7d665] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x469da6) [0x562f9b0c7da6] EventTracker::RecordExecution()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x40d69e) [0x562f9b06b69e] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x40db16) [0x562f9b06bb16] boost::asio::detail::completion_handler<>::do_complete()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x994b6b) [0x562f9b5f2b6b] boost::asio::detail::scheduler::do_run_one()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x996331) [0x562f9b5f4331] boost::asio::detail::scheduler::run()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x996560) [0x562f9b5f4560] boost::asio::io_context::run()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x140ae4) [0x562f9ad9eae4] main
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fac99487083] __libc_start_main
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x17d777) [0x562f9addb777]
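
Reading the stack trace, the crash path is a node-removed notification (NodeInfoAccessor::HandleNotification) for the raylet's own node id, which NodeManager::NodeRemoved treats as fatal. A rough Python sketch of that control flow (the real code is C++; the class and method names below are only illustrative):

```python
class NodeManagerSketch:
    """Illustrative model of how a raylet reacts to node notifications."""

    def __init__(self, self_node_id: str):
        self.self_node_id = self_node_id

    def handle_notification(self, node_id: str, is_alive: bool) -> None:
        # Mirrors the accessor.cc:608 lines above: every raylet is told
        # about every node's ALIVE/DEAD transition.
        if not is_alive:
            self.node_removed(node_id)

    def node_removed(self, node_id: str) -> None:
        if node_id == self.self_node_id:
            # Equivalent of the node_manager.cc:999 fatal log: the GCS says
            # this very node is dead, so the raylet exits immediately.
            raise SystemExit(
                "[Timeout] Exiting because this node manager has mistakenly "
                "been marked as dead by the GCS")
        # Otherwise, clean up state for the remote node (omitted here).
```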

4. The State API shows some of the actors as RESTARTING on the dead node, but based on the core-worker logs that ran those actors, they have already exited

(base) ray:~/oss-release-tests-2% ray summary actors

======== Actors Summary: 2022-09-16 15:15:54.436892 ========
Stats:
------------------------------------
total_actors: 866

Table (group by class):
------------------------------------
    CLASS_NAME                                        STATE_COUNTS
0   run_actor_workload.<locals>.DBActor               ALIVE: 144
                                                      DEAD: 288
1   run_actor_workload.<locals>.ReportActor           ALIVE: 112
                                                      DEAD: 288
                                                      RESTARTING: 32
2   get_and_run_node_killer.<locals>.NodeKillerActor  ALIVE: 1
3   monitor_memory_usage.<locals>.MemoryMonitorActor  ALIVE: 1

One of the actors that is reported as RESTARTING shows that it is exiting:

(base) ray@ip-172-31-78-144:/tmp/ray/session_latest/logs$ tail python-core-worker-7d31747e1e42e3e04d9d76009305dad79fb4b2b08e6f308f0f6cbbd0_1587.log -n 20
        CoreWorkerDirectActorTaskSubmitter::SubmitTask - 600 total (0 active), CPU time: mean = 41.296 us, total = 24.777 ms
        NodeManagerService.grpc_client.ReportWorkerBacklog - 60 total (0 active), CPU time: mean = 7.648 us, total = 458.857 us
        CoreWorker.deadline_timer.flush_profiling_events - 60 total (1 active), CPU time: mean = 141.011 us, total = 8.461 ms
        CoreWorkerService.grpc_server.GetCoreWorkerStats - 55 total (0 active), CPU time: mean = 70.359 us, total = 3.870 ms
        PeriodicalRunner.RunFnPeriodically - 6 total (0 active), CPU time: mean = 56.058 us, total = 336.348 us
        StatsGcsService.grpc_client.AddProfileData - 4 total (0 active), CPU time: mean = 307.860 us, total = 1.231 ms
        InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 3 total (1 active), CPU time: mean = 157.820 us, total = 473.459 us
        CoreWorkerService.grpc_server.LocalGC - 2 total (0 active), CPU time: mean = 43.893 ms, total = 87.785 ms
        InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), CPU time: mean = 101.222 us, total = 202.445 us
        ActorInfoGcsService.grpc_client.GetActorInfo - 1 total (0 active), CPU time: mean = 900.804 us, total = 900.804 us
        CoreWorkerService.grpc_server.PubsubCommandBatch - 1 total (0 active), CPU time: mean = 40.554 us, total = 40.554 us
        Subscriber.HandlePublishedMessage_GCS_ACTOR_CHANNEL - 1 total (0 active), CPU time: mean = 430.075 us, total = 430.075 us
        CoreWorkerService.grpc_server.PubsubLongPolling - 1 total (0 active), CPU time: mean = 16.557 us, total = 16.557 us
        NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 186.496 us, total = 186.496 us
        Subscriber.HandlePublishedMessage_GCS_NODE_INFO_CHANNEL - 1 total (0 active), CPU time: mean = 52.593 us, total = 52.593 us
        WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 12.875 us, total = 12.875 us

[2022-09-16 13:11:39,853 I 1587 1838] accessor.cc:601: Notification for addition of a node that was already removed:f3b1fb42a3eade0d548408eb441bfae6ae0b70d7ce202706ccf81f6a
[2022-09-16 13:11:57,538 W 1587 1838] core_worker.cc:759: Shutting down the core worker because the local raylet failed. Check out the raylet.out log file. Raylet pid: 134

@rkooo567 @stephanie-wang @iycheng any thoughts?

rickyyx commented 1 year ago

We had a successful pass now, though: https://buildkite.com/ray-project/release-tests-branch/builds/1005#0183470b-bf6b-4bde-a27d-07e766210082

scv119 commented 1 year ago

Hmm, this is interesting. On the node where the raylet crashed, do we see high load in the state dump? It could be that the raylet is too busy for some reason and failed to send heartbeats in time.

stephanie-wang commented 1 year ago

Yes, seems like maybe there's an issue with the raylet coming under heavy load when restarting. The first raylet shutdown is normal, I think, since the test is purposefully killing nodes.

rickyyx commented 1 year ago

I see, that's possible. I think I glanced through it but wasn't paying close attention. Let me verify the raylet's load when this happens.

rickyyx commented 1 year ago

It seems to me that it's less about the raylet being under heavy load when restarting and more about an inconsistent state between the restarted node and the GCS.

So when a node gets restarted, here is what I think is happening (I am not 100% sure about the exact sequence of callbacks yet, though):

So a couple of questions:

  1. Is it expected that the same node gets restarted after the test kills it? @stephanie-wang @iycheng
    • The node is restarting with the same node id because RAY_RAYLET_NODE_ID is set on the node even though it is a restarted node (see the sketch after this list). @scv119
  2. What's the expected behaviour if a raylet gets a notification that the node it is on was removed (this seems to be just stale data on the GCS?) - should it log FATAL?
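
To make the suspected sequence concrete, below is a small, purely illustrative simulation of what pinning the node id appears to cause. The tombstone behavior is inferred from the crashes above, and none of the functions here are Ray APIs:

```python
# node_id -> "ALIVE" or "DEAD"; a DEAD entry is a tombstone.
gcs_node_table = {}

def gcs_register(node_id: str) -> str:
    """GCS-side registration: a dead node id is never resurrected."""
    if gcs_node_table.get(node_id) == "DEAD":
        return "REJECTED"  # DEAD -> ALIVE is not allowed
    gcs_node_table[node_id] = "ALIVE"
    return "OK"

# 1. First boot of the node: normal registration.
assert gcs_register("node-A") == "OK"

# 2. The chaos test kills the node; the GCS failure detector tombstones it.
gcs_node_table["node-A"] = "DEAD"

# 3. The autoscaler restarts the machine, but the pinned RAY_RAYLET_NODE_ID
#    makes the new raylet re-register under the tombstoned id. From the
#    GCS's point of view the node stays DEAD, so every raylet started this
#    way keeps seeing "your node was removed" notifications and exits,
#    matching the crash loop and the multiple raylet.N.out files above.
assert gcs_register("node-A") == "REJECTED"
```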
rickyyx commented 1 year ago

Synced with @iycheng offline; we are highly confident about the root cause and are working with the product side on a quick fix (a revert):

stephanie-wang commented 1 year ago

Thanks for looking into this, @rickyyx. This definitely seems like the root cause.

For context, we should never reuse NodeIDs. The reason is that reusing NodeIDs across raylet restarts makes it a lot more complicated to figure out whether a given raylet is alive or not. Actually, the 1.x architecture doc says this:

Each worker process and raylet is assigned a unique 20-byte identifier and an IP address and port. The same address and port can be reused by subsequent components (e.g., if a previous worker process dies), but the unique IDs are never reused (i.e., they are tombstoned upon process death).

Normally we do this by randomly generating NodeIDs, so if the testing infrastructure is somehow pinning the NodeID on raylet start, then yes, it is definitely not expected to work.
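
A minimal sketch of the scheme described above: 20 bytes of fresh randomness per raylet start, with dead ids kept as tombstones. The sizes follow the quoted architecture text; the helper names are made up for illustration:

```python
import os

# Ids of raylets that have died; per the architecture text they are
# tombstoned and never handed out again.
tombstoned_node_ids = set()

def new_node_id() -> str:
    """Generate a fresh 20-byte (40 hex chars) node id on every raylet start.

    Because the id is 160 random bits, a restarted raylet gets a brand-new
    identity, so the GCS can tell "old dead raylet on host X" apart from
    "new live raylet on host X" even though the IP and port are reused.
    """
    node_id = os.urandom(20).hex()
    assert node_id not in tombstoned_node_ids  # astronomically unlikely
    return node_id

def on_raylet_death(node_id: str) -> None:
    tombstoned_node_ids.add(node_id)
```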

Seems like this is probably the issue behind #28559 too, then.

Is it expected that the same node gets restarted after the test kills it? @stephanie-wang @iycheng The node is restarting with the same node id because RAY_RAYLET_NODE_ID is set on the node even though it is a restarted node. @scv119

Yes, the autoscaler will automatically try to restart nodes whenever the running set fails to fulfill the cluster config. This means it will try to reuse any live instances and rerun ray start on those instances.
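
Roughly, the reconciliation described above could be pictured like this (a conceptual sketch only, not the autoscaler's actual code or CLI):

```python
def reconcile(target_count: int, live_instances: list, ray_nodes: set) -> list:
    """Conceptual sketch of the autoscaler behavior described above.

    Instances that are still running but whose raylet died get `ray start`
    rerun on them; only if that is not enough are new instances launched.
    """
    commands = []
    for instance in live_instances:
        if instance not in ray_nodes:
            # Reuse the live instance by rerunning `ray start` on it. With a
            # pinned RAY_RAYLET_NODE_ID, this is exactly the reuse path that
            # re-registers a tombstoned node id.
            commands.append(f"on {instance}: ray start --address=<head-ip:port>")
    for _ in range(max(0, target_count - len(live_instances))):
        commands.append("launch a new instance, then ray start on it")
    return commands
```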

What's the expected behaviour if a raylet gets a notification that the node it is on was removed (seems this is just stale data on GCS?) - should it log FATAL?

Yes, we never allow raylets to go from DEAD -> ALIVE.

rickyyx commented 1 year ago

Yeah, I think we will work with the product team on a much better protocol for the feature they are working on. Thanks for confirming! @stephanie-wang

rickyyx commented 1 year ago

Also, on node ids: IIUC, the requirement here from the product side might be a predictable node id that is not internally generated when the raylet starts.

Have we considered other sources of node id generation? E.g., could it be a monotonically increasing id generator on the GCS? cc @stephanie-wang @iycheng

Also, the current node id is quite long but could actually be much shorter, right? (@rkooo567 and I chatted about this before.) Is there any requirement other than uniqueness that makes the NodeID need to be as long as 20 bytes?
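
For comparison, here is what a GCS-side monotonically increasing generator could look like next to the current random scheme (just a sketch to frame the question, not a proposal of actual Ray APIs):

```python
import itertools
import os

# Option A (current approach): each raylet draws 20 random bytes locally.
# No coordination is needed, but the id is long and unpredictable.
def random_node_id() -> str:
    return os.urandom(20).hex()

# Option B (the question above): the GCS hands out short, predictable ids
# from a single counter. Uniqueness now depends on the GCS persisting the
# counter across restarts and never re-issuing ids of dead nodes.
_gcs_counter = itertools.count(start=1)

def gcs_assigned_node_id() -> int:
    return next(_gcs_counter)
```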

rickyyx commented 1 year ago

Successful runs: https://b534fd88.us1a.app.preset.io/superset/dashboard/p/j6G4EY14keo/