As far as I can tell, no core changes were made between these commits, but it has now failed twice in a row.
Rerunning failed as well. I'm running it manually here https://console.anyscale.com/o/anyscale-internal/projects/prj_FKRmeV5pA6X72aVscFALNC32/clusters/ses_hwhsTD54MJmfFjY4Q4jjW72Q?command-history-section=command_history
so that I can get access to the GCS logs.
I still haven't been able to pin down the root cause. Essentially, the test gets stuck in a weird state while the GCS is still alive:
1. Repeated reporting of the dead raylet on the same dead node
2022-09-16 15:06:49,327 WARNING worker.py:1828 -- The node with node id: 9773b3cedec83f2a48b6cfe683138f4a32c652534eafc06355c6cfb6 and address: 172.31.78.144 and node name: 172.31.78.144 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.)
(2) raylet has lagging heartbeats due to slow network or busy workload.
[2022-09-16 13:13:46,502 C 2116 2116] (raylet) node_manager.cc:999: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS didn't receive heartbeats from this node for 30000 ms. This is likely because the machine or raylet has become overloaded.
2. Multiple raylet instances on the dead node
3. The first raylet on the dead node received ShutdownRaylet while the others did not
/----------------/ raylet.out /----------------/
(base) ray@ip-172-31-78-144:/tmp/ray/session_latest/logs$ tail raylet.out
[state-dump] AgentManagerService.grpc_server.RegisterAgent - 1 total (0 active), CPU time: mean = 170.201 us, total = 170.201 us
[state-dump] DebugString() time ms: 0
[state-dump]
[state-dump]
[2022-09-16 13:11:26,005 I 134 134] (raylet) node_manager.cc:661: Sending Python GC request to 16 local workers to clean up Python cyclic references.
[2022-09-16 13:11:36,090 I 134 134] (raylet) node_manager.cc:661: Sending Python GC request to 16 local workers to clean up Python cyclic references.
[2022-09-16 13:11:39,852 I 134 134] (raylet) accessor.cc:601: Notification for addition of a node that was already removed:f3b1fb42a3eade0d548408eb441bfae6ae0b70d7ce202706ccf81f6a
[2022-09-16 13:11:46,098 I 134 134] (raylet) node_manager.cc:661: Sending Python GC request to 16 local workers to clean up Python cyclic references.
[2022-09-16 13:11:56,177 I 134 134] (raylet) node_manager.cc:661: Sending Python GC request to 16 local workers to clean up Python cyclic references.
[2022-09-16 13:11:57,152 I 134 134] (raylet) node_manager.cc:2005: Shutdown RPC has received. Shutdown will happen after the RPC is replied.
/----------------/ raylet.20.out and others /----------------/
(base) ray@ip-172-31-78-144:/tmp/ray/session_latest/logs$ tail raylet.20.out -n 25
[2022-09-16 13:32:58,673 I 4519 4519] (raylet) accessor.cc:608: Received notification for node id = 9fb1084d915f8fc8f1051a145944b11f954016ff63b5103534e1a6c3, IsAlive = 1
[2022-09-16 13:32:58,673 I 4519 4519] (raylet) accessor.cc:608: Received notification for node id = f3b1fb42a3eade0d548408eb441bfae6ae0b70d7ce202706ccf81f6a, IsAlive = 0
[2022-09-16 13:32:58,673 I 4519 4519] (raylet) accessor.cc:608: Received notification for node id = 9773b3cedec83f2a48b6cfe683138f4a32c652534eafc06355c6cfb6, IsAlive = 0
[2022-09-16 13:32:58,716 C 4519 4519] (raylet) node_manager.cc:999: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS didn't receive heartbeats from this node for 30000 ms. This is likely because the machine or raylet has become overloaded.
*** StackTrace Information ***
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x485f6a) [0x562f9b0e3f6a] ray::operator<<()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x487a42) [0x562f9b0e5a42] ray::SpdLogMessage::Flush()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x487d57) [0x562f9b0e5d57] ray::RayLog::~RayLog()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x257943) [0x562f9aeb5943] ray::raylet::NodeManager::NodeRemoved()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x3bf77f) [0x562f9b01d77f] ray::gcs::NodeInfoAccessor::HandleNotification()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x3624ff) [0x562f9afc04ff] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x3969b5) [0x562f9aff49b5] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x3be520) [0x562f9b01c520] ray::rpc::GcsRpcClient::GetAllNodeInfo()::{lambda()#2}::operator()()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x363e85) [0x562f9afc1e85] ray::rpc::ClientCallImpl<>::OnReplyReceived()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x21f665) [0x562f9ae7d665] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x469da6) [0x562f9b0c7da6] EventTracker::RecordExecution()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x40d69e) [0x562f9b06b69e] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x40db16) [0x562f9b06bb16] boost::asio::detail::completion_handler<>::do_complete()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x994b6b) [0x562f9b5f2b6b] boost::asio::detail::scheduler::do_run_one()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x996331) [0x562f9b5f4331] boost::asio::detail::scheduler::run()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x996560) [0x562f9b5f4560] boost::asio::io_context::run()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x140ae4) [0x562f9ad9eae4] main
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fac99487083] __libc_start_main
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x17d777) [0x562f9addb777]
4. The state API shows some of the actors as RESTARTING on the dead node, but based on the core-worker logs that ran them, those actors have already exited
(base) ray:~/oss-release-tests-2% ray summary actors
======== Actors Summary: 2022-09-16 15:15:54.436892 ========
Stats:
------------------------------------
total_actors: 866
Table (group by class):
------------------------------------
CLASS_NAME STATE_COUNTS
0 run_actor_workload.<locals>.DBActor ALIVE: 144
DEAD: 288
1 run_actor_workload.<locals>.ReportActor ALIVE: 112
DEAD: 288
RESTARTING: 32
2 get_and_run_node_killer.<locals>.NodeKillerActor ALIVE: 1
3 monitor_memory_usage.<locals>.MemoryMonitorActor ALIVE: 1
One of the actors reported as RESTARTING shows that it is exiting:
[2022-09-16 13:11:39,853 I 1587 1838] accessor.cc:601: Notification for addition of a node that was already removed:f3b1fb42a3eade0d548408eb441bfae6ae0b70d7ce202706ccf81f6a
[2022-09-16 13:11:57,538 W 1587 1838] core_worker.cc:759: Shutting down the core worker because the local raylet failed. Check out the raylet.out log file. Raylet pid: 134
(base) ray@ip-172-31-78-144:/tmp/ray/session_latest/logs$ tail python-core-worker-7d31747e1e42e3e04d9d76009305dad79fb4b2b08e6f308f0f6cbbd0_1587.log -n 20
CoreWorkerDirectActorTaskSubmitter::SubmitTask - 600 total (0 active), CPU time: mean = 41.296 us, total = 24.777 ms
NodeManagerService.grpc_client.ReportWorkerBacklog - 60 total (0 active), CPU time: mean = 7.648 us, total = 458.857 us
CoreWorker.deadline_timer.flush_profiling_events - 60 total (1 active), CPU time: mean = 141.011 us, total = 8.461 ms
CoreWorkerService.grpc_server.GetCoreWorkerStats - 55 total (0 active), CPU time: mean = 70.359 us, total = 3.870 ms
PeriodicalRunner.RunFnPeriodically - 6 total (0 active), CPU time: mean = 56.058 us, total = 336.348 us
StatsGcsService.grpc_client.AddProfileData - 4 total (0 active), CPU time: mean = 307.860 us, total = 1.231 ms
InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 3 total (1 active), CPU time: mean = 157.820 us, total = 473.459 us
CoreWorkerService.grpc_server.LocalGC - 2 total (0 active), CPU time: mean = 43.893 ms, total = 87.785 ms
InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), CPU time: mean = 101.222 us, total = 202.445 us
ActorInfoGcsService.grpc_client.GetActorInfo - 1 total (0 active), CPU time: mean = 900.804 us, total = 900.804 us
CoreWorkerService.grpc_server.PubsubCommandBatch - 1 total (0 active), CPU time: mean = 40.554 us, total = 40.554 us
Subscriber.HandlePublishedMessage_GCS_ACTOR_CHANNEL - 1 total (0 active), CPU time: mean = 430.075 us, total = 430.075 us
CoreWorkerService.grpc_server.PubsubLongPolling - 1 total (0 active), CPU time: mean = 16.557 us, total = 16.557 us
NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 186.496 us, total = 186.496 us
Subscriber.HandlePublishedMessage_GCS_NODE_INFO_CHANNEL - 1 total (0 active), CPU time: mean = 52.593 us, total = 52.593 us
WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 12.875 us, total = 12.875 us
[2022-09-16 13:11:39,853 I 1587 1838] accessor.cc:601: Notification for addition of a node that was already removed:f3b1fb42a3eade0d548408eb441bfae6ae0b70d7ce202706ccf81f6a
[2022-09-16 13:11:57,538 W 1587 1838] core_worker.cc:759: Shutting down the core worker because the local raylet failed. Check out the raylet.out log file. Raylet pid: 134
@rkooo567 @stephanie-wang @iycheng any thoughts?
We did have a successful pass now, though: https://buildkite.com/ray-project/release-tests-branch/builds/1005#0183470b-bf6b-4bde-a27d-07e766210082
Hmm, this is interesting. On the node where the raylet crashed, do we see high load in the state dump? It could be that the raylet is too busy for some reason and failed to send heartbeats in time.
Yes, it seems like there may be an issue with the raylet coming under heavy load when restarting. The first raylet shutdown is normal, I think, since the test is purposely killing nodes.
I see, that's possible. I think I glanced through it but wasn't paying close attention. Let me verify the raylet's load when this happens.
It seems to me that it's less about the raylet being under heavy load when restarting, and more about an inconsistent state between the restarted node and the GCS.
So when a node gets restarted, here is what I think is happening (I am not 100% sure about the exact sequence of the callbacks yet, though).
So a couple of questions:
RAY_RAYLET_NODE_ID is set on the node even though it's a restarted node. @scv119

Synced with @iycheng offline: we are highly confident in the root cause and are working with the product side on a quick fix (revert).
Thanks for looking into this, @rickyyx. This definitely seems like the root cause.
For context, we should never reuse NodeIDs. The reason is that if you reuse NodeIDs across raylet restarts, it becomes a lot more complicated to figure out whether a given raylet is alive or not. The 1.x architecture doc actually says this:
Each worker process and raylet is assigned a unique 20-byte identifier and an IP address and port. The same address and port can be reused by subsequent components (e.g., if a previous worker process dies), but the unique IDs are never reused (i.e., they are tombstoned upon process death).
Normally we do this by randomly generating NodeIDs, so if the testing infrastructure is somehow pinning the NodeID on raylet start, then yes, it is definitely not expected to work.
Seems like this is probably the issue behind #28559 too, then.
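To make that invariant concrete, here is a minimal Python sketch (not Ray's actual C++ implementation; the class and function names are made up for illustration) of randomly generated 20-byte node IDs with tombstoning. Pinning RAY_RAYLET_NODE_ID across restarts is exactly the case the tombstone check is meant to rule out:

```python
import os

NODE_ID_LENGTH = 20  # bytes, per the 1.x architecture doc quoted above


def generate_node_id() -> str:
    """Generate a fresh, random node ID (hex-encoded 20 bytes)."""
    return os.urandom(NODE_ID_LENGTH).hex()


class NodeRegistrySketch:
    """Tracks alive and dead (tombstoned) node IDs; IDs are never reused."""

    def __init__(self) -> None:
        self.alive = set()
        self.dead = set()  # tombstones: once here, an ID can never come back

    def register(self, node_id: str) -> None:
        if node_id in self.dead:
            # The broken case in this issue: a restarted raylet reusing a
            # tombstoned ID would be rejected (or, in the real system,
            # immediately told it is dead).
            raise ValueError(f"node ID {node_id} was already tombstoned")
        self.alive.add(node_id)

    def mark_dead(self, node_id: str) -> None:
        self.alive.discard(node_id)
        self.dead.add(node_id)
```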
Is it expected that the same node gets restarted after the test kills it? @stephanie-wang @iycheng So the node is now restarting with the same node ID because RAY_RAYLET_NODE_ID is set on the node even though it's a restarted node. @scv119
Yes, the autoscaler will automatically try to restart nodes if some fail to fulfill the cluster config. This means it will try to reuse any live instances and rerun ray start on those instances.
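A hedged sketch of why that reuse is problematic in this test: when the autoscaler reruns `ray start` on a reused instance, any RAY_RAYLET_NODE_ID left over in that instance's environment brings the restarted raylet back with an ID the GCS has already marked dead. The function name below is hypothetical; only `ray start --address=...` is a real command:

```python
import os
import subprocess


def rejoin_cluster_on_reused_instance(head_address: str) -> None:
    env = os.environ.copy()
    # Problematic: on a reused instance this is still the *old* node's ID.
    pinned_id = env.get("RAY_RAYLET_NODE_ID")
    if pinned_id is not None:
        print(f"warning: RAY_RAYLET_NODE_ID is still set to {pinned_id}; "
              "the GCS may have already tombstoned this ID")
    # Rerun `ray start` to rejoin the cluster, as the autoscaler does.
    subprocess.run(["ray", "start", f"--address={head_address}"],
                   env=env, check=True)
```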
What's the expected behaviour if a raylet gets a notification that the node it is on was removed (it seems this is just stale data on the GCS)? Should it log FATAL?
Yes, we never allow raylets to go from DEAD -> ALIVE.
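Conceptually, the behaviour confirmed here looks like the following simplified Python sketch (not the real node_manager.cc code, just an illustration of the handler seen in the stack trace above):

```python
import sys


def on_node_removed(removed_node_id: str, my_node_id: str) -> None:
    if removed_node_id == my_node_id:
        # Mirrors the node_manager.cc:999 check in the logs: if the GCS says
        # our own node is dead, exit fatally rather than trying to resurrect
        # it, because DEAD -> ALIVE is never allowed for a given node ID.
        sys.exit("[Timeout] Exiting because this node manager has been "
                 "marked as dead by the GCS.")
    # Otherwise, clean up local state that refers to the removed remote node.
```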
Yeah, I think we will work with the product team on a much better protocol for the feature they are working on. Thanks for confirming! @stephanie-wang
Also, on node IDs: IIUC, the requirement from the product side here might be a predictable node ID that is not internally generated when the raylet starts.
Have we considered other sources of node ID generation? E.g., could it be a monotonically increasing ID generator on the GCS? cc @stephanie-wang @iycheng
Also, the current node ID is quite long but could actually be much shorter, right? (@rkooo567 and I chatted about this before.) Is there any requirement other than uniqueness that forces the NodeID to be as long as 20 bytes?
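For illustration only (the names are hypothetical and this is just a sketch of the idea floated above, not anything that exists in Ray), a GCS-side monotonic ID allocator could look like this. It keeps IDs short and predictable while still never reusing them, at the cost of the GCS having to persist the counter across restarts, which random IDs avoid:

```python
import itertools
import threading


class GcsNodeIdAllocatorSketch:
    """GCS-side allocator handing out short, monotonically increasing IDs."""

    def __init__(self) -> None:
        self._counter = itertools.count(start=1)
        self._lock = threading.Lock()

    def allocate(self) -> int:
        # Never repeats, so tombstoning still works; every `ray start` would
        # ask the GCS for a fresh ID instead of generating one locally.
        with self._lock:
            return next(self._counter)
```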
Successful runs: https://b534fd88.us1a.app.preset.io/superset/dashboard/p/j6G4EY14keo/
What happened + What you expected to happen
command_scd_W8xe3jzdaCEE1qykk7UwiMEk.log
Nodes dying:
And later the GCS crashing or unreachable:
Versions / Dependencies
Good commit: 7ba37885c67844d2eb18a63dcf3b7ac7f66ce89f
Bad commit: b878fd28d25f17d5bf23563f73c4ab19b6ab625e
Reproduction script
-
Issue Severity
No response