Open edoakes opened 1 year ago
Here are the full cluster logs for the release test failure run: logs.zip
One possible (unsubstantiated) theory: in this test, the controller and the actors it creates are killed repeatedly. It could be that the controller is killed immediately after creating a replica, in which case the raylet may not yet have marked the worker running the replica as being a detached actor.
Looks like we only mark the actor as detached after the creation task finishes: https://github.com/ray-project/ray/blob/f5c59745d00982835feb145d14d1f9e0d4b0db6c/src/ray/raylet/node_manager.cc#L2186
This means that if the creation task is sent to the actor and the owner then dies before the task finishes, the actor may be killed by HandleUnexpectedWorkerFailure.
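To make the suspected ordering concrete, here is a toy Python model of the race. It is not the raylet's code: `ToyRaylet`, `Worker`, and every method name below are hypothetical, and `handle_owner_failure` only loosely stands in for the fate-sharing path in HandleUnexpectedWorkerFailure. It assumes, per the theory above, that the detached flag is set only once the creation task finishes.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    owner_id: str
    # Assumption from the theory above: this flag is only set once the
    # actor creation task has finished, not when it is dispatched.
    is_detached: bool = False
    alive: bool = True


@dataclass
class ToyRaylet:
    """Hypothetical stand-in for the raylet's worker bookkeeping."""
    workers: dict = field(default_factory=dict)

    def dispatch_creation_task(self, actor_id: str, owner_id: str) -> None:
        # The creation task has been sent, but the worker is not yet
        # marked as running a detached actor.
        self.workers[actor_id] = Worker(owner_id=owner_id)

    def finish_creation_task(self, actor_id: str) -> None:
        worker = self.workers[actor_id]
        if worker.alive:
            worker.is_detached = True

    def handle_owner_failure(self, owner_id: str) -> None:
        # Fate sharing: kill every worker owned by the dead owner unless
        # it has already been marked detached.
        for worker in self.workers.values():
            if worker.owner_id == owner_id and not worker.is_detached:
                worker.alive = False


def replica_survives(owner_dies_before_creation_finishes: bool) -> bool:
    raylet = ToyRaylet()
    raylet.dispatch_creation_task("replica", owner_id="controller")
    if owner_dies_before_creation_finishes:
        raylet.handle_owner_failure("controller")
        raylet.finish_creation_task("replica")
    else:
        raylet.finish_creation_task("replica")
        raylet.handle_owner_failure("controller")
    return raylet.workers["replica"].alive
```

In this model the replica survives only when the creation task finishes before the controller dies. Marking the worker as detached at dispatch time would close the window in the toy model; whether that is the right fix in the raylet is a separate question.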
While debugging a release test failure, we discovered that in some cases Serve replica actors are being killed due to fate sharing with the controller.
This should never happen because all actors started by Serve (the controller, replicas, proxies) are detached, so they should not fate share with the controller (relevant code in the raylet).
We see a number of log lines like the following in the raylet logs across multiple runs of the Serve long-running failure test case:
All of the referenced actors are detached actors.