ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[core] Detached actor being killed when its parent actor crashes #40864

Open edoakes opened 1 year ago

edoakes commented 1 year ago

While debugging a release test failure, we discovered that in some cases Serve replica actors are being killed due to fate sharing with the controller.

This should never happen because all actors started by Serve (the controller, replicas, proxies) are detached, so they should not fate share with the controller (relevant code in the raylet).

We see a number of log lines like the following in the Raylet logs in multiple runs of the Serve long-running failure test case:

[2023-11-01 07:10:17,273 I 825 825] (raylet) node_manager.cc:1104: The leased worker dd7d4d82da8fef21e59667dba16f2bce15203c8832039284cbb26461 is killed because the owner process 2b60b506544d378c192b7e1cbf989be4058f41015c00fa7f30e50f91 died.
[2023-11-01 07:10:17,273 I 825 825] (raylet) node_manager.cc:1104: The leased worker 3844620037d1ea4a19c830bb548edd9726cd4521cc78f2c7871367d6 is killed because the owner process 2b60b506544d378c192b7e1cbf989be4058f41015c00fa7f30e50f91 died.

All of the referenced actors are detached actors.
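For anyone scanning the attached logs for the same pattern, here is a minimal sketch that groups killed workers by the owner that died. It assumes the message format matches the two lines quoted above exactly; the function and variable names are made up for illustration and are not part of Ray.

```python
import re
from collections import defaultdict

# Matches the raylet message quoted above. The group names are illustrative.
KILLED_RE = re.compile(
    r"The leased worker (?P<worker>[0-9a-f]+) is killed because "
    r"the owner process (?P<owner>[0-9a-f]+) died"
)


def workers_killed_by_owner(lines):
    """Map each dead owner ID to the leased worker IDs killed along with it."""
    killed = defaultdict(list)
    for line in lines:
        match = KILLED_RE.search(line)
        if match:
            killed[match.group("owner")].append(match.group("worker"))
    return dict(killed)


# The two log lines from this issue, used as sample input.
SAMPLE_LINES = [
    "[2023-11-01 07:10:17,273 I 825 825] (raylet) node_manager.cc:1104: "
    "The leased worker dd7d4d82da8fef21e59667dba16f2bce15203c8832039284cbb26461 "
    "is killed because the owner process "
    "2b60b506544d378c192b7e1cbf989be4058f41015c00fa7f30e50f91 died.",
    "[2023-11-01 07:10:17,273 I 825 825] (raylet) node_manager.cc:1104: "
    "The leased worker 3844620037d1ea4a19c830bb548edd9726cd4521cc78f2c7871367d6 "
    "is killed because the owner process "
    "2b60b506544d378c192b7e1cbf989be4058f41015c00fa7f30e50f91 died.",
]

by_owner = workers_killed_by_owner(SAMPLE_LINES)
```

The resulting worker IDs can then be cross-checked against the actor table to confirm whether each one was hosting a detached actor.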

edoakes commented 1 year ago

Here are the full cluster logs for the release test failure run: logs.zip

edoakes commented 1 year ago

One possible (unsubstantiated) theory: in this test, the controller and the actors it creates are killed repeatedly. It could be that the controller is killed immediately after creating a replica, in which case the raylet may not yet have marked the worker running the replica as being a detached actor.

edoakes commented 1 year ago

Looks like we only mark the actor as detached after the creation task finishes: https://github.com/ray-project/ray/blob/f5c59745d00982835feb145d14d1f9e0d4b0db6c/src/ray/raylet/node_manager.cc#L2186

This means that if the creation task is sent to the actor and the owner dies before the task finishes, the actor may be killed by HandleUnexpectedWorkerFailure.
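To make the suspected ordering problem concrete, here is a toy Python model of it. This is not the raylet's actual code: `ToyRaylet`, `Worker`, and all method names are hypothetical, and the only behavior modeled is the one described above (the detached flag is recorded only once the creation task finishes, and owner death kills any leased worker not yet flagged).

```python
from dataclasses import dataclass


@dataclass
class Worker:
    worker_id: str
    owner_id: str
    is_detached_actor: bool = False  # set only after the creation task finishes
    alive: bool = True


class ToyRaylet:
    """Hypothetical stand-in for the raylet's worker bookkeeping."""

    def __init__(self):
        self.workers = {}

    def lease_worker(self, worker_id, owner_id):
        # The creation task has been sent, but the worker is not yet marked.
        self.workers[worker_id] = Worker(worker_id, owner_id)

    def finish_creation_task(self, worker_id, detached):
        worker = self.workers[worker_id]
        if worker.alive:
            worker.is_detached_actor = detached

    def handle_owner_death(self, owner_id):
        # Kill every live worker leased by the dead owner unless it has
        # already been marked as a detached actor.
        killed = []
        for worker in self.workers.values():
            if (worker.alive and worker.owner_id == owner_id
                    and not worker.is_detached_actor):
                worker.alive = False
                killed.append(worker.worker_id)
        return killed


raylet = ToyRaylet()

# Replica A: creation finished before the controller died, so it is marked.
raylet.lease_worker("replica-a", "controller")
raylet.finish_creation_task("replica-a", detached=True)

# Replica B: creation task still in flight when the controller dies.
raylet.lease_worker("replica-b", "controller")

killed = raylet.handle_owner_death("controller")
```

In this model only `replica-b` is killed, even though both replicas were requested as detached. If the theory holds, recording the detached flag when the worker is leased rather than when creation completes would close the window, though whether that is feasible in the real raylet is an open question.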