Placement group is not released when user code exits, resulting in a resource leak

xieus commented 1 year ago

Symptoms: When a user script hits an exception, the associated RayWorker actors are marked as dead, but the node that hosted them can't be scaled down. Even when no actors are left (GPU, CPU, and memory usage all drop to zero), the worker node can't be removed because some placement groups remain.
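One way to confirm this symptom might be to look at the placement group table and see which entries have not been removed. The sketch below is only an illustration: `live_placement_groups` is a hypothetical helper, and it assumes each table entry is a dict with a `"state"` field whose value is `"REMOVED"` once the group has been released.

```python
def live_placement_groups(table):
    """Return the table entries that have not been removed.

    `table` is assumed to map a placement group id to an info dict
    containing a "state" key (an assumption, not confirmed in this thread).
    """
    return {
        pg_id: info
        for pg_id, info in table.items()
        # Anything not marked REMOVED may still be holding node resources.
        if info.get("state") != "REMOVED"
    }
```

On a live cluster the input would presumably come from `ray.util.placement_group_table()`, e.g. `live_placement_groups(ray.util.placement_group_table())`.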

The theory is that Aviary uses the Serve API to create a placement group but doesn't release it when the actor dies. As a result, the placement group is leaked and blocks the termination of an idle node. Note that there is no GC for placement groups; the expectation is always that callers release the resource.
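To illustrate the "caller releases the resource" expectation, here is a minimal sketch of releasing a placement group even when user code raises. It is not how Aviary or Ray Serve does this; `reserved_placement_group` and its parameters are hypothetical names, and the only Ray calls used are `placement_group`, `remove_placement_group`, `pg.ready()`, and `ray.get`.

```python
from contextlib import contextmanager


@contextmanager
def reserved_placement_group(bundles, strategy="PACK", ready_timeout_s=60):
    """Hypothetical helper: reserve a placement group and always release it."""
    # Imported lazily so the helper can be defined without Ray being loaded.
    import ray
    from ray.util.placement_group import (
        placement_group,
        remove_placement_group,
    )

    pg = placement_group(bundles, strategy=strategy)
    try:
        # Wait until the bundles are reserved; raises if the timeout expires.
        ray.get(pg.ready(), timeout=ready_timeout_s)
        yield pg
    finally:
        # Runs on normal exit and when the body raises, so the reserved
        # bundles do not keep an otherwise idle node alive.
        remove_placement_group(pg)
```

A caller would wrap the user code, for example `with reserved_placement_group([{"CPU": 1}]) as pg: ...`, so that an exception inside the block still triggers the release.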

Reproduce Env:
Aviary version: 0.2.1
Ray version: nightly
Ray dashboard: https://session-xtfeimv54hk6bt23g5lc9eputm.i.anyscaleuserdata-staging.com/#/cluster
Cluster url: https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_zvyp4jhsu8g9in8j5t7cwf1c1d/clusters/ses_xtfeimv54hk6bt23g5lc9eputm?user=usr_b9yhdfc2syn6sx3wiqvyw1tzc2

Yard1 commented 1 year ago

Thanks for the report - the placement group should get cleaned up alongside the replica. If that's not happening, it could suggest a Ray Core bug.

We will be switching to native Ray Serve placement group support soon - curious to see if that fixes the issue.

xieus commented 1 year ago

@scv119 @Yard1 based on my understanding, there is no GC for placement groups and the expectation is always that PG creators should release the resource. Please correct me if I'm wrong.

I'm not sure whether native Ray Serve placement group support will clean up placement groups automatically. If it does, the issue should go away.

ray-project / ray-llm

Placement group is not released when user code exits, resulting in a resource leak #51