xieus closed this issue 11 months ago
Thanks for the report - the placement group should get cleaned up alongside the replica. If that's not happening, it could suggest a Ray Core bug.
We will be switching to native Ray Serve placement group support soon - curious to see if that fixes the issue.
@scv119 @Yard1 based on my understanding, there is no GC for placement groups and the expectation is always that PG creators should release the resource. Please correct me if I'm wrong.
I'm not sure whether native Ray Serve placement group support will clean up placement groups automatically. If it does, the issue should go away.
Symptoms: When a user script hits an exception, the associated RayWorker actor is marked as dead, but the node that hosted it cannot be scaled down. Even when no actors are left (GPU, CPU, and memory usage all drop to zero), the worker node cannot be removed because some placement groups remain on it.
Theory: Aviary uses the Serve API to create a placement group but does not release it when the actor dies. As a result, the placement group is leaked and blocks termination of the idle node. Note that there is no GC for placement groups; callers are always expected to release the resource.
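To illustrate the "creator releases the resource" expectation, here is a minimal sketch of a context manager that removes a placement group even when the code using it raises. It relies on `ray.util.placement_group` and `ray.util.remove_placement_group`; the helper name `scoped_placement_group` and its `create`/`remove` parameters are hypothetical, added only so the pattern can be exercised without a cluster. This is not how Aviary or Ray Serve manages placement groups, and it only covers failures that surface in the creator's own code path.

```python
from contextlib import contextmanager


@contextmanager
def scoped_placement_group(bundles, strategy="PACK", create=None, remove=None):
    """Create a placement group and always release it on exit.

    `create` and `remove` are hypothetical injection points for testing;
    when omitted, Ray's own functions are used.

    Example (assumes a running Ray cluster):
        with scoped_placement_group([{"CPU": 1}]) as pg:
            ray.get(pg.ready())  # wait until the bundles are reserved
            ...                  # schedule actors or tasks into pg
    """
    if create is None or remove is None:
        # Import lazily so the helper can be tested without Ray installed.
        from ray.util import placement_group, remove_placement_group

        create = create or placement_group
        remove = remove or remove_placement_group

    pg = create(bundles, strategy=strategy)
    try:
        yield pg
    finally:
        # Runs on normal exit and on exceptions, so the reservation is
        # not left behind on the node.
        remove(pg)
```

The `finally` clause is the whole point: since nothing garbage-collects placement groups, the release has to sit on every exit path of whoever created the group.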
Reproduce Env:
Aviary version: 0.2.1
Ray version: nightly
Ray dashboard: https://session-xtfeimv54hk6bt23g5lc9eputm.i.anyscaleuserdata-staging.com/#/cluster
Cluster url: https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_zvyp4jhsu8g9in8j5t7cwf1c1d/clusters/ses_xtfeimv54hk6bt23g5lc9eputm?user=usr_b9yhdfc2syn6sx3wiqvyw1tzc2