jfaust opened 3 months ago
Even with the Actor referenced in the above log completely removed, I still get:
```
[2024-04-02 17:41:53,187 I 3724 3858] core_worker.cc:4197: Force exiting worker that owns object. This may cause other workers that depends on the object to lose it. Own objects: 1 # Pins in flight: 0
[2024-04-02 17:41:53,187 I 3724 3858] core_worker.cc:835: Exit signal received, this process will exit after all outstanding tasks have finished, exit_type=INTENDED_SYSTEM_EXIT, detail=Worker exits because it was idle (it doesn't have objects it owns while no task or actor has been scheduled) for a long time.
[2024-04-02 17:41:53,187 W 3724 3724] reference_count.cc:54: This worker is still managing 1 objects, waiting for them to go out of scope before shutting down.
```
And the worker never exits, so the Actor bit may be a red herring.
A little more information from looking at different logs: the worker that never exits appears to belong to the very top-level task in our pipeline, a task that just submits a bunch of other tasks and then waits for their results.
After compiling Ray from source and enabling debug logging in the core, I found a repro (added to the initial comment).
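(Side note for anyone else digging into this: as far as I know, the core's debug logging can also be enabled without a source build via the documented `RAY_BACKEND_LOG_LEVEL` environment variable. A sketch:)

```python
# Sketch: RAY_BACKEND_LOG_LEVEL raises the log level of Ray's C++ core
# processes. Set it before ray.init() so the locally started Ray processes
# inherit it; no source build should be needed for this.
import os
os.environ["RAY_BACKEND_LOG_LEVEL"] = "debug"

import ray
ray.init(namespace="test")
```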
Looks like maybe it doesn't need to be returned to the Driver. This seems to reproduce as well:
```python
import ray
import numpy as np

ray.init(namespace="test")


@ray.remote(name="actor", get_if_exists=True, lifetime="detached")
class Actor:
    def do_work(self):
        arr = np.zeros((10000, 1000))
        np.multiply(arr, arr)


@ray.remote
def _create_actor():
    return Actor.remote()


@ray.remote
def _toplevel():
    actor = ray.get(_create_actor.remote())
    tasks = [actor.do_work.remote() for _ in range(0, 100)]
    ray.get(tasks)


ray.get(_toplevel.remote())
```
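To see the leak after running this a few times, here is a small sketch using the Ray state API (assuming Ray 2.x, where `ray.util.state.list_workers` is available) to list the workers the cluster still considers alive:

```python
from ray.util.state import list_workers

# List workers the cluster still considers alive. After repeated runs of the
# repro, extra IDLE workers linger here even though no task or actor is running.
for worker in list_workers():
    if worker.is_alive:
        print(worker.pid, worker.worker_type, worker.worker_id)
```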
@jfaust what is the workaround? this is causing a lot of issues on our cluster
@bug-catcher in my case it was easy to create the Actor, pass it around, and never return it to the driver. That's not a viable workaround in many cases, but it worked for us.
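A minimal sketch of that workaround applied to the repro above (assuming the same names and namespace as the repro): create the detached actor without returning its handle, and have downstream tasks fetch it by name with `ray.get_actor` instead of receiving the handle as a return value.

```python
import ray
import numpy as np

ray.init(namespace="test")


@ray.remote(name="actor", get_if_exists=True, lifetime="detached")
class Actor:
    def do_work(self):
        arr = np.zeros((10000, 1000))
        np.multiply(arr, arr)


@ray.remote
def _create_actor():
    # Create (or attach to) the detached actor, but do not return the handle.
    Actor.remote()


@ray.remote
def _toplevel():
    # Look the actor up by name instead of receiving the handle from a task.
    actor = ray.get_actor("actor", namespace="test")
    ray.get([actor.do_work.remote() for _ in range(100)])


ray.get(_create_actor.remote())
ray.get(_toplevel.remote())
```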
What happened + What you expected to happen
EDIT: I've substantially changed this description based on having found a repro.
We have a pipeline with a detached actor that is shared among a number of jobs. That actor was being returned to the Driver script, somewhat coincidentally.
Each time you return a detached Actor handle to the driver like that, Ray leaks an IDLE worker.
If I run the repro below 3 times in a row, I end up with two of these:

![image](https://github.com/ray-project/ray/assets/1388377/0f900154-e042-4493-88f2-2b9a93971513)
Note that the worker has no ID (which I believe means it's about to exit), but it never exits. Looking at the python-core-worker log for any of those workers, you'll see log lines like the ones quoted in the comments above.
The worker never exits, even if the Actor is killed.
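For context, "killed" here means an explicit `ray.kill` on the detached actor, which can be fetched by name. A sketch using the names from the repro:

```python
import ray

ray.init(namespace="test")

# Detached actors outlive their creator and must be killed explicitly.
# Per the report above, even this does not make the leaked worker exit.
ray.kill(ray.get_actor("actor", namespace="test"))
```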
Versions / Dependencies
Reproduction script
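The original script is not preserved in this excerpt; below is a minimal sketch of the pattern described above (a detached actor handle created inside a task and returned all the way to the driver), with illustrative names.

```python
import ray

ray.init(namespace="test")


@ray.remote(name="actor", get_if_exists=True, lifetime="detached")
class Actor:
    def ping(self):
        return "pong"


@ray.remote
def _create_actor():
    # Returning the handle makes this worker the owner of the returned
    # object; per this issue, the worker then never exits.
    return Actor.remote()


# The detached actor handle reaches the driver; an IDLE worker is leaked.
actor = ray.get(_create_actor.remote())
print(ray.get(actor.ping.remote()))
```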
Issue Severity
Medium: I have a workaround