ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] Ray leaks IDLE workers, even after their jobs have finished, if an Actor it did not start is returned from a Task #44438

Open jfaust opened 3 months ago

jfaust commented 3 months ago

What happened + What you expected to happen

EDIT: I've substantially changed this description based on having found a repro.

We have a pipeline with a detached actor that is shared among a number of jobs. That actor was being returned to the Driver script, somewhat coincidentally.

When you return a handle to a detached Actor from a Task like that, Ray leaks an IDLE worker each time.

If I run the repro below 3 times in a row, I end up with two of these:

[screenshot: leaked IDLE workers]

Note that the worker has no ID (which I believe means it's about to exit), but it never exits. Looking at the python-core-worker log for any of those workers, you'll see:

reference_count.cc:54: This worker is still managing 1 objects, waiting for them to go out of scope before shutting down.
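
To find which workers are stuck like this, a quick scan of the per-worker logs works - a minimal sketch, assuming Ray's default log location /tmp/ray/session_latest/logs (adjust if you run with a custom temp dir):

import glob

# Search every python-core-worker log in the current session for the
# "still managing" message from reference_count.cc.
for path in glob.glob("/tmp/ray/session_latest/logs/python-core-worker-*.log"):
    with open(path) as log_file:
        for line in log_file:
            if "still managing" in line:
                print(path, line.strip())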

The worker never exits, even if the Actor is killed.
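
For reference, this is the kind of kill I mean - a minimal sketch that looks the detached actor up by the name used in the repro below and terminates it (the leaked worker still stays IDLE afterwards):

import ray

ray.init(namespace="test")

# Look up the detached actor by name within the namespace and terminate it.
actor = ray.get_actor("actor")
ray.kill(actor)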

Versions / Dependencies

Reproduction script

import ray
import numpy as np

ray.init(namespace="test")

# Detached, named actor that is shared across jobs (as described above).
@ray.remote(name="actor", get_if_exists=True, lifetime="detached")
class Actor:

    def do_work(self):
        arr = np.zeros((10000, 1000))
        np.multiply(arr, arr)

@ray.remote
def _toplevel():
    actor = Actor.remote()
    tasks = [actor.do_work.remote() for _ in range(0, 100)]
    ray.get(tasks)
    # Returning the detached actor's handle from this task is what leaks the IDLE worker.
    return actor

actor = ray.get(_toplevel.remote())

Issue Severity

Medium: I have a workaround

jfaust commented 3 months ago

Even with the Actor referenced above completely removed, I still get:

[2024-04-02 17:41:53,187 I 3724 3858] core_worker.cc:4197: Force exiting worker that owns object. This may cause other workers that depends on the object to lose it. Own objects: 1 # Pins in flight: 0
[2024-04-02 17:41:53,187 I 3724 3858] core_worker.cc:835: Exit signal received, this process will exit after all outstanding tasks have finished, exit_type=INTENDED_SYSTEM_EXIT, detail=Worker exits because it was idle (it doesn't have objects it owns while no task or actor has been scheduled) for a long time.
[2024-04-02 17:41:53,187 W 3724 3724] reference_count.cc:54: This worker is still managing 1 objects, waiting for them to go out of scope before shutting down.

And the worker never exits, so the Actor bit may be a red herring.

jfaust commented 3 months ago

A little more information from looking at different logs: it appears that the worker that never exits belongs to the very top-level task in our pipeline - a task that just submits a bunch of other tasks to run and then waits for their results.
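
For context, a minimal sketch of the shape of that top-level task (the names here are hypothetical; the real pipeline just fans out subtasks and blocks on their results):

import ray

ray.init()

@ray.remote
def _subtask(i):
    # Stand-in for the real pipeline work.
    return i * i

@ray.remote
def _pipeline_toplevel():
    # Submit a batch of subtasks and wait for all of them before returning.
    return sum(ray.get([_subtask.remote(i) for i in range(100)]))

print(ray.get(_pipeline_toplevel.remote()))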

jfaust commented 3 months ago

After compiling Ray from source and enabling debug logging in the core, I found a repro (added to the initial comment).

jfaust commented 3 months ago

It looks like the handle may not even need to be returned to the Driver - returning it from one task to another seems to reproduce the leak as well:

import ray
import numpy as np

ray.init(namespace="test")

@ray.remote(name="actor", get_if_exists=True, lifetime="detached")
class Actor:

    def do_work(self):
        arr = np.zeros((10000, 1000))
        np.multiply(arr, arr)

@ray.remote
def _create_actor():
    # The detached actor handle is returned from one task to another,
    # not to the driver - the leak still happens.
    return Actor.remote()

@ray.remote
def _toplevel():
    actor = ray.get(_create_actor.remote())
    tasks = [actor.do_work.remote() for _ in range(0, 100)]
    ray.get(tasks)

ray.get(_toplevel.remote())
bug-catcher commented 3 days ago

@jfaust what is the workaround? This is causing a lot of issues on our cluster.

jfaust commented 3 days ago

@jfaust what is the workaround? This is causing a lot of issues on our cluster.

@bug-catcher in my case it was easy to create the Actor up front, pass the handle around, and never return it from a task. That's not a workaround in many cases, but it worked for us.
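
A minimal sketch of that workaround, based on the repro above: the driver creates (or fetches) the detached actor and only ever passes the handle into tasks as an argument, so no task returns it.

import ray
import numpy as np

ray.init(namespace="test")

@ray.remote(name="actor", get_if_exists=True, lifetime="detached")
class Actor:
    def do_work(self):
        arr = np.zeros((10000, 1000))
        np.multiply(arr, arr)

@ray.remote
def _toplevel(actor):
    # The actor handle comes in as an argument and is never returned.
    ray.get([actor.do_work.remote() for _ in range(100)])

actor = Actor.remote()            # created (or fetched) by the driver
ray.get(_toplevel.remote(actor))  # pass the handle in, don't return it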