ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Object-store based function manager #18096

Open ericl opened 3 years ago

ericl commented 3 years ago

Currently, function and actor definitions are broadcast to all workers. This has a number of issues:

  1. CPU overhead on the GCS is O(n) with the size of the cluster when defining a function.
  2. Large functions are not supported.
  3. Function lifetime is not easily tracked (not ref-counted).

This has led to a number of user issues:

There is a short term mitigation to reduce the worst of these issues: https://github.com/ray-project/ray/issues/18095

However, longer term it makes sense to migrate the function manager to the object store. This involves a few things:

  1. Function objects should be owned by the GCS, to support detached actors using remote functions.
  2. We should treat function objects as an object dependency of the task prior to dispatch.
  3. The function execution code should be modified to get functions from the object store instead of via Redis.
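
The three steps above can be sketched with a toy, in-process model. This is not Ray's implementation: `ToyObjectStore`, `export_function`, `submit_task`, and the use of `marshal` to serialize code objects are all assumptions made for illustration. It only shows the shape of the idea: the function body is stored once under a content hash, a task carries the function ID as a dependency, and a reference count decides when the definition can be dropped.

```python
import builtins
import hashlib
import marshal
import types


class ToyObjectStore:
    """Hypothetical in-process stand-in for an object store with ref counts."""

    def __init__(self):
        self._objects = {}    # object ID -> serialized bytes
        self._refcounts = {}  # object ID -> number of live references

    def put(self, payload: bytes) -> str:
        # Content-addressed ID, so defining the same function twice stores it once.
        object_id = hashlib.sha256(payload).hexdigest()
        self._objects.setdefault(object_id, payload)
        self._refcounts[object_id] = self._refcounts.get(object_id, 0) + 1
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]

    def release(self, object_id: str) -> None:
        # Drop one reference and evict the payload when nobody holds it.
        self._refcounts[object_id] -= 1
        if self._refcounts[object_id] == 0:
            del self._refcounts[object_id]
            del self._objects[object_id]

    def __contains__(self, object_id: str) -> bool:
        return object_id in self._objects


def export_function(store: ToyObjectStore, func) -> str:
    # Only the code object is serialized; closures and globals are ignored here.
    return store.put(marshal.dumps(func.__code__))


def submit_task(store: ToyObjectStore, function_id: str, *args):
    # The function is resolved like any other task dependency before execution.
    code = marshal.loads(store.get(function_id))
    func = types.FunctionType(code, {"__builtins__": builtins})
    return func(*args)


def add(a, b):
    return a + b


store = ToyObjectStore()
function_id = export_function(store, add)
result = submit_task(store, function_id, 2, 3)
print(result)  # 5
store.release(function_id)  # last reference gone, so the definition is evicted
```

Nothing is broadcast in this model: a worker only pays for a function when a task that depends on it is dispatched, and the definition's lifetime is tied to its references.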

DonYum commented 3 years ago

I'm confused by #8822. When will this bug be fixed?

ericl commented 3 years ago

Let me re-open 8822, since it's looking like this solution may be too far out and we could do a shorter-term fix.

raulchen commented 3 years ago

As we are implementing RuntimeEnv, I think we should default to using RuntimeEnv to distribute code. The only exception is dynamically defined functions. Does this make sense? Also, by switching to RuntimeEnv for all static functions, I guess these issues will be greatly mitigated? cc @edoakes

ericl commented 3 years ago

These functions are dynamically defined during the execution of the program though, and typically there are many for a single runtime_env. I don't think unifying these mechanisms makes sense.

Btw, in Python there is really no such thing as a static function as far as Ray is concerned; everything is dynamically defined.

raulchen commented 3 years ago

I know that everything in Python is dynamic. By "static", I was referring to functions that don't capture any dynamic variables. If we distribute the Python code files with runtime env, we don't have to store the functions in the GCS or the object store. This will be more efficient and easier to maintain.

ericl commented 3 years ago

That's definitely not the case for Python though. Remote functions commonly closure-capture both global variables and local variables, and hence have to be defined at runtime.

raulchen commented 3 years ago

But __globals__ and __locals__ are usually not used. Take the following function, for example: it doesn't matter whether the function executed in the worker is pickled from the driver or loaded by the worker from a local file.

import ray

@ray.remote
def foo():
    return 1

ray.get(foo.remote())

edoakes commented 3 years ago

@raulchen this is a very trivial example. Users capture globals/locals all the time, especially if dynamically defining things:

refs = []
for i in range(10):
    @ray.remote
    def process_chunk():
        return my_lib.process(i)
    refs.append(process_chunk.remote())

It may be considered "bad practice," but this is a very common usage pattern and it would hamper usability if we disallowed it.

kfstorm commented 3 years ago

Do the objects survive GCS failover if we store functions in the object store and make GCS the owner?

ericl commented 3 years ago

Yes, I think we would be able to support lineage reconstruction in that case as well, if the GCS stored the lineage as well as metadata.

raulchen commented 3 years ago

@edoakes right. I was thinking that maybe we can detect whether a function uses a captured variable, and choose different ways to distribute the function.
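
One way to approximate that detection in plain Python is `inspect.getclosurevars`, which reports the nonlocal and global names a function body refers to. The helper name `captures_state` below is hypothetical, and the rule it applies (any nonlocal, or any global that is not an imported module, counts as captured state) is an assumption for illustration, not something Ray does.

```python
import inspect
import types


def captures_state(func) -> bool:
    """Return True if func reads nonlocal variables or non-module globals."""
    closure_vars = inspect.getclosurevars(func)
    # Imported modules can be re-imported on the worker, so they are ignored.
    captured_globals = {
        name: value
        for name, value in closure_vars.globals.items()
        if not isinstance(value, types.ModuleType)
    }
    return bool(closure_vars.nonlocals or captured_globals)


threshold = 10


def pure():
    return 1


def reads_global():
    return threshold + 1


def make_adder(n):
    def adder(x):
        return x + n  # n is captured from the enclosing make_adder call
    return adder


print(captures_state(pure))           # False
print(captures_state(reads_global))   # True
print(captures_state(make_adder(3)))  # True
```

A check like this is only a heuristic: names that are not yet bound at inspection time are reported as unbound rather than as globals, so a function could pass the check and still depend on state defined later.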

After thinking more about this, global variable support is still problematic in some cases, which will confuse users. For example:

import ray

global_var = 0

@ray.remote
def foo():
    return global_var

if __name__ == "__main__":
    print(ray.get(foo.remote()))  # Prints 0, as "global_var = 0" is captured in the remote function.
    global_var = 1
    print(ray.get(foo.remote()))  # Still prints 0, unless we re-export the function.

Also, changing a global variable in a remote function doesn't take effect either.
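
The snapshot behaviour can be reproduced in plain Python with no Ray involved. In this minimal sketch the hypothetical `export_snapshot` helper stands in for exporting a function: it rebuilds the function over a shallow copy of its globals, which is roughly what a worker sees after deserializing it. The helper and the variable names are assumptions for illustration only.

```python
import types

counter = 0


def read_counter():
    return counter


def bump_counter():
    global counter
    counter += 1
    return counter


def export_snapshot(func):
    # Hypothetical stand-in for export: rebuild the function over a copy of
    # its globals, so later driver-side changes are not visible to the copy.
    return types.FunctionType(func.__code__, dict(func.__globals__), func.__name__)


remote_read = export_snapshot(read_counter)
remote_bump = export_snapshot(bump_counter)

counter = 5  # driver-side update made after the export

print(remote_read())  # 0: the exported copy still sees the value at export time
print(remote_bump())  # 1: the increment applies only to the copy's own globals
print(counter)        # 5: the driver-side value is untouched by the remote bump
```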

I don't have a good solution at this moment. But I think, as long as the original code uses global variables, the user has to be careful when migrating the code to Ray.

marsupialtail commented 2 years ago

Have there been any updates on this? I am being hit with memory leaks on the Ray head node when I kill stuff, and I can't call ray stop manually.

smacpher commented 2 years ago

Hi, I've also been running into my Ray head node's memory slowly growing as I repeatedly run the same driver script (using the Ray Client) with different arguments. Are there any workarounds to clean up resources at the end of my Ray Client script? Thanks in advance for your time.

mohitjain2504 commented 2 months ago

@marsupialtail @smacpher Were you able to identify any workaround?