ericl opened this issue 3 years ago
I'm confused by #8822, so when will this bug be fixed?
Let me re-open #8822, since it's looking like this solution may be too far out and we could do a shorter-term fix.
As we are implementing RuntimeEnv, I think we should default to using RuntimeEnv to distribute code. The only exception is dynamically-defined functions. Does this make sense? Also, by switching to RuntimeEnv for all static functions, I guess these issues will be greatly mitigated? cc @edoakes
These functions are dynamically defined during the execution of the program though, and typically there are many for a single runtime_env. I don't think unifying these mechanisms makes sense.
Btw, in Python there is really no such thing as a static function as far as Ray is concerned; everything is dynamically defined.
I know that everything in Python is dynamic. By "static", I was referring to functions that don't capture any dynamic variables. If we distribute the Python code files with runtime_env, we don't have to store the functions in the GCS or the object store. This would be more efficient and easier to maintain.
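For illustration, a minimal sketch of what distributing code via runtime_env could look like; `working_dir` is the existing runtime_env field for shipping local files, while `my_module` and `compute` are hypothetical names:

```python
import ray

# Ship the local project directory to every worker. Workers can then import
# the code from the distributed files instead of receiving pickled function
# definitions through the GCS.
ray.init(runtime_env={"working_dir": "."})

@ray.remote
def run():
    import my_module              # hypothetical module shipped via working_dir
    return my_module.compute()    # hypothetical function defined in that module

print(ray.get(run.remote()))
```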
That's definitely not the case for Python though. Remote functions commonly closure-capture both global variables and local variables, and hence have to be defined at runtime.
But `__globals__` and `__locals__` are usually not used. Take the following function for example: it doesn't matter whether the function executed in the worker is pickled from the driver or loaded by the worker from a local file.
```python
import ray

@ray.remote
def foo():
    return 1

ray.get(foo.remote())
```
@raulchen this is a very trivial example. Users capture globals/locals all the time, especially if dynamically defining things:
```python
refs = []
for i in range(10):
    @ray.remote
    def process_chunk():
        return my_lib.process(i)  # references the loop variable i and the user's my_lib module

    refs.append(process_chunk.remote())
```
It may be considered "bad practice," but this is a very common usage pattern and it would hamper usability if we disallowed it.
Do the objects survive GCS failover if we store functions in the object store and make GCS the owner?
Yes, I think we would be able to support lineage reconstruction in that case as well, if the GCS stored the lineage as well as metadata.
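Purely as a sketch of the general direction being discussed here (not the actual function manager design): a function definition can be serialized, placed in the object store, and loaded in the worker on demand, e.g.:

```python
import ray
from ray import cloudpickle as pickle

ray.init()

def my_task():
    return 42

# Driver: serialize the function and put it in the object store instead of
# broadcasting it to all workers via the GCS key-value store.
func_ref = ray.put(pickle.dumps(my_task))

@ray.remote
def run_serialized(func_bytes):
    # Worker: deserialize the function definition only when it is needed.
    return pickle.loads(func_bytes)()

print(ray.get(run_serialized.remote(func_ref)))  # 42
```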
@edoakes right. I was thinking that maybe we can detect whether a function uses a captured variable, and choose different ways to distribute the function.
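A rough, purely illustrative sketch of such a detection heuristic (this is not how Ray's function manager works; it only uses standard code-object introspection):

```python
import builtins
import types

def captures_state(func):
    """Heuristic: does `func` reference closure variables or non-module globals?"""
    code = func.__code__
    if code.co_freevars:                      # closure-captured local variables
        return True
    for name in code.co_names:                # names resolved at call time
        if hasattr(builtins, name):
            continue                          # builtins are always available
        value = func.__globals__.get(name)
        if value is not None and not isinstance(value, types.ModuleType):
            return True                       # captured global that isn't just a module
    return False
```

Functions for which such a check returns False could in principle be loaded from files shipped via runtime_env, while the rest would still be pickled and exported as today.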
After thinking more about this, global variable support is still problematic in some cases, which will confuse users. For example:
```python
global_var = 0

@ray.remote
def foo():
    return global_var

if __name__ == "__main__":
    print(ray.get(foo.remote()))  # Prints 0, as "global_var = 0" is captured in the remote function.
    global_var = 1
    print(ray.get(foo.remote()))  # This still prints 0, unless we re-export the function.
```
Also, changing a global variable in a remote function doesn't take effect either.
I don't have a good solution at the moment. But I think, as long as the original code uses global variables, the user has to be careful when migrating the code to Ray.
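One common way to avoid this pitfall (a sketch, not something decided in this thread) is to pass the value explicitly so it is resolved at call time rather than baked in when the function is exported:

```python
import ray

ray.init()

@ray.remote
def foo(value):
    return value

global_var = 0
print(ray.get(foo.remote(global_var)))  # 0
global_var = 1
print(ray.get(foo.remote(global_var)))  # 1, because the value is passed per call
```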
Have there been any updates on this? I'm being hit with memory leaks on the Ray head node when I kill stuff, and I can't call ray stop manually.
Hi, I've also been running into my Ray head node's memory slowly growing over time as I run the same driver script (using the Ray Client) with different arguments. Are there any workarounds to clean up resources at the end of my Ray Client script? Thanks in advance for your time.
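Not a confirmed fix for the growth described above, but one thing worth trying is disconnecting explicitly at the end of each Ray Client run so the server-side session can be torn down (the address below is a placeholder):

```python
import ray

ray.init("ray://head-node:10001")  # placeholder Ray Client address
try:
    pass  # ... run the driver workload here ...
finally:
    ray.shutdown()  # disconnect the client session when the script ends
```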
@marsupialtail @smacpher Were you able to identify any work around?
Currently, function and actor definitions are broadcast to all workers. This has a number of issues:
This has led to a number of user issues:
There is a short term mitigation to reduce the worst of these issues: https://github.com/ray-project/ray/issues/18095
However, longer term it makes sense to migrate the function manager to the object store. This involves a couple of things: