ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
32.95k stars 5.58k forks source link

[Core][Actors] Duplicate named actor exception should not be lazy if possible #44906

Open glindstr opened 4 months ago

glindstr commented 4 months ago

What happened + What you expected to happen

Ray enforces the uniqueness of named actors within a namespace which is a great feature. I ran into an unexpected issue where an exception for creating a duplicate named actor doesn't present itself until the duplicate actor is interacted with. Is it feasible to make the exception thrown when creating a duplicate named remote actor?

Versions / Dependencies

Ray 2.10.0

Reproduction script

import ray
import time

@ray.remote
class TestActor:       
    def do_work(self):
        time.sleep(5)

ray.init(address="...")

name = "test"
namespace = 'namespace'

actor1 = TestActor.options(name=name, namespace=namespace, lifetime='detached').remote()
actor2 = TestActor.options(name=name, namespace=namespace, lifetime='detached').remote()

time.sleep(10)

actor1.do_work.remote()
actor2.do_work.remote() #Exception: Actor name already taken

Issue Severity

Low: It annoys or frustrates me.

glindstr commented 4 months ago

I see now that ray.actor.ActorClass.options has "get_if_exists" parameter which makes this potential bug a non-issue:

actor1 = TestActor.options(name=name, namespace=namespace, lifetime='detached', get_if_exists=True).remote()

Ideally this parameter is added to the api documentation:

https://docs.ray.io/en/latest/ray-core/api/doc/ray.actor.ActorClass.options.html#ray.actor.ActorClass.options

karta1502545 commented 3 months ago

Hi, I would like to take on this issue. I'm new to this project and open source contributions, but I'm eager to learn and contribute. Thanks!

prithvi081099 commented 1 month ago

Hi, I want to contribute to this issue and wanted to know if this issue is already fixed or not. I tried recreating the issue but it is working properly as expected, as shown below

image

and the code for this fixing, monitoring whether the name is already present in namespace also exists, as shown below(the code is present in master and release/2.10.0)

image

So wanted to confirm if the issue needed a fix or is already fixed and closed

glindstr commented 1 month ago

Interesting. I tried recreating the issue as well now on 2.32.0 and it appears resolved. That being said I've recently run into some sort of race condition bug where get_if_exists=True can fail. I will see if I can reproduce it. If so I'll create a new issue and will link to it from here.

glindstr commented 1 month ago

@prithvi081099 It caught my attention in your post the code snippet is from version 2.10.0 and I found the bug on version 2.10.0. I found the bug still exists but there is an additional requirement. The bug pops up when the head doesn't have any cpu resources and the actor requires a cpu resource.

Start ray locally without cpu resources: ray start --head --num-cpus=0

import ray
import time

@ray.remote
class TestActor:       
    def do_work(self):
        time.sleep(5)

ray.init()

name = "test"
namespace = 'namespace'

actor1 = TestActor.options(name=name, namespace=namespace, lifetime='detached').remote()
actor2 = TestActor.options(name=name, namespace=namespace, lifetime='detached').remote()

time.sleep(10)

actor1.do_work.remote()
actor2.do_work.remote() #Exception: Actor name already taken

Running a local head with no cpu resources seems contrived but it is the simplest code to reproduce the error. In production the autoscaler is spinning up very large pools of workers.

Were you able to reproduce the bug?

prithvi081099 commented 1 month ago

Let me try reproducing the error. I will have updates once done