Open hongpeng-guo opened 2 months ago
import ray
from typing import Callable


@ray.remote(num_cpus=1)
class TestClass:
    def execute(self, fn: Callable[..., None]) -> None:
        return fn()

    def exit(self):
        ray.actor.exit_actor()


def dummy_func():
    import time
    print(100)
    time.sleep(100)


actors = [TestClass.remote() for _ in range(10)]
ray.get([actor.__ray_ready__.remote() for actor in actors])

# Make the first two actors exit.
refs = []
for actor in actors[:2]:
    refs.append(actor.exit.remote())
    # ray.kill(actor)
try:
    ray.get(refs)
except Exception:
    # The exit calls themselves may raise; that is not what is being tested.
    pass

# Submit to all ten actors, including the two that exited.
tasks = [actor.execute.remote(dummy_func) for actor in actors]
# Reported behavior: this does not raise until the healthy actors finish.
ray.get(tasks)
The main issue is that if an actor has crashed and you call ray.get on refs from crashed and healthy actors together, ray.get doesn't raise an exception until the healthy actors' tasks have finished. We can easily get around this in the Train layer (by using ray.wait), so I am unassigning myself now that it is mitigated.
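A minimal sketch of what the ray.wait workaround mentioned above could look like. It is not the actual Train-layer change. The helper name `get_fail_fast` and its parameters `poll_timeout_s`, `wait_fn` and `get_fn` are hypothetical, and it assumes that refs belonging to a dead actor are reported as ready by ray.wait, so that fetching them one at a time surfaces the error without waiting for the healthy actors.

```python
from typing import Any, Callable, List, Optional, Sequence


def get_fail_fast(
    refs: Sequence[Any],
    poll_timeout_s: float = 1.0,
    wait_fn: Optional[Callable[..., Any]] = None,
    get_fn: Optional[Callable[[Any], Any]] = None,
) -> List[Any]:
    """Fetch results for `refs`, raising as soon as any single ref has failed.

    Hypothetical helper: `wait_fn` and `get_fn` default to ray.wait and ray.get
    and exist only so the loop can be exercised with stand-ins in a unit test.
    """
    if wait_fn is None or get_fn is None:
        import ray  # imported lazily so stand-ins work without a Ray install

        wait_fn = wait_fn or ray.wait
        get_fn = get_fn or ray.get

    results = {}
    pending = list(refs)
    while pending:
        # ray.wait returns (ready, not_ready). Assumption: a ref whose actor
        # has died shows up in `ready` instead of staying pending.
        ready, pending = wait_fn(pending, num_returns=1, timeout=poll_timeout_s)
        for ref in ready:
            # Fetching a single ready ref raises its error right away,
            # independent of the refs that are still running.
            results[ref] = get_fn(ref)
    # Return results in the same order as the input refs.
    return [results[ref] for ref in refs]
```

In the repro script, `get_fail_fast(tasks)` would stand in for the final `ray.get(tasks)`. Under the assumption above, the call should raise for the two exited actors while the other eight are still sleeping, and the caller can catch `ray.exceptions.RayActorError` to handle it.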
See here for a more consistent reproduction: https://github.com/anyscale/runtime/pull/929/files#diff-1913713e052df41064554b30df8e0f47abef67dee769c9de607e7728ef2e4d40R397
It is not a blocker, but let's fix this soon. The semantics are very bad for fault-tolerant cases.
Expected behavior: any pending tasks of the crashed actor raise ActorDiedError on ray.get(obj) or ray.get([obj, other_objs]). It should not hang.
What happened + What you expected to happen
We recently hit a problem when running functions on a dead actor (its underlying node had been killed). We expected an exception to be raised immediately, but the code appears to hang instead. The behavior is somewhat flaky, but it hangs in most cases. This may be a regression, since the same code raised as expected in early July.
Mini repro appended below.
Versions / Dependencies
2.34
Reproduction script
Issue Severity
High: It blocks me from completing my task.