Open rynewang opened 6 months ago
Another bug found while implementing the ActorUnavailableError. The good news is that it's not blocking that PR. cc @jjyao @stephanie-wang
Guy, this is not a bug
import sys
import time

import ray


@ray.remote(max_restarts=1, max_task_retries=1)
class A:
    def __init__(self):
        print("initing")
        # Slow startup keeps the actor in the restarting state for about 2 s.
        time.sleep(2)

    def kill(self):
        # Exit the actor process so that Ray restarts it.
        sys.exit(-1)

    def ping(self, msg):
        return f"hello {msg}"


a = A.remote()
print(ray.get(a.ping.remote("ok")))
print(a.kill.options(max_task_retries=0).remote())
# raises RayActorError
# print(ray.get(a.ping.remote("no retries")))
# raises RayActorError
print(ray.get(a.ping.options(max_task_retries=1).remote("task retries")))
# ok
# print(ray.get(a.ping.options(max_task_retries=-1).remote("task retries")))
# max_task_retries = 3 raises, = 4 is ok
What happened + What you expected to happen
According to https://github.com/ray-project/ray/pull/22818 we have this semantics:
However, on nightly, when the actor is restarting and max_task_retries != 0, the task still fails fast, consumes one retry, and is then retried. If max_task_retries is -1 or large enough, the task eventually runs; but if it is small, say 1, we still get a RayActorError saying the actor is restarting.
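To make the retry accounting concrete, here is a toy model of the behavior described above. It is not Ray's implementation: `run_task` and `fast_fails_while_restarting` are hypothetical names, and the value 4 is an assumption read off the "max_task_retries = 3 raises, = 4 is ok" observation in the script.

```python
def run_task(max_task_retries: int, fast_fails_while_restarting: int) -> str:
    """Toy model (not Ray code): each attempt made while the actor is
    restarting fails fast and uses up one retry."""
    # -1 means unlimited retries; otherwise one initial attempt plus the retries.
    if max_task_retries == -1:
        attempts_allowed = float("inf")
    else:
        attempts_allowed = 1 + max_task_retries

    attempt = 0
    while attempt < attempts_allowed:
        attempt += 1
        # Assumption: the first N attempts land while the actor is restarting.
        if attempt > fast_fails_while_restarting:
            return f"ok after {attempt} attempt(s)"
    raise RuntimeError("actor is restarting (retries exhausted)")


# Assumed: 4 attempts fail fast before the restarted actor is ready.
print(run_task(max_task_retries=4, fast_fails_while_restarting=4))
print(run_task(max_task_retries=-1, fast_fails_while_restarting=4))
```

Under this model, any finite max_task_retries smaller than the number of fast failures raises, which matches the small-retry case reported here.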
Versions / Dependencies
master
Reproduction script
Issue Severity
Medium: It is a significant difficulty but I can work around it.
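One possible workaround, sketched here as an assumption rather than anything this report prescribes, is to retry on the caller side with a short delay so that attempts are spaced out while the actor restarts. `call_with_retries` is a hypothetical helper, and the attempt count and delay are arbitrary.

```python
import time
from typing import Callable, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    retry_on: Type[BaseException],
    attempts: int = 5,
    delay_s: float = 0.5,
) -> T:
    """Call fn(), retrying on retry_on with a fixed delay between attempts."""
    for i in range(attempts):
        try:
            return fn()
        except retry_on:
            # Re-raise once the last allowed attempt has failed.
            if i == attempts - 1:
                raise
            time.sleep(delay_s)
    raise ValueError("attempts must be >= 1")
```

With the actor from the reproduction script, this might be used as `call_with_retries(lambda: ray.get(a.ping.remote("hi")), ray.exceptions.RayActorError)`; whether the delay is long enough depends on how long the restart takes.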