ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.84k stars 5.75k forks source link

[core] When actor restarting and max_task_retries != 0, task fast failed and consumed retry counts. #44719

Open rynewang opened 6 months ago

rynewang commented 6 months ago

What happened + What you expected to happen

According to https://github.com/ray-project/ray/pull/22818 we have this semantics:

To solve this issue, we change the semantics of max_task_retries.

  • When max_task_retries is 0 (which is the default value), if the callee actor is in the RESTARTING state, subsequently submitted tasks will fail immediately with a RayActorError. Users can catch the RayActorError and implement their own fallback strategies to improve service availability and mitigate service outages.
  • When max_task_retries is not 0, subsequently submitted tasks will be queued on the caller side and we only send them to the callee when the callee actor is back to the ALIVE state.

However on nightly, when actor is restarting and max_task_retries != 0, the task still fast fails and eats 1 retry count, and then retry. If the retry is -1 or large enough it can eventually run; but if the retry is small, say 1, we still get the RayActorError saying the actor is in restarting.

Versions / Dependencies

master

Reproduction script

import ray,time,sys
print(ray.__commit__) # last night's nightly
ray.init()

@ray.remote(max_restarts=1)
class A:
    def __init__(self):
        print(f'initing')
        time.sleep(2)

    def kill(self):
        sys.exit(-1)

    def ping(self, msg):
        return f"hello {msg}"

a = A.remote()
print(ray.get(a.ping.remote("ok")))
print(a.kill.remote())

# raises RayActorError
print(ray.get(a.ping.remote("no retries")))
# raises RayActorError
print(ray.get(a.ping.options(max_task_retries=1).remote("task retries")))
# ok
print(ray.get(a.ping.options(max_task_retries=-1).remote("task retries")))
# max_task_retries = 3 raises, = 4 is ok

Issue Severity

Medium: It is a significant difficulty but I can work around it.

rynewang commented 6 months ago

Another bug found when implementing the ActorUnavailableError. Good new is it's not blocking that PR. cc @jjyao @stephanie-wang

982945902 commented 6 months ago

Guy, this is not a bug

import sys
import time

@ray.remote(max_restarts=1,max_task_retries=1)
class A:
    def __init__(self):
        print(f'initing')
        time.sleep(2)

    def kill(self):
        sys.exit(-1)

    def ping(self, msg):
        return f"hello {msg}"

a = A.remote()
print(ray.get(a.ping.remote("ok")))
print(a.kill.options(max_task_retries=0).remote())

# raises RayActorError
# print(ray.get(a.ping.remote("no retries")))
# raises RayActorError
print(ray.get(a.ping.options(max_task_retries=1).remote("task retries")))
# ok
# print(ray.get(a.ping.options(max_task_retries=-1).remote("task retries")))
# max_task_retries = 3 raises, = 4 is ok