ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[core] Check if a ray task has errored without calling `ray.get` on it #45229

Open justinvyu opened 5 months ago

justinvyu commented 5 months ago

Description

Goal: Given a list of Ray remote task futures, I want to check whether each one has errored without calling `ray.get` on every element individually.

This feature is offered by similar async execution APIs:
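As an illustration of the kind of API being asked for, Python's standard-library `concurrent.futures` exposes `Future.exception()`, which reports a finished task's error without raising it. The sketch below is only an example of the pattern, not necessarily one of the APIs the author had in mind; the helper `collect_errors` and the toy tasks `ok` and `boom` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def ok():
    return 42


def boom():
    raise ValueError("task failed")


def collect_errors(futures, timeout=None):
    """Return the exceptions of finished futures without re-raising them."""
    done, _not_done = wait(futures, timeout=timeout)
    errors = []
    for fut in done:
        # exception() returns None when the call succeeded and the
        # exception object (not raised) when it failed.
        exc = fut.exception()
        if exc is not None:
            errors.append(exc)
    return errors


with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(ok), pool.submit(boom)]
    errors = collect_errors(futures)

print(errors)
```

The request in this issue is for an equivalent on Ray object refs, so that error status can be inspected without fetching the result itself.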

Current workaround

We have a "check for failure" function in Ray Train, which may incur some unnecessary overhead to fetch objects: https://github.com/ray-project/ray/blob/fa61109f3fd26c543ad9a36794c8a478bc0a7113/python/ray/train/_internal/utils.py#L49-L58
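The general shape of such a workaround can be sketched as follows. This is a minimal illustration, not the Ray Train function at the link above: `partition_by_outcome` and `fake_fetch` are hypothetical names, and the fetch function is passed in as a parameter so the sketch runs without a cluster. With Ray, one would presumably pass `ray.get` as `fetch` and the ready list returned by `ray.wait` as `ready`.

```python
from typing import Any, Callable, Iterable, List, Tuple


def partition_by_outcome(
    ready: Iterable[Any],
    fetch: Callable[[Any], Any],
) -> Tuple[List[Tuple[Any, Any]], List[Tuple[Any, Exception]]]:
    """Split finished refs into (ref, value) and (ref, exception) pairs.

    `fetch` is expected to return the result of a ref or raise the error
    the task failed with, which is how `ray.get` behaves.
    """
    succeeded, errored = [], []
    for ref in ready:
        try:
            # The full result is fetched even if only the status is needed.
            succeeded.append((ref, fetch(ref)))
        except Exception as exc:
            errored.append((ref, exc))
    return succeeded, errored


# Stand-in for ray.get so the example is self-contained (hypothetical).
def fake_fetch(ref):
    if ref == "bad":
        raise RuntimeError("task failed")
    return ref.upper()


succeeded, errored = partition_by_outcome(["a", "bad"], fake_fetch)
print(succeeded, errored)
```

The `fetch` call inside the loop is where the overhead comes from: the result of every successful task is pulled to the caller just to learn that it did not fail.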

Use case

I am implementing a control loop where I want to check on the status of some actor tasks every N seconds. I want to know if these actor tasks have failed as soon as possible so I can trigger some error handling. This involves me running an "error check" in a loop with a small amount of sleep time:

import ray

# `tasks` is the list of ObjectRefs returned by the actor task calls.
while True:
    # Poll briefly; `ready` holds every task that has finished, successfully or not.
    ready, remaining = ray.wait(tasks, num_returns=len(tasks), timeout=0.01)

    # I want to be able to collect errored tasks without calling ray.get.
    # I want to distinguish successful tasks from errored tasks in the output of ray.wait.
    errors = []
    for task in ready:
        try:
            ray.get(task)
        except Exception as e:
            errors.append(e)

cc: @jjyao @rkooo567

rynewang commented 1 week ago

Currently an object does not distinguish between a good result and an errored result. If we are going to do this, we need an extra marker on the Wait request reply, e.g. for ProcessWaitRequestMessage. But it looks like the Wait backend is still ObjectManagerService.Pull, which does not say whether it's a Wait or a Get. @rkooo567 do Wait calls make the raylet receive data?