Open tianyicui-tsy opened 2 years ago
Just saw this was pushed back to Ray 2.3. Is it possible to shed some light on the rationale behind the decision?
I'm not sure how often people run into this issue, but I ran into it the first time I tried to cancel some tasks in Ray. And if my description of the bug is accurate, I'd argue the severity can be quite high: workers that aren't released even after the Ray job stops.
Thanks and appreciate the quick triage of my issue!
@jjyao will do some initial investigation
Hi @tianyicui-tsy, we are working on several fixes for this. In the meantime, you can work around this by adding a try-except block in the `fail()` task that returns the exception instead of raising it.
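A minimal sketch of that workaround, assuming `fail()` is an ordinary `@ray.remote` task that raises (the exception type here is made up for illustration):

```python
import ray

@ray.remote
def fail():
    try:
        raise ValueError("simulated failure")  # whatever the real task does
    except Exception as e:
        # Return the exception instead of raising it; callers then receive it
        # as a normal return value rather than a task failure.
        return e
```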
Thank you @vitsai, it's great to hear that fixes are being worked on. Really appreciate that. I just want to point out that while your suggested workaround works for this simplified example, it's not really feasible in our production environment. In our production code the tasks that raise exceptions are much more complicated than `fail()`. It would be infeasible to change all of them to return exceptions; for example, we'd need to change every place that uses the results of these tasks to first check whether an exception was returned.
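To illustrate that burden with a hypothetical call site (not our actual code), every consumer of such a task's result would need something like:

```python
# Continuing the `fail()` sketch from the earlier comment: with the workaround,
# exceptions arrive as return values, so each call site needs an explicit check.
result = ray.get(fail.remote())
if isinstance(result, Exception):
    raise result  # or handle the error some other way
# ... otherwise continue using `result` as before ...
```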
What happened + What you expected to happen
When task `work` depends on `fail` (which immediately raises an exception) and `inf_loop` (which runs an infinite loop), canceling `work` doesn't recursively cancel `inf_loop`. The result is a worker that cannot be cancelled or killed. What's more, shutting down the Ray job doesn't seem to stop the `inf_loop` worker either, so there is no way to stop it (in practice it could be an arbitrarily long-running task) and reclaim its resources. So I believe it's accurate to describe its state as leaked.

Versions / Dependencies
I tried both Ray 2.1.0 and nightly. Python 3.10.6, Ubuntu 22.04.
Reproduction script
If I run a ray cluster locally with `ray start --head`, and run the script below with `RAY_ADDRESS=auto python test.py`, I see that even after the script finishes (i.e. after the job shutdown), the `inf_loop` worker is still running and there is one less available CPU according to `ray.available_resources` from another job.
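The original reproduction script isn't included here; the following is a minimal sketch reconstructed from the description above (the task names `work`, `fail`, and `inf_loop` come from the issue, the task bodies and exception are assumed):

```python
import time

import ray

# With RAY_ADDRESS=auto set, this connects to the locally started cluster.
ray.init()

@ray.remote
def fail():
    # Raises immediately.
    raise ValueError("boom")

@ray.remote
def inf_loop():
    # Runs forever; this is the worker that ends up leaked.
    while True:
        time.sleep(1)

@ray.remote
def work(a, b):
    return (a, b)

# `work` depends on both `fail` and `inf_loop`.
ref = work.remote(fail.remote(), inf_loop.remote())

try:
    ray.get(ref)
except Exception:
    # `work` fails because `fail` raised; try to cancel it recursively.
    ray.cancel(ref, recursive=True)

# Expected: `inf_loop` is cancelled along with `work`.
# Observed: its worker keeps running even after this script (the Ray job) exits,
# and ray.available_resources() from another job shows one fewer CPU.
```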
Issue Severity
High: It blocks me from completing my task.