ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] Long-running cluster ends up with lots of JobSupervisor actors in the DEPENDENCIES_UNREADY state #48452

Open jfaust-fy opened 3 weeks ago

jfaust-fy commented 3 weeks ago

What happened + What you expected to happen

We have some long-running Ray clusters that we submit jobs to via the Job Submission API. Occasionally, we see jobs start to take longer than they should, with the overhead being in job submission itself (normally submission takes a few seconds; in these cases it starts to take 30s+). If I look at the Actors list in these situations, I see the following:

[screenshot of the Actors list in the Ray dashboard]

Note the 76 JobSupervisor actors in the DEPENDENCIES_UNREADY state. In this screenshot the cluster has been running for 1 month.
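For context, here is roughly how we submit jobs; the dashboard address, entrypoint, and runtime_env below are placeholders rather than our real values:

```python
from ray.job_submission import JobSubmissionClient

# Placeholder address and entrypoint -- our real submissions go through this same API.
client = JobSubmissionClient("http://<head-node>:8265")
job_id = client.submit_job(
    entrypoint="python my_job.py",
    runtime_env={"working_dir": "./"},
)
print(job_id)
```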

Versions / Dependencies

Ray 2.24
Python 3.11
Ubuntu 20.04

Reproduction script

I'm not sure how to create a repro script for this - the cluster it just happened on had been running for 1 month.
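While I can't reproduce it on demand, this is roughly how I'd check for the symptom on an affected cluster. It's a sketch using Ray's state API; the filter values are what I'd expect based on the dashboard's Actors view, not something I've verified programmatically:

```python
from ray.util.state import list_actors

# Sketch: count JobSupervisor actors sitting in DEPENDENCIES_UNREADY.
# The filter keys/values mirror what the dashboard shows; adjust if the
# state API names them differently.
stuck = list_actors(
    filters=[
        ("class_name", "=", "JobSupervisor"),
        ("state", "=", "DEPENDENCIES_UNREADY"),
    ],
    limit=1000,
)
print(f"{len(stuck)} JobSupervisor actors in DEPENDENCIES_UNREADY")
```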

Issue Severity

Medium: It is a significant difficulty but I can work around it.

jjyao commented 1 week ago

Hi @jfaust-fy, so it's not that the JobSupervisor is stuck in the DEPENDENCIES_UNREADY state, right? Eventually it will become RUNNING (after 30s+)?

jfaust-fy commented 1 week ago

@jjyao I don't have a cluster in this state to check on, but I don't believe so. In this screenshot there were 76 JobSupervisors in that state, but we only submit a maximum of 7 or 8 jobs in parallel, so most of those actors can't just be briefly waiting on a newly submitted job.
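If it helps, this is a sketch of how I'd cross-check the actor count against the jobs the cluster actually knows about (the dashboard address is a placeholder):

```python
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://<head-node>:8265")

# Sketch: list all submitted jobs and their statuses. With only 7-8 jobs
# running in parallel, far fewer than 76 should be in a pre-RUNNING state.
for job in client.list_jobs():
    print(job.submission_id, job.status)
```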