jfaust-fy opened this issue 3 weeks ago
Hi @jfaust-fy, so it's not that the JobSupervisor is stuck in the DEPENDENCIES_UNREADY state, right? Eventually it will become RUNNING (after 30s+)?
@jjyao I don't have a cluster in this state to check on, but I don't believe so. In this screenshot there were 76 JobSupervisors in that state, but we only submit a maximum of 7 or 8 jobs in parallel.
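Next time a cluster gets into this state, I can check whether those actors ever transition out of DEPENDENCIES_UNREADY with something along these lines (a sketch against the Ray state API; the filter keys, poll interval, and limit are assumptions on my part, and it would need to run somewhere that can reach the cluster):

```python
# Sketch: poll the Ray state API to see whether the JobSupervisor actors
# ever leave DEPENDENCIES_UNREADY or stay stuck there.
# Assumes the ray.util.state API (Ray 2.x); the filter keys and the 30s
# poll interval are assumptions, not values from the original report.
import time

from ray.util.state import list_actors

for _ in range(10):
    supervisors = list_actors(
        filters=[("class_name", "=", "JobSupervisor")],
        limit=1000,
    )
    unready = [a for a in supervisors if a.state == "DEPENDENCIES_UNREADY"]
    print(f"{len(unready)}/{len(supervisors)} JobSupervisor actors in DEPENDENCIES_UNREADY")
    time.sleep(30)
```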
What happened + What you expected to happen
We have some long-running Ray clusters that we submit jobs to via the Job Submission API. Occasionally, we see jobs start to take longer than they should, with the overhead being in job submission itself (normally submission takes a few seconds; in these cases it starts to take 30s+). If I look at the Actors list in these situations, I see 76 JobSupervisor actors in the DEPENDENCIES_UNREADY state. In this screenshot the cluster has been running for 1 month.
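For reference, our submission path is just the standard Job Submission API client, roughly like this (the address, entrypoint, and runtime_env here are placeholders, not our real values):

```python
# Sketch of our submission path using the standard Ray Job Submission API
# client. The address, entrypoint, and runtime_env are placeholders.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://head-node:8265")  # hypothetical head node address

submission_id = client.submit_job(
    entrypoint="python my_job.py",              # hypothetical entrypoint
    runtime_env={"working_dir": "./my_job"},    # hypothetical runtime_env
)
print(submission_id)
```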
Versions / Dependencies
Ray 2.24
Python 3.11
Ubuntu 20.04
Reproduction script
I'm not sure how to create a repro script for this; the cluster it just happened on had been running for 1 month.
Issue Severity
Medium: It is a significant difficulty but I can work around it.