skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[Core] Ray job refused to submit jobs in PENDING status #4260

Open Michaelvll opened 2 weeks ago

Michaelvll commented 2 weeks ago

A user reported that after submitting ~1000 jobs to a cluster with ~100 nodes, 4 jobs remain in the PENDING state at the end, while all other jobs have reached terminal states.

In the output of `ray job list`, the most recent job submitted with `ray job submit` is in the PENDING state, even though `ray status` shows that all CPUs/GPUs are available. In other words, Ray does not start the PENDING job despite the free resources.
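To make the symptom easier to check, here is a minimal sketch that flags jobs left in PENDING while nothing is RUNNING. It assumes the job records have already been collected as `(job_id, status)` pairs, for example by parsing `ray job list` output. The helper name `find_stuck_pending` and the record shape are hypothetical and not part of SkyPilot or Ray; only the status strings follow Ray's job status names.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def find_stuck_pending(jobs: Iterable[Tuple[str, str]]) -> List[str]:
    """Return IDs of PENDING jobs when no job is RUNNING.

    `jobs` is an iterable of (job_id, status) pairs. This record shape is
    an assumption made for illustration; status strings follow Ray's job
    status names (PENDING, RUNNING, SUCCEEDED, FAILED, STOPPED).
    """
    jobs = list(jobs)
    counts = Counter(status for _, status in jobs)
    # If something is still running, PENDING jobs may simply be queued
    # behind it, so they are not reported as stuck.
    if counts["RUNNING"] > 0:
        return []
    return [job_id for job_id, status in jobs if status == "PENDING"]


# Hypothetical records mirroring the report: finished jobs plus one PENDING.
example_jobs = [
    ("job-1", "SUCCEEDED"),
    ("job-2", "FAILED"),
    ("job-3", "PENDING"),
]
stuck = find_stuck_pending(example_jobs)
```

This check only looks at job states. Confirming that resources are actually free, as in the report, still requires comparing against `ray status`.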

Version & Commit info: