SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
A user reported that after submitting ~1000 jobs to a cluster with ~100 nodes, 4 jobs remained in the PENDING state at the end, while all other jobs had reached terminal states.
According to `ray job list`, the most recently submitted job (via `ray job submit`) appears to be in the PENDING state even though `ray status` shows all CPUs/GPUs as available; that is, Ray does not start the job that is in the PENDING state.
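The symptom described here (jobs left in PENDING while nothing is running and every resource is free) can be expressed as a small check. This is only a sketch for triage, not SkyPilot's or Ray's own logic: `find_stuck_pending` and its arguments are hypothetical names, and the job statuses and resource counts are assumed to have been collected by hand from the output of `ray job list` and `ray status`.

```python
def find_stuck_pending(job_statuses, total_resources, available_resources):
    """Return the IDs of PENDING jobs when the cluster looks fully idle.

    job_statuses:        dict of job ID -> status string (e.g. "PENDING").
    total_resources:     dict of resource name -> total amount in the cluster.
    available_resources: dict of resource name -> amount currently free.

    All three inputs are assumed to be gathered manually; this helper is
    hypothetical and calls no Ray or SkyPilot API.
    """
    pending = [job_id for job_id, status in job_statuses.items()
               if status == "PENDING"]
    any_running = any(status == "RUNNING" for status in job_statuses.values())

    # The cluster counts as idle only if nothing is running and every
    # resource is entirely free.
    all_free = all(available_resources.get(name, 0) >= amount
                   for name, amount in total_resources.items())

    if any_running or not all_free:
        # PENDING jobs may simply be waiting for resources; not suspicious.
        return []
    return sorted(pending)


statuses = {"job-1": "SUCCEEDED", "job-2": "FAILED", "job-3": "PENDING"}
stuck = find_stuck_pending(statuses,
                           total_resources={"CPU": 8, "GPU": 1},
                           available_resources={"CPU": 8, "GPU": 1})
print(stuck)
```

An empty result means the PENDING jobs could still be legitimately queued behind running work; a non-empty result matches the situation reported in this issue.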
Version & Commit info:
* `sky -v`: PLEASE_FILL_IN
* `sky -c`: PLEASE_FILL_IN