SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
A user reported that they met an indeterministic error on a managed job and wanted to log into the cluster to debug. We can offer a way to avoid the job cluster being terminated after user program failure for a user-specified duration, e.g.
sky jobs launch --debug-grace-period-on-errors 30 task.yaml
Only terminate the cluster after 30 mins when an error is triggered by the job.
A user reported that they met an indeterministic error on a managed job and wanted to log into the cluster to debug. We can offer a way to avoid the job cluster being terminated after user program failure for a user-specified duration, e.g.
Only terminate the cluster after 30 mins when an error is triggered by the job.
Version & Commit info:
sky -v
: PLEASE_FILL_INsky -c
: PLEASE_FILL_IN