skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[Jobs] A way to keep the managed job for a while after user program failure #4245

Open Michaelvll opened 2 hours ago

Michaelvll commented 2 hours ago

A user reported hitting a nondeterministic error in a managed job and wanted to log into the cluster to debug it. We could offer a way to keep the job cluster alive for a user-specified duration after a user program failure, instead of terminating it immediately, e.g.:

sky jobs launch --debug-grace-period-on-errors 30 task.yaml

That is, when the job triggers an error, the cluster is terminated only after 30 minutes.
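To make the intended behavior concrete, here is a minimal Python sketch of the teardown decision. It is not SkyPilot code: `teardown_after_grace_period`, its parameters, and the `terminate` callback are hypothetical names used only for illustration, and the flag above is itself only a proposal.

```python
import time
from typing import Callable


def teardown_after_grace_period(
    job_failed: bool,
    grace_minutes: float,
    terminate: Callable[[], None],
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Terminate the cluster, delaying only if the user program failed.

    Hypothetical helper, not part of SkyPilot. Returns the delay in
    seconds that was applied before termination.
    """
    # Only a failed job with a positive grace period delays the teardown;
    # successful jobs are cleaned up immediately, as they are today.
    delay_seconds = grace_minutes * 60 if job_failed and grace_minutes > 0 else 0
    if delay_seconds:
        # The cluster stays up during this window so the user can log in.
        sleep(delay_seconds)
    terminate()
    return delay_seconds
```

With `grace_minutes=30`, a failed job would keep its cluster for 1800 seconds before `terminate` runs, while a successful job would be torn down right away. The `sleep` parameter is injectable so the logic can be exercised without actually waiting.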


concretevitamin commented 2 hours ago

We can see if more people want this, or mark it experimental. Maybe --keep-alive-minutes-on-errors?