skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.82k stars, 513 forks

[Fluidstack] sky launch can leak instances when instance creation times out #4392

Open Xe opened 6 hours ago

Xe commented 6 hours ago

With a config like this:

resources:
  accelerators: [A100:1]
  cloud: fluidstack

setup: "echo hi"

run: "python -m http.server 8080"

Fluidstack instance creation fails, and the instances in the cloud then cannot be destroyed. When I try to terminate them, the panel returns an HTTP 500 error asking me to try again in 60 seconds, with this JSON body:

{"message":"Unable to terminate instance. Please try again in 60 seconds."}
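For illustration, here is a minimal sketch of how a cleanup path might cope with that response: retry the terminate call, waiting out the 60 seconds the error message asks for, instead of giving up on the first 500. This is not SkyPilot's or Fluidstack's actual code. `terminate_with_retry` and the `terminate` callable (assumed to return an HTTP status code and the raw response body) are hypothetical names, and the retry count is an arbitrary choice.

```python
import json
import time
from typing import Callable, Tuple

# Substring of the error message shown above, used to recognise the retryable case.
RETRY_HINT = "try again in 60 seconds"


def terminate_with_retry(
    terminate: Callable[[str], Tuple[int, str]],  # hypothetical: returns (status, body)
    instance_id: str,
    max_attempts: int = 5,        # assumption: arbitrary upper bound on attempts
    wait_seconds: float = 60.0,   # taken from the "try again in 60 seconds" message
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once termination succeeds, False if every attempt was told to retry."""
    for attempt in range(1, max_attempts + 1):
        status, body = terminate(instance_id)
        if 200 <= status < 300:
            return True

        # Pull the "message" field out of the JSON body, tolerating malformed bodies.
        try:
            message = str(json.loads(body).get("message", ""))
        except (ValueError, AttributeError):
            message = ""

        # Only the specific "try again" 500 is treated as retryable.
        if not (status == 500 and RETRY_HINT in message.lower()):
            raise RuntimeError(
                f"terminate failed for {instance_id}: HTTP {status} {body!r}"
            )

        # Wait before the next attempt, but not after the last one.
        if attempt < max_attempts:
            sleep(wait_seconds)

    return False
```

A caller would pass in whatever function actually issues the terminate request. A `False` return means the instance is probably still running and should be reported to the user rather than silently dropped.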

Version & Commit info: