skypilot-org / skypilot


[feature] better handling of failed rollouts #4312

Open alita-moore opened 1 week ago

alita-moore commented 1 week ago

I’ve observed an issue during the rollout of updates to my application. When I deploy a new version, the new replica initially enters the READY state but shortly afterward transitions to NOT_READY. Meanwhile, the previous replica (version 1) enters SHUTTING_DOWN status. It seems that even if the new replica becomes NOT_READY, the shutdown of the previous replica continues and is not canceled.

Example:

•   I deploy Version 2 of my application.
•   The new replica for Version 2 starts and reports as READY.
•   The replica for Version 1 transitions to SHUTTING_DOWN.
•   Shortly after, the new replica (version 2) transitions to NOT_READY due to an issue.
•   Even though the new replica is now NOT_READY, the old replica continues to shut down and doesn’t revert to READY status.
•   This results in neither replica being fully operational.

I haven’t tested what happens if the new replica immediately enters NOT_READY upon deployment. It would be ideal if the rollout process could automatically detect such failures and revert to the last working version, ensuring continuous availability.

Version & Commit info:

concretevitamin commented 1 week ago

cc @cblmemo @MaoZiming

cblmemo commented 1 week ago

Hi @alita-moore ! Thanks for reporting this. We do have some checks against unrecoverable errors and stop scaling after detecting such an error. However, the case you described in this issue is a bit more complicated: the replica turns READY first and then NOT_READY, which makes it look very similar to a transient network error.

https://github.com/skypilot-org/skypilot/blob/914328acb8269d79e304ad891f84d220e077565c/sky/serve/autoscalers.py#L409-L416
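
For readers without the link open, the gist of that check is roughly the sketch below. The status names and function are made up for illustration and are not the actual autoscaler code: once a replica of the new version fails in a way that retrying will not fix, further scale-ups of that version are frozen.

```python
# Simplified, hypothetical sketch of "stop scaling on unrecoverable error".
# Status names are illustrative only, not SkyPilot's real replica statuses.
UNRECOVERABLE_STATUSES = {'FAILED', 'FAILED_SETUP'}


def should_freeze_scale_up(new_version_replica_statuses: list) -> bool:
    """Return True if any new-version replica hit an unrecoverable status."""
    return any(status in UNRECOVERABLE_STATUSES
               for status in new_version_replica_statuses)
```

In this sketch, a replica that goes READY and then NOT_READY never hits one of those statuses, which mirrors why the existing check is hard to apply to the case described above.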

Though, I think there are two things we should do:

  1. Instead of "stop scaling", we should fall back to the latest ready version (or at least have a feature flag to enable this);
  2. Implement some failure-count-based method to detect this special case of error (e.g. if it fails on 10+ replicas, it is unlikely to be a transient network error). See the sketch below for how (1) and (2) could combine.
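
A minimal, hypothetical sketch of how (1) and (2) could work together, assuming the 10-failure threshold from item 2; none of these names exist in SkyPilot today:

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum, auto
from typing import List


class ReplicaStatus(Enum):
    READY = auto()
    NOT_READY = auto()
    SHUTTING_DOWN = auto()


@dataclass
class ReplicaInfo:
    version: int
    status: ReplicaStatus


# Assumption: 10+ NOT_READY replicas of the new version means the rollout
# itself is broken rather than a transient network blip.
FAILURE_THRESHOLD = 10


def pick_target_version(replicas: List[ReplicaInfo], target_version: int) -> int:
    """Return the version the autoscaler should converge to.

    If the target (new) version has accumulated too many NOT_READY replicas,
    fall back to the newest version that still has a READY replica instead of
    continuing the rollout.
    """
    not_ready_counts = Counter(r.version for r in replicas
                               if r.status == ReplicaStatus.NOT_READY)
    if not_ready_counts[target_version] < FAILURE_THRESHOLD:
        # Rollout still looks plausible; keep converging to the new version.
        return target_version

    ready_versions = [r.version for r in replicas
                      if r.status == ReplicaStatus.READY
                      and r.version != target_version]
    if ready_versions:
        # Fall back to the latest version known to be good.
        return max(ready_versions)
    # Nothing healthy to fall back to; keep the target and surface the failure.
    return target_version
```

Scale-down of old-version replicas could then be gated on this decision, so that the version-1 shutdown described in the original report would be canceled once the rollout is deemed failed.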

Thanks for reporting this! We'll keep working on it. LMK if you have other suggestions.