Open alita-moore opened 1 week ago
cc @cblmemo @MaoZiming
Hi @alita-moore! Thanks for reporting this. We do have some checks against unrecoverable errors, and we stop scaling after detecting such an error. However, the case you described in the PR is a bit more complicated: the replica turns READY first and then NOT_READY, which makes it look very similar to a transient network error.
Though, I think there are two things we should do:
Thanks for reporting this! We'll keep working on this. LMK if you have other suggestions.
I’ve observed an issue during the rollout of updates to my application. When I deploy a new version, the new replica initially enters the READY state but shortly afterward transitions to NOT_READY. Meanwhile, the previous replica—which is version 1—enters a shutting down status. It seems that even if the new replica becomes NOT_READY, the shutdown of the previous replica continues and isn’t canceled.
Example:
I haven’t tested what happens if the new replica immediately enters NOT_READY upon deployment. It would be ideal if the rollout process could automatically detect such failures and revert to the last working version, ensuring continuous availability.
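To make the requested behavior concrete, here is a minimal sketch of the kind of check that could gate the shutdown of the old replica. It is not SkyPilot code: `RolloutGuard`, `decide`, and both thresholds are hypothetical names and values chosen for illustration, and the sketch assumes readiness probes are fed in starting from the moment the new replica first reports READY.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    READY = "READY"
    NOT_READY = "NOT_READY"


class Action(Enum):
    WAIT = "wait"          # keep both versions running, keep probing
    PROMOTE = "promote"    # new version is stable; old replica may shut down
    ROLLBACK = "rollback"  # new version failed; keep serving the old replica


@dataclass
class RolloutGuard:
    """Hypothetical helper for illustration; not part of SkyPilot."""
    stable_probes_required: int = 3  # consecutive READY probes before promoting
    max_failed_probes: int = 2       # consecutive NOT_READY probes before rollback
    _ready_streak: int = 0
    _failed_streak: int = 0

    def observe(self, status: Status) -> Action:
        # Track streaks so a single failed probe (for example a transient
        # network error) resets progress but does not trigger a rollback.
        if status is Status.READY:
            self._ready_streak += 1
            self._failed_streak = 0
        else:
            self._failed_streak += 1
            self._ready_streak = 0

        if self._ready_streak >= self.stable_probes_required:
            return Action.PROMOTE
        if self._failed_streak >= self.max_failed_probes:
            return Action.ROLLBACK
        return Action.WAIT


def decide(probes, guard=None):
    """Return the first non-WAIT action for a sequence of probe results."""
    guard = guard or RolloutGuard()
    for status in probes:
        action = guard.observe(status)
        if action is not Action.WAIT:
            return action
    return Action.WAIT
```

With these assumed thresholds, the sequence from this report (READY, then repeated NOT_READY) yields `Action.ROLLBACK` before the old replica is touched, while a single failed probe between READY results is tolerated and still ends in `Action.PROMOTE`.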
Version & Commit info:
sky -v: skypilot, version 0.7.0
sky -c: 3f625886bf1b13ee463a9f8e0f6741f620f7f66f