Hi @alita-moore! Thanks for reporting this. We do have some checks for unrecoverable errors, and we stop scaling once we detect one. However, the case you described in the PR is a bit more complicated: the replica turns READY first and then NOT_READY, which makes it look very similar to a transient network error.
That said, I think there are two things we should do:
Instead of "stop scaling", we should fall back to the latest ready version (or at least provide a feature flag to enable this);
Implement a failure-count-based method to detect this special case (e.g. if it fails on 10+ replicas, it is unlikely to be a transient network error).
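To make the two ideas above concrete, here is a rough sketch of how they could fit together. This is not SkyPilot's actual code: `RolloutGuard`, its methods, and the integer version/replica IDs are all hypothetical names made up for illustration, and the default threshold of 10 simply mirrors the "10+ replicas" example.

```python
from collections import defaultdict
from typing import Dict, Optional, Set


class RolloutGuard:
    """Hypothetical helper; not part of SkyPilot's API."""

    def __init__(self, threshold: int = 10) -> None:
        # Assumed cutoff: this many distinct failing replicas means
        # the failure is probably not a transient network error.
        self.threshold = threshold
        # Versions that had at least one replica reach READY.
        self.ready_versions: Set[int] = set()
        # version -> IDs of replicas that went NOT_READY.
        self.failed: Dict[int, Set[int]] = defaultdict(set)

    def record_ready(self, version: int, replica_id: int) -> None:
        self.ready_versions.add(version)

    def record_not_ready(self, version: int, replica_id: int) -> None:
        # A set is used so repeated failures of one replica count once.
        self.failed[version].add(replica_id)

    def is_unrecoverable(self, version: int) -> bool:
        return len(self.failed.get(version, ())) >= self.threshold

    def fallback_version(self, version: int) -> Optional[int]:
        """Return the version to fall back to, or None to keep going."""
        if not self.is_unrecoverable(version):
            return None
        # Latest earlier version that was ready and is not itself bad.
        candidates = [
            v for v in self.ready_versions
            if v < version and not self.is_unrecoverable(v)
        ]
        return max(candidates, default=None)
```

Counting distinct replicas rather than raw failure events is a deliberate choice in this sketch: one flapping replica should not be enough to trigger a fallback, while the same failure across many replicas should.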
Thanks for reporting this! We'll keep working on this. LMK if you have other suggestions.
Originally posted by @cblmemo in https://github.com/skypilot-org/skypilot/issues/4312#issuecomment-2474661242