[Serve] Failure-count based unrecoverable failure detection

Hi @alita-moore! Thanks for reporting this. We do have some checks for unrecoverable errors, and we stop scaling once we detect one. However, the case you described in the PR is a bit more complicated: the replica turns READY first and then NOT_READY, which makes it look very similar to a transient network error.

https://github.com/skypilot-org/skypilot/blob/914328acb8269d79e304ad891f84d220e077565c/sky/serve/autoscalers.py#L409-L416

Though, I think there are two things we should do:

- Instead of "stop scaling", we should fall back to the latest ready version (or at least have a feature flag to enable this);
- Implement a failure-count based method to detect this special kind of error (e.g. if it fails on 10+ replicas, it is unlikely to be a transient network error).
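
To make the two ideas above concrete, here is a minimal sketch of what a failure-count check with version fallback could look like. It is not SkyPilot code: `FailureCounter`, its methods, and the default threshold of 10 are hypothetical names and values chosen for illustration, and the real autoscaler would need to decide what counts as a failure (for example, a READY to NOT_READY transition).

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set


class FailureCounter:
    """Hypothetical helper; not part of SkyPilot's API."""

    def __init__(self, threshold: int = 10) -> None:
        # Assumed threshold, taken from the "10+ replicas" example.
        self.threshold = threshold
        # Maps a service version to the distinct replica IDs that failed on it.
        self._failed_replicas: Dict[int, Set[int]] = defaultdict(set)

    def record_failure(self, version: int, replica_id: int) -> None:
        # A set is used so that one flapping replica is counted only once.
        self._failed_replicas[version].add(replica_id)

    def is_unrecoverable(self, version: int) -> bool:
        # .get avoids creating an empty entry for versions never seen.
        failed = self._failed_replicas.get(version, set())
        return len(failed) >= self.threshold

    def pick_version(self, versions_newest_first: Iterable[int]) -> Optional[int]:
        # Fall back to the newest version not flagged as unrecoverable.
        for version in versions_newest_first:
            if not self.is_unrecoverable(version):
                return version
        return None
```

Counting distinct replicas rather than raw failure events is one possible design choice: a single replica hitting repeated transient network errors would not trip the threshold, while the same failure across many replicas would.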

Thanks for reporting this! We'll keep working on this. LMK if you have other suggestions.

Originally posted by @cblmemo in https://github.com/skypilot-org/skypilot/issues/4312#issuecomment-2474661242

[Serve] Failure-count based unrecoverable failure detection #4349