skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[Serve] Failure-count based unrecoverable failure detection #4349

Open cblmemo opened 1 week ago

cblmemo commented 1 week ago
Hi @alita-moore! Thanks for reporting this. We do have some checks for unrecoverable errors, and we stop scaling after detecting one. However, the case you described in the PR is a bit more complicated: the replica turns READY first and then NOT_READY, which makes it look very similar to a transient network error.

https://github.com/skypilot-org/skypilot/blob/914328acb8269d79e304ad891f84d220e077565c/sky/serve/autoscalers.py#L409-L416

Though, I think there are two things we should do:

  1. Instead of "stop scaling", we should fall back to the latest ready version (or at least have a feature flag to enable this);
  2. Implement some failure-count based method to detect this special case of error (e.g. if it fails on 10+ replicas, it is unlikely to be a transient network error).
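
To make the two points above concrete, here is a rough sketch of what a failure-count based check with fallback could look like. This is not SkyPilot's actual code: the class `FailureCountDetector`, its methods, and the integer version/replica IDs are all hypothetical, and the threshold of 10 is just the example number from the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class FailureCountDetector:
    """Hypothetical helper for illustration; not part of SkyPilot's API."""

    # Assumed cutoff: this many distinct failing replicas means the
    # failure is unlikely to be a transient network error.
    threshold: int = 10
    # version -> IDs of replicas that went READY and then NOT_READY
    _failed: Dict[int, Set[int]] = field(default_factory=dict)
    # versions for which at least one replica has been READY
    _ready_versions: Set[int] = field(default_factory=set)

    def record_ready(self, version: int) -> None:
        self._ready_versions.add(version)

    def record_failure(self, version: int, replica_id: int) -> None:
        # A set is used so a single flapping replica is counted only once.
        self._failed.setdefault(version, set()).add(replica_id)

    def is_unrecoverable(self, version: int) -> bool:
        return len(self._failed.get(version, set())) >= self.threshold

    def fallback_version(self) -> Optional[int]:
        # Latest version that has been READY and is not flagged as
        # unrecoverable; None if there is no such version.
        healthy = [
            v for v in self._ready_versions if not self.is_unrecoverable(v)
        ]
        return max(healthy, default=None)
```

Counting distinct replicas rather than raw failure events is a deliberate choice in this sketch: one replica on a flaky node could fail many times, while the same failure across many replicas points at the service itself.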

Thanks for reporting this! We'll keep working on this. LMK if you have other suggestions.

Originally posted by @cblmemo in https://github.com/skypilot-org/skypilot/issues/4312#issuecomment-2474661242