skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.71k stars 495 forks

[k8s][ux] Auto-exclude stale Kubernetes cloud #2807

Open romilbhardwaj opened 11 months ago

romilbhardwaj commented 11 months ago

I often terminate a Kubernetes cluster externally using the cloud console or CLI (e.g., gcloud container clusters delete <cluster-name> --region us-central1-c), but forget to run sky check afterwards to update the list of enabled clouds.

As a result, the next sky launch fails:

sky.exceptions.ResourcesUnavailableError: Timed out when trying to get node info from Kubernetes cluster. Please check if the cluster is healthy and retry.

We should consider printing a warning and continuing by either:

1. Excluding Kubernetes from the list of clouds considered by the optimizer.
2. Removing Kubernetes from the list of enabled clouds stored in global user state.

Option 1 is less aggressive and doesn't require the user to re-run sky check if the failure turns out to be transient.

Michaelvll commented 8 months ago

This is also related to #3013

kbrgl commented 8 months ago

Going to self-assign and work on this!

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.
