skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[k8s] `sky down` multiple clusters on GKE can cause error #3583

Open · Michaelvll opened this issue 4 months ago

Michaelvll commented 4 months ago
sky down -a
Terminating 12 clusters: t-kubernetes-storage-37-6d, t-multi-echo-38, t-large-job-queue-bc, t-minimal-13, t-multi-node-failure-17, t-env-check-aa, t-multi-hostname-17, t-cli-logs-5a, t-cancel-pytorch-3c, t-autodown-46, t-file-mounts-05, test-k8s. Proceed? [Y/n]:
...
   return f(*args, **kwargs)
  File "/home/gcpuser/skypilot/sky/cli.py", line 2562, in down
    _down_or_stop_clusters(clusters,
  File "/home/gcpuser/skypilot/sky/cli.py", line 2880, in _down_or_stop_clusters
    subprocess_utils.run_in_parallel(_down_or_stop, clusters)
  File "/home/gcpuser/skypilot/sky/utils/subprocess_utils.py", line 65, in run_in_parallel
    return list(p.imap(func, args))
  File "/opt/conda/envs/sky/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
  File "/opt/conda/envs/sky/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/gcpuser/skypilot/sky/cli.py", line 2852, in _down_or_stop
    core.down(name, purge=purge)
  File "/home/gcpuser/skypilot/sky/utils/common_utils.py", line 388, in _record
    return f(*args, **kwargs)
  File "/home/gcpuser/skypilot/sky/core.py", line 401, in down
    backend.teardown(handle, terminate=True, purge=purge)
  File "/home/gcpuser/skypilot/sky/utils/common_utils.py", line 388, in _record
    return f(*args, **kwargs)
  File "/home/gcpuser/skypilot/sky/utils/common_utils.py", line 367, in _record
    return f(*args, **kwargs)
  File "/home/gcpuser/skypilot/sky/backends/backend.py", line 116, in teardown
    self._teardown(handle, terminate, purge)
  File "/home/gcpuser/skypilot/sky/backends/cloud_vm_ray_backend.py", line 3518, in _teardown
    self.teardown_no_lock(
  File "/home/gcpuser/skypilot/sky/backends/cloud_vm_ray_backend.py", line 3839, in teardown_no_lock
    provisioner.teardown_cluster(repr(cloud),
  File "/home/gcpuser/skypilot/sky/provision/provisioner.py", line 237, in teardown_cluster
    provision.terminate_instances(cloud_name, cluster_name.name_on_cloud,
  File "/home/gcpuser/skypilot/sky/provision/__init__.py", line 47, in _wrapper
    return impl(*args, **kwargs)
  File "/home/gcpuser/skypilot/sky/provision/kubernetes/instance.py", line 651, in terminate_instances
    pods = _filter_pods(namespace, tag_filters, None)
  File "/home/gcpuser/skypilot/sky/provision/kubernetes/instance.py", line 61, in _filter_pods
    pod_list = kubernetes.core_api().list_namespaced_pod(
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/kubernetes/client/api/core_v1_api.py", line 15823, in list_namespaced_pod
    return self.list_namespaced_pod_with_http_info(namespace, **kwargs)  # noqa: E501
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/kubernetes/client/api/core_v1_api.py", line 15942, in list_namespaced_pod_with_http_info
    return self.api_client.call_api(
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/kubernetes/stream/ws_client.py", line 529, in websocket_call
    raise ApiException(status=0, reason=str(e))
kubernetes.client.exceptions.ApiException: (0)
Reason: Handshake status 200 OK
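
For context, the traceback goes through `subprocess_utils.run_in_parallel`, so each cluster's teardown issues its own `list_namespaced_pod` call to the GKE API server concurrently. Below is a minimal, hypothetical sketch of that concurrent-listing pattern; the `default` namespace, the `skypilot-cluster=...` label selector, the cluster names, and `_list_cluster_pods` are placeholders (not SkyPilot's actual tag filters or code), and this may not reproduce the handshake failure reliably.

```python
# Hypothetical repro sketch mirroring the call pattern in the traceback:
# a pool of workers, one per cluster, each listing pods for its cluster.
from multiprocessing.pool import ThreadPool

from kubernetes import client, config


def _list_cluster_pods(cluster_name: str):
    # Each worker builds its own API client and lists that cluster's pods,
    # filtering by a placeholder cluster-name label.
    core_api = client.CoreV1Api()
    return core_api.list_namespaced_pod(
        'default', label_selector=f'skypilot-cluster={cluster_name}')


if __name__ == '__main__':
    config.load_kube_config()  # Assumes a kubeconfig pointing at the GKE cluster.
    clusters = [f't-cluster-{i}' for i in range(12)]  # 12 clusters, as above.
    with ThreadPool(processes=len(clusters)) as pool:
        results = list(pool.imap(_list_cluster_pods, clusters))
```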
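
The exception is raised from the client's websocket path (`kubernetes/stream/ws_client.py`) with `status=0`, i.e. a client-side handshake problem rather than an HTTP error from the API server, which suggests it may be transient under this kind of concurrency. A possible mitigation sketch, assuming the listing call is safe to retry; `list_pods_with_retry`, its parameters, and the kubeconfig handling are illustrative, not the actual SkyPilot fix:

```python
# Hypothetical workaround sketch: retry the pod listing when the client
# surfaces a transient status-0 "Handshake status 200 OK" ApiException.
import time

from kubernetes import client, config
from kubernetes.client.exceptions import ApiException


def list_pods_with_retry(namespace: str,
                         label_selector: str,
                         max_retries: int = 3,
                         backoff_seconds: float = 1.0):
    """List pods, retrying on transient websocket handshake failures."""
    config.load_kube_config()  # Assumes a local kubeconfig for the cluster.
    core_api = client.CoreV1Api()
    for attempt in range(max_retries):
        try:
            return core_api.list_namespaced_pod(
                namespace, label_selector=label_selector)
        except ApiException as e:
            # status == 0 is the client-side handshake error seen above;
            # anything else is a real API error and should propagate.
            if e.status != 0 or attempt == max_retries - 1:
                raise
            time.sleep(backoff_seconds * (attempt + 1))
```

If a retry is the right call, the equivalent change would presumably live around `_filter_pods` in `sky/provision/kubernetes/instance.py`, where the listing call in the traceback originates.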

Version & Commit info:

github-actions[bot] commented 1 day ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.