skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

AssertionError after manually deleting runpod instance #4286

Closed alita-moore closed 2 weeks ago

alita-moore commented 2 weeks ago

I manually terminated the instance running on RunPod, expecting the service to recover automatically. I did this because the remote Docker image had been updated but the service was not picking up the new image. Now I get this every time I run sky serve status:

> sky serve status                    
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/serve/serve_utils.py", line 385, in get_service_status_encoded
    service_status = _get_service_status(service_name)
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/serve/serve_utils.py", line 372, in _get_service_status
    record['replica_info'] = [
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/serve/serve_utils.py", line 373, in <listcomp>
    info.to_info_dict(with_handle=True)
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/serve/replica_managers.py", line 469, in to_info_dict
    'endpoint': self.url,
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/serve/replica_managers.py", line 440, in url
    endpoint_dict = core.endpoints(handle.cluster_name,
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/core.py", line 209, in endpoints
    return backend_utils.get_endpoints(cluster=cluster, port=port)
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/backends/backend_utils.py", line 2682, in get_endpoints
    port_details = provision_lib.query_ports(repr(cloud),
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/provision/__init__.py", line 50, in _wrapper
    return impl(*args, **kwargs)
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/provision/runpod/instance.py", line 235, in query_ports
    assert len(instances) == 1
AssertionError

Services
Failed to fetch service statuses due to connection issues. Please try again later. Details: [RuntimeError] Failed to fetch services

Would this assertion also limit my ability to have more instances than those allocated by SkyPilot? I.e., if I manually created a new pod, would that affect this?

Version & Commit info:

alita-moore commented 2 weeks ago

I am now unable to delete / monitor the status of my services:

sky serve status                           
AssertionError

Services
Failed to fetch service statuses due to connection issues. Please try again later. Details: [RuntimeError] Failed to fetch services

alita-moore commented 2 weeks ago

I can't even do sky down without it being blocked / stopped so now my system is stuck :(

alita-moore commented 2 weeks ago

I had to delete my home sky directory at ~/.sky and then manually terminate the running instances on AWS. :(

concretevitamin commented 2 weeks ago

Hello @alita-moore. To understand: did you have a single service with replicas from both runpod and AWS? Where was the serve controller?

Is it correct that you then went ahead to manually terminate the runpod replica instance, and started seeing the above issues?

alita-moore commented 2 weeks ago

yeah the controller was on aws and the worker was on runpod, and yeah I deleted the replica from runpod manually and then the issues started

concretevitamin commented 2 weeks ago

cc @cblmemo to help repro

cblmemo commented 2 weeks ago

Hi @alita-moore , thanks for reporting this issue! Just submitted a PR #4288 to fix this.

> Would this assertion also limit my ability to have more instances than those allocated by SkyPilot? I.e., if I manually created a new pod, would that affect this?

Currently, we filter based on the name of the pods (e.g. f'{cluster_name}-{head,worker}'). As long as the name of the pod you create does not conflict with those, it should be fine ;)
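
For readers following along, here is a rough sketch of what name-based filtering like this could look like. It is not SkyPilot's actual code and does not show what PR #4288 changes: the function names, the shape of the `pods` dict, and the `ports` field are all assumptions made for illustration. It also shows one possible way to handle a pod that was deleted by hand, by returning an empty result instead of asserting that exactly one instance exists.

```python
from typing import Any, Dict, List


def expected_pod_names(cluster_name: str) -> List[str]:
    """Pod names under the '<cluster>-head' / '<cluster>-worker' scheme."""
    return [f'{cluster_name}-head', f'{cluster_name}-worker']


def filter_cluster_pods(pods: Dict[str, Dict[str, Any]],
                        cluster_name: str) -> Dict[str, Dict[str, Any]]:
    """Keep only the pods whose name belongs to the given cluster.

    `pods` is a hypothetical mapping of pod ID -> {'name': ..., 'ports': ...}.
    """
    wanted = set(expected_pod_names(cluster_name))
    return {
        pod_id: pod for pod_id, pod in pods.items()
        if pod.get('name') in wanted
    }


def query_head_ports(pods: Dict[str, Dict[str, Any]],
                     cluster_name: str) -> List[int]:
    """Return the head pod's ports, or [] if the head pod is gone.

    Returning an empty list (rather than `assert len(instances) == 1`)
    is an illustrative choice, not necessarily the upstream fix.
    """
    head_name = f'{cluster_name}-head'
    matches = filter_cluster_pods(pods, cluster_name)
    heads = [pod for pod in matches.values() if pod['name'] == head_name]
    if not heads:
        return []
    return list(heads[0].get('ports', []))


if __name__ == '__main__':
    all_pods = {
        'pod-a': {'name': 'svc-1-head', 'ports': [8080]},
        'pod-b': {'name': 'my-manual-pod', 'ports': [22]},
    }
    # The manually created pod has a non-conflicting name, so it is ignored.
    print(filter_cluster_pods(all_pods, 'svc-1'))
    # After the head pod is deleted by hand, the query degrades gracefully.
    print(query_head_ports({'pod-b': all_pods['pod-b']}, 'svc-1'))
```

Under these assumptions, a manually created pod only interferes if it is given exactly one of the names the cluster would use for its own head or worker pod.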