AssertionError after manually deleting runpod instance

alita-moore commented 2 weeks ago

I manually terminated the instance running on runpod, expecting the service to recover automatically. I did this because the remote docker image had been updated but the service was not picking up the new image. Now I get this every time I run sky serve status:

> sky serve status                    
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/serve/serve_utils.py", line 385, in get_service_status_encoded
    service_status = _get_service_status(service_name)
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/serve/serve_utils.py", line 372, in _get_service_status
    record['replica_info'] = [
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/serve/serve_utils.py", line 373, in <listcomp>
    info.to_info_dict(with_handle=True)
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/serve/replica_managers.py", line 469, in to_info_dict
    'endpoint': self.url,
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/serve/replica_managers.py", line 440, in url
    endpoint_dict = core.endpoints(handle.cluster_name,
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/core.py", line 209, in endpoints
    return backend_utils.get_endpoints(cluster=cluster, port=port)
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/backends/backend_utils.py", line 2682, in get_endpoints
    port_details = provision_lib.query_ports(repr(cloud),
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/provision/__init__.py", line 50, in _wrapper
    return impl(*args, **kwargs)
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/provision/runpod/instance.py", line 235, in query_ports
    assert len(instances) == 1
AssertionError

Services
Failed to fetch service statuses due to connection issues. Please try again later. Details: [RuntimeError] Failed to fetch services

Would this assertion also limit my ability to have more instances than those allocated by skypilot? i.e. if I wanted to manually create a new pod, would that affect this?

Version & Commit info:

sky -v: skypilot, version 0.7.0
sky -c: 3f625886bf1b13ee463a9f8e0f6741f620f7f66f

alita-moore commented 2 weeks ago

I am now unable to delete / monitor the status of my services:

sky serve status                           
AssertionError

Services
Failed to fetch service statuses due to connection issues. Please try again later. Details: [RuntimeError] Failed to fetch services

alita-moore commented 2 weeks ago

I can't even do sky down without it being blocked / stopped so now my system is stuck :(

alita-moore commented 2 weeks ago

I had to delete my home sky directory at ~/.sky and then manually terminate the running instances on AWS. :(

concretevitamin commented 2 weeks ago

Hello @alita-moore. To understand: did you have a single service with replicas from both runpod and AWS? Where was the serve controller?

Is it correct that you then went ahead to manually terminate the runpod replica instance, and started seeing the above issues?

alita-moore commented 2 weeks ago

Yeah, the controller was on AWS and the worker was on runpod. And yes, I deleted the replica from runpod manually, and then the issues started.

concretevitamin commented 2 weeks ago

cc @cblmemo to help repro

cblmemo commented 2 weeks ago

Hi @alita-moore, thanks for reporting this issue! I just submitted PR #4288 to fix this.

Would this assertion also limit my ability to have more instances than those allocated by skypilot? i.e. if I wanted to manually create a new pod, would that affect this?

Currently, we are filtering based on the name of the pods (e.g. f'{cluster_name}-head' or f'{cluster_name}-worker'). As long as the name of the pod you create does not conflict with these, it should be fine ;)
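
To illustrate the name-based filtering described above, here is a minimal sketch. It is not the actual SkyPilot code: the `filter_cluster_pods` helper, the pod dictionaries, and the cluster name `my-service-1` are all hypothetical, and the real RunPod API response has a different shape. It only shows why a manually created pod with an unrelated name is ignored, and why a manually deleted pod leaves nothing to match.

```python
from typing import Dict, List

# Hypothetical pod record: only 'id' and 'name' are modeled here.
Pod = Dict[str, str]


def filter_cluster_pods(pods: List[Pod], cluster_name: str) -> List[Pod]:
    """Keep only pods named '<cluster_name>-head' or '<cluster_name>-worker'."""
    wanted = {f'{cluster_name}-head', f'{cluster_name}-worker'}
    return [pod for pod in pods if pod['name'] in wanted]


pods = [
    {'id': 'a1', 'name': 'my-service-1-head'},
    {'id': 'b2', 'name': 'manual-experiment'},  # created by hand, no name conflict
]

# The hand-made pod is filtered out; only the cluster's own pod remains.
matched = filter_cluster_pods(pods, 'my-service-1')
print([pod['id'] for pod in matched])

# If the cluster's pod has been deleted by hand, nothing matches, so a check
# such as `assert len(instances) == 1` would fail on this empty list.
remaining = filter_cluster_pods(pods[1:], 'my-service-1')
print(remaining)
```

Under these assumptions, extra pods only matter if their names collide with the `-head` / `-worker` pattern for an existing cluster name.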

skypilot-org / skypilot

AssertionError after manually deleting runpod instance #4286