Closed alita-moore closed 2 weeks ago
I am now unable to delete / monitor the status of my services:
sky serve status
AssertionError
Services
Failed to fetch service statuses due to connection issues. Please try again later. Details: [RuntimeError] Failed to fetch services
I can't even run sky down without it being blocked / stopped, so now my system is stuck :(
I had to delete my home sky directory at ~/.sky and then manually terminate the running instances on AWS. :(
Hello @alita-moore. To understand: did you have a single service with replicas from both runpod and AWS? Where was the serve controller?
Is it correct that you then went ahead to manually terminate the runpod replica instance, and started seeing the above issues?
yeah the controller was on aws and the worker was on runpod, and yeah I deleted the replica from runpod manually and then the issues started
cc @cblmemo to help repro
Hi @alita-moore, thanks for reporting this issue! Just submitted a PR #4288 to fix this.
Would this assertion also limit my ability to have more instances than those allocated by skypilot? i.e. if I wanted to manually create a new pod, would that affect this?
Currently, we are filtering based on the name of the pods (e.g. f'{cluster_name}-{head,worker}'). As long as the name of the pod you create does not conflict with these, it should be fine ;)
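To illustrate the name-based filtering described above, here is a minimal sketch. It is not SkyPilot's actual code: the helper names `is_managed_pod` and `filter_managed_pods` are hypothetical, and the exact-match rule on `<cluster_name>-head` / `<cluster_name>-worker` is an assumption based only on the naming pattern mentioned in the comment.

```python
from typing import List


def is_managed_pod(pod_name: str, cluster_name: str) -> bool:
    """Hypothetical check: does the pod name follow the
    '<cluster_name>-head' or '<cluster_name>-worker' pattern?

    Assumption: an exact match on these two names. The real filter
    may be looser or stricter than this.
    """
    managed_names = (f'{cluster_name}-head', f'{cluster_name}-worker')
    return pod_name in managed_names


def filter_managed_pods(pod_names: List[str], cluster_name: str) -> List[str]:
    """Keep only the pods whose names match the cluster's naming pattern."""
    return [name for name in pod_names if is_managed_pod(name, cluster_name)]


# Example: a manually created pod with an unrelated name is ignored.
pods = ['my-service-1-head', 'my-service-1-worker', 'my-manual-pod']
managed = filter_managed_pods(pods, 'my-service-1')
```

Under these assumptions, a manually created pod such as `my-manual-pod` would not be picked up by the filter, which matches the "no conflict" condition in the reply.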
I terminated the instance running in runpod manually, expecting that the service would automatically recover. I wanted to do this because the remote docker image had been updated but the service was not picking up the update. But now I'm getting this every time I run
sky serve status
Version & Commit info:
sky -v: skypilot, version 0.7.0
sky -c: 3f625886bf1b13ee463a9f8e0f6741f620f7f66f