skypilot-org / skypilot

SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.
https://skypilot.readthedocs.io

Issue with Sky Spot Controller when launched from GCP VM #3316

Open subhamde8247 opened 4 months ago

subhamde8247 commented 4 months ago

I am getting error:

 03-14 18:38:12 provisioner.py:76] Launching on GCP us-central1 (us-central1-a)
E 03-14 18:48:28 provisioner.py:584] *** Failed setting up cluster. ***
RuntimeError: Failed to SSH to X after timeout 600s, with Error: Warning: Permanently added ‘X’ (ED25519) to the list of known hosts.
gcpuser@X: Permission denied (publickey).

I am running LLM inference using SkyPilot and vLLM. I use sky spot launch from a Docker container inside a GCP VM that has a service account attached to it. We run this job daily: a VM with the same name and region is spun up every day, and the Docker container inside the VM in turn launches the sky cluster. On day 1, everything works fine. But on day 2, sky spot launch tries to reuse the same spot controller from day 1 and errors out with the logs above. I can see that the spot controller is active, and I can SSH onto it.

Any hints @Michaelvll? I saw your comment here, but I do not fully understand what the solution is.

subhamde8247 commented 4 months ago

One hack we have been using for now: at the end of each run, use gcloud compute instances list to find all VMs whose names match the pattern sky-spot-controller- and delete them.
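
For reference, a minimal sketch of this cleanup (assuming the sky-spot-controller- name prefix mentioned above; adjust the filter to your setup, and make sure no spot jobs are still running before deleting their controller):

    # List all instances whose names start with the spot-controller prefix,
    # then delete each one (gcloud requires the zone for deletion).
    gcloud compute instances list \
        --filter="name ~ ^sky-spot-controller-" \
        --format="value(name,zone.basename())" |
    while read -r name zone; do
        gcloud compute instances delete "$name" --zone="$zone" --quiet
    done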

Wondering if I am missing something or if there is a better solution.

concretevitamin commented 4 months ago

from a docker container inside GCP VM that has a service account attached to it

Thanks for the report @subhamde8247! Could you share some details of this client VM? For example:

subhamde8247 commented 4 months ago
  1. "gcloud auth list" only shows the service account.
  2. The client VM is deleted each day after the run, and a new VM is created the next day, with a new container started within that VM. So spot launch is triggered each day from a different container attached to a new VM instance.
concretevitamin commented 4 months ago

@subhamde8247 Got it. Some followups:

subhamde8247 commented 4 months ago
  1. We do not explicitly mount ~/.sky to the VM/container.
  2. The SSH key changes on each new VM/container.

The error does not happen when I run the same container from my local machine (it launches a different spot controller each time), but when launched from a VM (with a service account), it tries to access the same spot controller and fails.

Michaelvll commented 4 months ago
  1. We do not explicitly mount ~/.sky to the VM/container.
  2. The SSH key changes on each new VM/container.

The error does not happen when I run the same container from my local machine (it launches a different spot controller each time), but when launched from a VM (with a service account), it tries to access the same spot controller and fails.

Thanks for sharing more details @subhamde8247! One hypothesis is that both the Linux username and the hostname (python -c "import socket; print(socket.gethostname())") are the same across the containers. The user hash we generate to identify different machines is based on those two values, so identical values produce the same hash, which leads to the same spot controller being reused.

To confirm the hypothesis, it would be nice to check whether cat ~/.sky/user_hash prints the same value across the multiple VMs/containers.
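
For illustration, here is a quick check one could run inside each VM/container; the md5-based line is only a hypothetical stand-in for how a user hash might be derived from these two values, not SkyPilot's actual code:

    # The two values hypothesized to drive the user hash:
    whoami
    python -c "import socket; print(socket.gethostname())"

    # Hypothetical illustration: any hash derived only from these two values
    # will collide across containers sharing the same username and hostname.
    python -c "import getpass, hashlib, socket; print(hashlib.md5((getpass.getuser() + socket.gethostname()).encode()).hexdigest()[:8])"

    # The value SkyPilot actually cached:
    cat ~/.sky/user_hash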

There are several workarounds:

  1. Share the SSH key across the multiple VMs/containers by explicitly uploading/mounting those keys to them.
  2. Or, randomly generate a user hash for each VM/container when it is first provisioned, by writing a random value to ~/.sky/user_hash: python -c "import uuid; print(uuid.uuid4().hex[:8])" > ~/.sky/user_hash (a start-up sketch is below).
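
A minimal container start-up sketch for workaround 2 (assuming it runs before the first sky spot launch; the guard avoids overwriting a hash that already exists):

    # Give each fresh VM/container its own user hash so it gets its own
    # spot controller; only generate one if it does not exist yet.
    mkdir -p ~/.sky
    if [ ! -f ~/.sky/user_hash ]; then
        python -c "import uuid; print(uuid.uuid4().hex[:8])" > ~/.sky/user_hash
    fi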

We will also look into the issue and see whether the username and python -c "import socket; print(socket.gethostname())" are sufficient for identifying a user : )

subhamde8247 commented 4 months ago

Confirmed that cat ~/.sky/user_hash is the same across multiple runs of the Docker container when launched from multiple VMs. However, the hashes are different across multiple runs when the same container is run from my local machine.

explicitly uploading/mounting those keys

Yeah, this would add some complexity: storing these keys in GCP Secret Manager and properly loading them during container start-up each day.

randomly generate a user hash for each VM/container

We don't mind having a new spot controller for each daily job. The only issue is that old spot controllers are not auto-downed, so we are left with a bunch of stopped spot controller instances in our VM list (which we have to delete manually). If available, a --down option for the spot controller would work for us.

github-actions[bot] commented 1 week ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.