skypilot-org / skypilot

SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.
https://skypilot.readthedocs.io

Issue with Sky Spot Controller when launched from GCP VM #3316

Open subhamde8247 opened 4 months ago

subhamde8247 commented 4 months ago

I am getting error:

 03-14 18:38:12 provisioner.py:76] Launching on GCP us-central1 (us-central1-a)
E 03-14 18:48:28 provisioner.py:584] *** Failed setting up cluster. ***
RuntimeError: Failed to SSH to X after timeout 600s, with Error: Warning: Permanently added ‘X’ (ED25519) to the list of known hosts.
gcpuser@X: Permission denied (publickey).

I am running LLM inference using SkyPilot and vLLM. I use sky spot launch from a Docker container inside a GCP VM that has a service account attached to it. We run this job daily: a VM with the same name and region is spun up every day, and the Docker container inside the VM in turn launches the sky cluster. On day 1, everything works fine. But on day 2, sky spot launch tries to reuse the same spot controller from day 1 and errors out with the logs above. I can see that the spot controller is active, and I can SSH onto it.

Any hints @Michaelvll? I saw your comment here, but I do not fully understand what the solution is.

subhamde8247 commented 4 months ago

One hack we have been using for now: at the end of each run, use gcloud compute instances list to find all VMs whose names match the pattern sky-spot-controller- and delete them.
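
For reference, a minimal sketch of this cleanup (assuming the sky-spot-controller- name prefix mentioned above; adjust the filter to your setup, and make sure no spot jobs are still running before deleting their controller):

    # List all instances whose names start with the spot-controller prefix,
    # then delete each one (gcloud requires the zone for deletion).
    gcloud compute instances list \
        --filter="name ~ ^sky-spot-controller-" \
        --format="value(name,zone.basename())" |
    while read -r name zone; do
        gcloud compute instances delete "$name" --zone="$zone" --quiet
    done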

Wondering if I am missing something or if there is a better solution.

concretevitamin commented 4 months ago

from a docker container inside GCP VM that has a service account attached to it

Thanks for the report @subhamde8247! Could you share some details of this client VM? For example:

subhamde8247 commented 4 months ago
  1. "gcloud auth list" only shows the service account.
  2. The client VM is deleted each day after the run, and a new VM is created the next day, with a new container started within that VM. So spot launch is triggered each day from a different container attached to a new VM instance.
concretevitamin commented 4 months ago

@subhamde8247 Got it. Some followups:

subhamde8247 commented 4 months ago
  1. We do not explicitly mount ~/.sky to the VM/container.
  2. The SSH key changes on each new VM/container.

The error does not happen when I run the same container from my local machine (it launches a different spot controller each time), but when launched from a VM (with a service account), it tries to access the same spot controller and fails.

Michaelvll commented 4 months ago
  1. We do not explicitly mount ~/.sky to the VM/container.
  2. The SSH key changes on each new VM/container.

The error does not happen when I run the same container from my local machine (it launches a different spot controller each time), but when launched from a VM (with a service account), it tries to access the same spot controller and fails.

Thanks for sharing more details @subhamde8247! One hypothesis is that both the Linux username and the hostname (python -c "import socket; print(socket.gethostname())") are the same across the containers. The user hash we generate to identify different machines is based on those two values, so identical values produce the same hash, which leads to the same spot controller being reused.

To confirm the hypothesis, it would be nice to check whether cat ~/.sky/user_hash prints the same value across the multiple VMs/containers.
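
For illustration, here is a quick check one could run inside each VM/container; the md5-based line is only a hypothetical stand-in for how a user hash might be derived from these two values, not SkyPilot's actual code:

    # The two values hypothesized to drive the user hash:
    whoami
    python -c "import socket; print(socket.gethostname())"

    # Hypothetical illustration: any hash derived only from these two values
    # will collide across containers sharing the same username and hostname.
    python -c "import getpass, hashlib, socket; print(hashlib.md5((getpass.getuser() + socket.gethostname()).encode()).hexdigest()[:8])"

    # The value SkyPilot actually cached:
    cat ~/.sky/user_hash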

There are several workarounds:

  1. Share the SSH key across the multiple VMs/containers by explicitly uploading/mounting those keys to them.
  2. Or, randomly generate a user hash for each VM/container when it is first provisioned, by writing a random value to ~/.sky/user_hash: python -c "import uuid; print(uuid.uuid4().hex[:8])" > ~/.sky/user_hash (a start-up sketch is below).
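
A minimal container start-up sketch for workaround 2 (assuming it runs before the first sky spot launch; the guard avoids overwriting a hash that already exists):

    # Give each fresh VM/container its own user hash so it gets its own
    # spot controller; only generate one if it does not exist yet.
    mkdir -p ~/.sky
    if [ ! -f ~/.sky/user_hash ]; then
        python -c "import uuid; print(uuid.uuid4().hex[:8])" > ~/.sky/user_hash
    fi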

We will also look into the issue and see whether the username and python -c "import socket; print(socket.gethostname())" are sufficient for identifying a user : )

subhamde8247 commented 4 months ago

Confirmed that cat ~/.sky/user_hash is the same across multiple runs of the Docker container when launched from multiple VMs. However, the hashes are different across multiple runs when the same container is run from my local machine.

explicitly uploading/mounting those keys

Yeah, this would add some complexity: storing these keys in GCP Secret Manager and properly loading them during container start-up each day.

randomly generate a user hash for each VM/container

We don't mind having a new spot controller for each daily job. The only issue is that old spot controllers are not auto-downed, so we are left with a bunch of stopped spot controller instances in our VM list (which we have to delete manually). If available, a --down option for the spot controller would work for us.

github-actions[bot] commented 1 week ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.