Open subhamde8247 opened 4 months ago
One hack we have been using for now: run `gcloud compute instances list` at the end of each run to list all VMs matching the pattern `sky-spot-controller-`, and delete them. Wondering if I am missing something or if there is a better solution.
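For concreteness, a minimal sketch of that cleanup (assuming default `gcloud` project/auth; the filter and loop are illustrative, not the exact command we run):

```bash
# List controller VMs by name prefix, then delete each one.
gcloud compute instances list \
  --filter="name ~ ^sky-spot-controller-" \
  --format="value(name,zone.basename())" |
while read -r name zone; do
  gcloud compute instances delete "$name" --zone "$zone" --quiet
done
```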
Thanks for the report @subhamde8247! Could you share some details of this client VM? For example,
@subhamde8247 Got it. Some followups:

- Do you mount `~/.sky` to the new VM/container everyday? This would explain why the new VM/container reuses the same spot controller.
- Does the SSH key, `~/.ssh/sky-key{.pub}`, change on the new VM/container? A newly generated key would explain why the connection to the same spot controller VM is unsuccessful.

A possible workaround: upload/mount `~/.ssh/sky-key{.pub}` to the new VM/container everyday. This way the same spot controller can be reused, and during idle periods it'd be autostopped to save costs.
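One quick way to check the second question across days (a sketch; `~/.ssh/sky-key.pub` is the key path mentioned above):

```bash
# Print the public key fingerprint on each new VM/container;
# a different fingerprint on day 2 means the key was regenerated.
ssh-keygen -lf ~/.ssh/sky-key.pub
```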
- We do not explicitly mount `~/.sky` to VM/container.
- SSH key changes on new VM/container.

The error does not happen when I run the same container from my local (launches a different spot controller each time), but when launched from a VM (with a service account), it tries to access the same spot controller and fails.
Thanks for sharing more details @subhamde8247! One hypothesis would be that both the Linux username and the hostname (`python -c "import socket; print(socket.gethostname())"`) are the same across the multiple containers. The user hash we generate to identify different machines is derived from those two values, so identical values produce the same hash, which leads to using the same spot controller.

To confirm the hypothesis, it would be nice to check whether `cat ~/.sky/user_hash` has the same value across multiple VMs/containers.
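To make the check concrete (a sketch, not SkyPilot's exact code), compare the two identity inputs and the resulting hash on each VM/container:

```bash
# Identical output across daily containers implies the same user hash,
# and hence the same spot controller.
echo "user=$(whoami) host=$(python -c 'import socket; print(socket.gethostname())')"
cat ~/.sky/user_hash
```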
There are several workarounds:

- Explicitly upload/mount those keys (as suggested earlier), so the same spot controller is reused across days.
- Randomly generate a user hash for each VM/container by overwriting `~/.sky/user_hash`: `python -c "import uuid; print(uuid.uuid4().hex[:8])" > ~/.sky/user_hash` (see the sketch after this list).
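A sketch of the second workaround in a container entrypoint (placing it in the entrypoint is an assumption; the key point is overwriting the hash before any `sky` command runs):

```bash
# Give each fresh container its own identity so it gets its own controller.
mkdir -p ~/.sky
python -c "import uuid; print(uuid.uuid4().hex[:8])" > ~/.sky/user_hash
sky spot launch ...  # subsequent commands now see the new user hash
```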
We will also look into the issue and see whether the username and the hostname (`python -c "import socket; print(socket.gethostname())"`) are sufficient for identifying a user : )
Confirmed that `cat ~/.sky/user_hash` is the same for multiple runs of the docker container when launched from multiple VMs. However, the hashes are different across multiple runs when the same container is run from my local machine.
> explicitly upload/mounting those keys

Yeah, this will add some complexity: storing these keys in GCP Secret Manager and properly loading them during the container start-up each day.
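For reference, a sketch of that start-up loading (the secret names `sky-key`/`sky-key-pub` are hypothetical; assumes the VM's service account can access them):

```bash
# Restore the persisted SSH key pair before running any sky commands.
mkdir -p ~/.ssh
gcloud secrets versions access latest --secret=sky-key > ~/.ssh/sky-key
gcloud secrets versions access latest --secret=sky-key-pub > ~/.ssh/sky-key.pub
chmod 600 ~/.ssh/sky-key
```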
> randomly generate a user hash for each VM/container

We don't mind having a new spot controller for each daily job. The only issue: old spot controllers are not auto-downed, and we are left with a bunch of stopped spot controller instances in our VM list (which we have to manually delete). If available, a `--down` option for the spot controller would work for us.
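Until such an option exists, the cleanup hack above can at least be narrowed to stopped controllers (a sketch; `TERMINATED` is GCE's status for a stopped instance):

```bash
# List only controllers that are already stopped; delete them as in the
# earlier cleanup loop.
gcloud compute instances list \
  --filter="name ~ ^sky-spot-controller- AND status=TERMINATED" \
  --format="value(name,zone.basename())"
```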
This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.
I am getting this error:

I am running LLM inferences using SkyPilot and vLLM. I use `sky spot launch` from a docker container inside a GCP VM that has a service account attached to it. We run this job daily: a VM with the same name and region is spun up every day, and the docker container inside the VM in turn initiates the sky cluster. On day 1, everything works fine. But on day 2, `sky spot launch` tries to use the same spot controller from day 1 and errors out with the logs above. I can see that the spot controller is active and can SSH onto it.

Any hints @Michaelvll? I saw your comment here, but do not fully understand what the solution is.