Open JGSweets opened 3 hours ago
Hi @JGSweets ! Thanks for reporting this error. Just want to make sure, this is for multiple users (in multiple laptops) running SkyPilot in a shared AWS project?
@cblmemo Actually, this case would be for a single compute resource where SKYPILOT_USER_ID is set via environment variables. I would be inclined to believe that multiple laptops would have a similar effect.
When these conditions are true, a new controller may terminate existing resources being served by another controller.
This may result in a model node's cluster_name having the same name as an existing model node, depending on version. For example:
Existing controller -- USER_ID_HASH=12345678
Controller node: sky-sky-serve-controller-12345678-5678-head
Model node: sky-<SERVICE_NAME>-<VERSION>-5678-head
New controller -- USER_ID_HASH=11115678
Controller node: sky-sky-serve-controller-11115678-5678-head
Model node: sky-<SERVICE_NAME>-<VERSION>-5678-head
In this case, if the <VERSION> matches, the existing model node may get terminated.
I believe this results from the filter in terminate only looking for the Name:
I'm not sure if this is widespread across other deployment platforms.
In AWS, this could be resolved by including a new tag, in addition to the name, that specifies the correct controller/cluster association, and using that tag in the filter as well.
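To make the suggestion concrete, here is a rough sketch of what filtering on name plus an ownership tag might look like. It is not SkyPilot's actual code: the tag key `skypilot-controller-user-hash`, both function names, and the `myservice`/`1` placeholder values are all made up for illustration. `select_for_termination` is a local stand-in for the EC2 query so the collision can be reproduced without AWS access.

```python
from typing import List, Optional

# Hypothetical tag key, used only for illustration; not an existing SkyPilot tag.
CONTROLLER_TAG_KEY = "skypilot-controller-user-hash"


def build_terminate_filters(node_name: str, controller_hash: str) -> List[dict]:
    """EC2-style filters matching the Name tag AND the owning-controller tag."""
    return [
        {"Name": "tag:Name", "Values": [node_name]},
        {"Name": f"tag:{CONTROLLER_TAG_KEY}", "Values": [controller_hash]},
    ]


def select_for_termination(instances: List[dict], node_name: str,
                           controller_hash: Optional[str] = None) -> List[str]:
    """Local stand-in for the EC2 query: return ids of matching instances.

    With controller_hash=None only the Name tag is compared, which mimics
    the name-only filtering described above.
    """
    selected = []
    for inst in instances:
        tags = inst["tags"]
        if tags.get("Name") != node_name:
            continue
        if controller_hash is not None and tags.get(CONTROLLER_TAG_KEY) != controller_hash:
            continue
        selected.append(inst["id"])
    return selected


# Two model nodes launched by different controllers but sharing one Name.
# Assumed placeholders: <SERVICE_NAME>=myservice, <VERSION>=1.
NAME = "sky-myservice-1-5678-head"
instances = [
    {"id": "i-existing", "tags": {"Name": NAME, CONTROLLER_TAG_KEY: "12345678"}},
    {"id": "i-new", "tags": {"Name": NAME, CONTROLLER_TAG_KEY: "11115678"}},
]

print(select_for_termination(instances, NAME))              # name only: both ids
print(select_for_termination(instances, NAME, "11115678"))  # name + tag: only i-new
```

With boto3, a list like the one returned by `build_terminate_filters` could be passed as the `Filters=` argument of `describe_instances`, so only instances carrying both the name and the matching controller tag would be selected for termination.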
Possible tags: