skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[BUG][AWS] Serving a new controller can result in terminating previously served models. #4143

Open JGSweets opened 3 hours ago

JGSweets commented 3 hours ago

When these conditions are true, a new controller may terminate existing resources being served by another controller.

This may result in a new model node's cluster_name being identical to that of an existing model node, depending on the version, e.g.


Existing controller -- USER_ID_HASH=12345678
Controller node: sky-sky-serve-controller-12345678-5678-head
Model node: sky-<SERVICE_NAME>-<VERSION>-5678-head


New controller -- USER_ID_HASH=11115678
Controller node: sky-sky-serve-controller-11115678-5678-head
Model node: sky-<SERVICE_NAME>-<VERSION>-5678-head


In this case, if the <VERSION> matches, the existing model node may get terminated.
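To illustrate the collision, here is a minimal sketch that rebuilds the names shown above. The helper functions are hypothetical and the naming scheme (suffix = last four characters of the user hash) is only inferred from the two examples; it is not taken from the SkyPilot source.

```python
# Hypothetical helpers: the format strings mirror the example names above.
# Assumption: the "-5678" suffix is the last four characters of USER_ID_HASH.

def controller_node_name(user_id_hash: str) -> str:
    """Controller node name; embeds the full hash, so it stays unique."""
    return f'sky-sky-serve-controller-{user_id_hash}-{user_id_hash[-4:]}-head'


def model_node_name(service_name: str, version: int, user_id_hash: str) -> str:
    """Model node name; only the four-character suffix depends on the hash."""
    return f'sky-{service_name}-{version}-{user_id_hash[-4:]}-head'


existing_hash = '12345678'
new_hash = '11115678'

# Controllers differ because the full hash is part of the name.
controllers_collide = (
    controller_node_name(existing_hash) == controller_node_name(new_hash))

# Model nodes collide when service name, version, and hash suffix all match.
model_nodes_collide = (
    model_node_name('my-service', 1, existing_hash)
    == model_node_name('my-service', 1, new_hash))

print(controllers_collide)   # False
print(model_nodes_collide)   # True
```

Under these assumptions, any two user hashes sharing the same last four characters produce the same model node name for the same service and version.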

I believe this results from the filter in terminate matching only on the cluster name tag:

return [{
    'Name': f'tag:{constants.TAG_RAY_CLUSTER_NAME}',
    'Values': [cluster_name_on_cloud],
}]

I'm not sure if this is widespread across other deployment platforms.

In AWS, this could be resolved by adding a new tag, in addition to the name, that specifies the correct controller/cluster association, and using that tag in the filter as well.
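A rough sketch of what such a filter might look like is below. The tag name `skypilot-user-hash` and the function names are hypothetical, and the value used for `TAG_RAY_CLUSTER_NAME` is an assumption standing in for `constants.TAG_RAY_CLUSTER_NAME`. The `matches` helper only simulates EC2 filter behavior locally (separate filters are ANDed, values within one filter are ORed) so the effect can be checked without AWS access.

```python
from typing import Any, Dict, List

# Assumption: stand-in for constants.TAG_RAY_CLUSTER_NAME in SkyPilot.
TAG_RAY_CLUSTER_NAME = 'ray-cluster-name'
# Hypothetical tag recording which user/controller owns the instance.
TAG_OWNER_USER_HASH = 'skypilot-user-hash'


def cluster_filters(cluster_name_on_cloud: str,
                    user_hash: str) -> List[Dict[str, Any]]:
    """Build EC2-style filters on both the cluster name and the owner tag."""
    return [
        {
            'Name': f'tag:{TAG_RAY_CLUSTER_NAME}',
            'Values': [cluster_name_on_cloud],
        },
        {
            'Name': f'tag:{TAG_OWNER_USER_HASH}',
            'Values': [user_hash],
        },
    ]


def matches(instance_tags: Dict[str, str],
            filters: List[Dict[str, Any]]) -> bool:
    """Local simulation: every filter must match one of its values."""
    for tag_filter in filters:
        key = tag_filter['Name'][len('tag:'):]
        if instance_tags.get(key) not in tag_filter['Values']:
            return False
    return True


# Two model nodes with the same cluster name but different owners.
existing_node = {TAG_RAY_CLUSTER_NAME: 'sky-svc-1-5678',
                 TAG_OWNER_USER_HASH: '12345678'}
new_node = {TAG_RAY_CLUSTER_NAME: 'sky-svc-1-5678',
            TAG_OWNER_USER_HASH: '11115678'}

filters = cluster_filters('sky-svc-1-5678', '11115678')
print(matches(new_node, filters))       # True
print(matches(existing_node, filters))  # False
```

With the extra tag in the filter, a terminate issued on behalf of the new controller would no longer select the existing controller's model node, provided every instance is tagged at launch.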

Possible tags:

cblmemo commented 3 hours ago

Hi @JGSweets ! Thanks for reporting this error. Just want to make sure, this is for multiple users (in multiple laptops) running SkyPilot in a shared AWS project?

JGSweets commented 3 hours ago

> Hi @JGSweets ! Thanks for reporting this error. Just want to make sure, this is for multiple users (in multiple laptops) running SkyPilot in a shared AWS project?

@cblmemo Actually, this case would be for a single compute resource where SKYPILOT_USER_ID is set via environment variables. I would be inclined to believe that multiple laptops would have a similar effect.