skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.52k stars, 464 forks

[Core][Serve] Cannot specify security groups for the serve nodes. #3473

Open JGSweets opened 4 months ago

JGSweets commented 4 months ago

As of https://github.com/skypilot-org/skypilot/issues/1354, we can specify a default security_group_name for AWS resources. However, when launching serve, two conflicts occur between the serve config requirements and the sky config requirements:

  1. The serve resources require a port in the serve config.
    • ValueError: Must only specify one port in resources. Each replica will use the port specified as application ingress port.

  2. A security group name cannot be used if ports are specified in a resource.
    • ValueError: Cannot specify ports when AWS security group name is specified.
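
To make the conflict concrete, here is a minimal sketch of how the two checks interact. The error messages are the ones quoted above; the function names and structure are hypothetical and are not SkyPilot's actual implementation.

```python
from typing import List, Optional

# Error messages quoted from the report above. Everything else here is a
# hypothetical stand-in that only mimics the two checks.
SERVE_PORT_ERROR = ('Must only specify one port in resources. Each replica '
                    'will use the port specified as application ingress port.')
SG_PORT_ERROR = ('Cannot specify ports when AWS security group name is '
                 'specified.')


def check_serve_ports(ports: List[str]) -> None:
    """Serve-side rule: exactly one port must be given."""
    if len(ports) != 1:
        raise ValueError(SERVE_PORT_ERROR)


def check_security_group(ports: List[str],
                         security_group_name: Optional[str]) -> None:
    """AWS-side rule: ports and a security group name are mutually exclusive."""
    if security_group_name is not None and ports:
        raise ValueError(SG_PORT_ERROR)


def collect_errors(ports: List[str],
                   security_group_name: Optional[str]) -> List[str]:
    """Run both checks and return the messages of the ones that fail."""
    errors = []
    checks = (
        lambda: check_serve_ports(ports),
        lambda: check_security_group(ports, security_group_name),
    )
    for check in checks:
        try:
            check()
        except ValueError as e:
            errors.append(str(e))
    return errors
```

With a security group name set, no choice of ports satisfies both rules: one port trips the second check, and no port trips the first.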

This could be resolved by allowing the serve config to specify the security_group_name for the resource as well.

Moreover, it would be ideal to be able to specify different security groups for the head and worker nodes, as Ray already provides this capability.

Thanks!

JGSweets commented 4 months ago

Having looked into this, the fix looks quite complex if we want to maintain the integrity of the port validation.

There appear to be a few routes to achieve this:

  1. Allow SGs to be used even when ports are specified.
    • The SG's inbound/outbound rules may not properly align with the ports. (However, this is already true if you specify and launch a service.)
    • We could validate the SG's ports against the resource when validating ports, but this requires an AWS call.
  2. Allow the LoadBalancer in serve to not set the ports.
    • Possibly done via the config -> serve -> controller schema.
    • Similar issue of alignment between ports and the SG.
    • If an SG is set, we could skip setting the ports in the resource.
  3. Allow SGs to be set instead of ports.
    • Not all cloud resources have SGs, which makes it hard to keep the resource cloud-agnostic.
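
As a rough sketch of the validation mentioned in route 1, the check below tests whether a port is covered by a security group's inbound rules. It assumes the rules have already been fetched and follow the shape of the `IpPermissions` entries returned by EC2's DescribeSecurityGroups (`IpProtocol`, `FromPort`, `ToPort`). The function name is hypothetical, and the AWS call itself is left out.

```python
from typing import Any, Dict, List


def sg_allows_port(ip_permissions: List[Dict[str, Any]],
                   port: int,
                   protocol: str = 'tcp') -> bool:
    """Return True if any inbound rule covers `port` for `protocol`.

    `ip_permissions` is assumed to look like the IpPermissions list of an
    EC2 security group; source ranges (IpRanges etc.) are ignored here.
    """
    for rule in ip_permissions:
        rule_protocol = rule.get('IpProtocol')
        # '-1' means all protocols, and therefore all ports.
        if rule_protocol == '-1':
            return True
        if rule_protocol != protocol:
            continue
        from_port = rule.get('FromPort')
        to_port = rule.get('ToPort')
        if from_port is None or to_port is None:
            continue
        if from_port <= port <= to_port:
            return True
    return False


# Example: the serve port 8080 falls inside an 8000-9000 TCP rule.
rules = [{'IpProtocol': 'tcp', 'FromPort': 8000, 'ToPort': 9000}]
covered = sg_allows_port(rules, 8080)
```

This only covers the port range. A real check would also need to consider which sources the rule allows, which is part of why this route is not trivial.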

Other notable features when considering: