skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.82k stars 513 forks

[Serve] Feature request: support num_nodes for the Controller #4368

Open HysunHe opened 6 days ago

HysunHe commented 6 days ago

Currently, the controller node is a singleton, which makes it a single point of failure.

Besides its management role, the controller node also proxies user requests to the backend serving nodes, so if it fails, calling applications can no longer reach the backend services.

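Supporting multiple controller nodes, for example through a num_nodes setting for the controller, would remove this single point of failure.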

With multiple controller nodes, even if they are independent of each other, users can get high availability by, for example, putting a load balancer in front of them or using DNS round-robin.
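To illustrate the idea, here is a minimal client-side sketch of round-robin with failover across several independent controller endpoints. It is not SkyPilot's API: the class name `RoundRobinClient`, the `send` callable, and the example endpoint URLs are all hypothetical, and a real deployment would more likely rely on a load balancer or DNS as described above.

```python
from typing import Callable, List


class AllControllersDown(Exception):
    """Raised when every controller endpoint failed for one request."""


class RoundRobinClient:
    """Rotate requests across independent controller endpoints.

    Hypothetical helper for illustration only; not part of SkyPilot.
    """

    def __init__(self, endpoints: List[str], send: Callable[[str, str], str]):
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._endpoints = list(endpoints)
        self._send = send  # send(endpoint, payload) -> response
        self._next = 0     # index of the endpoint to try first on the next call

    def request(self, payload: str) -> str:
        n = len(self._endpoints)
        start = self._next
        # Advance the starting point so successive calls spread across endpoints.
        self._next = (self._next + 1) % n
        for offset in range(n):
            endpoint = self._endpoints[(start + offset) % n]
            try:
                return self._send(endpoint, payload)
            except ConnectionError:
                continue  # this controller is unreachable; try the next one
        raise AllControllersDown(f"all {n} controller endpoints failed")


if __name__ == "__main__":
    # Assumed endpoints; the second one simulates a failed controller.
    endpoints = [
        "http://controller-0.example:30001",
        "http://controller-1.example:30001",
    ]

    def fake_send(endpoint: str, payload: str) -> str:
        if "controller-1" in endpoint:
            raise ConnectionError("controller-1 is down")
        return f"{endpoint} handled {payload}"

    client = RoundRobinClient(endpoints, fake_send)
    for i in range(3):
        print(client.request(f"req-{i}"))
```

With one of the two controllers down, every request is still served by the remaining one, which is the availability benefit the request describes.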