vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Support multi-node serving on Kubernetes #8074

Open linnlh opened 2 months ago

linnlh commented 2 months ago

🚀 The feature, motivation and pitch

Hi, I'm currently working on deploying vLLM across multiple nodes in a Kubernetes cluster. I saw that the official documentation links to a guide for deploying vLLM for distributed model serving using LWS, and the KubeRay team also provides a multi-node deployment solution based on Ray Serve. However, neither of these solutions has been integrated into the vLLM codebase.

I was wondering whether the official vLLM team has any development plans in this area. If so, I'm willing to contribute code to support it.

Alternatives

I have tried the Ray Serve example to deploy a 2-node serving setup, and it does work, although some parts of the code had to be modified to be compatible with the latest version of vLLM. A rough sketch of this kind of setup is included after the list of related work below.

The related works are listed below:

  • #3522
  • The guide for deploying a distributed inference service with vLLM using LWS
  • The example for setting up multi-GPU serving with Ray Serve
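
For reference, a rough sketch of this kind of setup is below. It is not the official example: the model name, GPU counts, tensor-parallel size, and the `distributed_executor_backend` argument are placeholders/assumptions and may need adjusting for a specific vLLM version and cluster.

```python
# Rough sketch only, not the official KubeRay example. The model name,
# GPU counts, and parallel sizes below are placeholders, and the engine
# arguments may differ slightly between vLLM versions.
from ray import serve
from vllm import LLM, SamplingParams


@serve.deployment(ray_actor_options={"num_gpus": 1})
class VLLMDeployment:
    def __init__(self) -> None:
        # With a tensor_parallel_size larger than the number of GPUs on a
        # single node, vLLM runs its workers through the Ray distributed
        # executor, so they can be scheduled onto other nodes of the Ray
        # cluster.
        self.llm = LLM(
            model="meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder
            tensor_parallel_size=8,  # e.g. spans two 4-GPU nodes
            distributed_executor_backend="ray",
        )

    async def __call__(self, request):
        body = await request.json()
        params = SamplingParams(max_tokens=body.get("max_tokens", 128))
        output = self.llm.generate([body["prompt"]], params)[0]
        return {"text": output.outputs[0].text}


# The bound application; on KubeRay this is what a RayService manifest points at.
app = VLLMDeployment.bind()
```

On Kubernetes, this application is then deployed through a KubeRay RayService manifest, with the Ray head and worker pods providing the GPUs that the tensor-parallel workers are placed on.
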
youkaichao commented 2 months ago

cc @richardliaw @rkooo567 if you need to update ray doc

youkaichao commented 2 months ago

@flliny we'd be happy to add links in https://docs.vllm.ai/en/latest/serving/distributed_serving.html#multi-node-inference-and-serving pointing to your URLs.

The vLLM team itself will not work on a KubeRay integration.

linnlh commented 2 months ago

> @flliny we'd be happy to add links in https://docs.vllm.ai/en/latest/serving/distributed_serving.html#multi-node-inference-and-serving pointing to your URLs.
>
> The vLLM team itself will not work on a KubeRay integration.

Thanks for the reply. 🙏

sunmac commented 1 month ago

> @flliny we'd be happy to add links in https://docs.vllm.ai/en/latest/serving/distributed_serving.html#multi-node-inference-and-serving pointing to your URLs.
>
> The vLLM team itself will not work on a KubeRay integration.

There is still no official Ray Serve integration. Does the vLLM team intend to provide one? Plain Kubernetes Deployments aren't suitable for distributed inference services, since the pods in a Deployment are independent replicas with no way to group a leader with its workers or schedule them together. I'm considering either LWS or Ray Serve to deploy my service. If the project officially supported multi-replica distributed inference on Ray Serve, it would make deploying such services much easier.

Jeffwan commented 1 month ago

@linnlh we have a proposal and have already built an internal version for this use case: https://docs.google.com/document/d/1K8Ve6KrabpexH-gIEcby9tKTEFysTd6kOZKZa_EdgRQ/edit#heading=h.fw9nktz8l24d. The open-source plan is on the way; feel free to share your feedback.