vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Support Kubernetes for Distributed Serving #457

Open sam-h-bean opened 12 months ago

sam-h-bean commented 12 months ago

Only supporting Ray for distributed inference will significantly reduce adoption of this tool, even if it truly is more performant than TGI. TGI can be run as a black-box image on Kubernetes with support for sharded models, and vLLM should support this as well.

hughesadam87 commented 7 months ago

Does this ticket mean that distributed serving is not supported on Kubernetes, even if the cluster has Ray installed as per this quickstart guide?

https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/raycluster-quick-start.html

The docs are very sparse here and I am confused, since they imply Ray can be used for distributed inference: https://vllm.readthedocs.io/en/latest/serving/distributed_serving.html
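
For reference, that distributed serving page shows tensor-parallel inference being driven from Python, with Ray used under the hood once `tensor_parallel_size > 1`. Here is a minimal sketch of my understanding; the model name and parallel size are just placeholders, and I'm assuming the Ray workers would come from whatever cluster KubeRay provisions:

```python
# Sketch based on the distributed serving docs: with tensor_parallel_size > 1,
# vLLM shards the model across GPUs using Ray workers. Model name and sizes
# below are placeholders, not a verified Kubernetes recipe.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-13b", tensor_parallel_size=4)

outputs = llm.generate(
    ["What is the capital of France?"],
    SamplingParams(temperature=0.8, max_tokens=64),
)
for out in outputs:
    print(out.outputs[0].text)
```

What is not clear to me is whether this works when the Ray cluster lives in Kubernetes pods (as in the KubeRay quickstart above) rather than on a single node, which is what this issue seems to be asking for.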