vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Support Ray-free multi-node distributed inference on resource managers like Kubernetes #3902

Open Jeffwan opened 7 months ago

Jeffwan commented 7 months ago

🚀 The feature, motivation and pitch

Currently, distributed inference (TP) in vLLM relies on Ray to orchestrate the GPU workers. I briefly checked the code, and it seems the core distributed communication is provided by torch.distributed with the NCCL backend; the actors' communication is not done through Ray's own protocol. In this case, Ray just plays the role of orchestration and resource reservation (placement groups). Please correct me if I am wrong.

We do use Ray and KubeRay on Kubernetes, and I've successfully tested vLLM distributed inference on this setup, confirming that it works. However, we have many users/platforms, and we do not want to lock into Ray, since some teams may not have enough Ray knowledge to operate it. My proposal is to provide a simple orchestration layer on top of GPUExecutor for users who are familiar with cloud-native technology and would like to use Kubernetes's capabilities for orchestration (instead of Ray actors) and scheduling (instead of placement groups).

Ideally, we would have both Ray and Kubernetes available as orchestrators for vLLM, giving our platform users alternative options for their needs.

Please help check whether this proposal makes sense. I can contribute to this feature.

Alternatives

No response

Additional context

No response

robertgshaw2-neuralmagic commented 7 months ago

Oops - realized you're talking about multi-node

Deleted my comment

simon-mo commented 7 months ago

I think this makes sense. Adding @njhill here for context about local multiprocessing and @youkaichao for implementing the NCCL wrapper/abstraction for distributed communication. The main work here seems to be setting the right env vars through K8s (a StatefulSet?) so that the containers know each other's addresses and their respective group/rank.
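
For illustration, here is a minimal sketch of how each pod could derive those values on its own, assuming a StatefulSet whose pods are named vllm-0, vllm-1, ... behind a headless Service called vllm-headless (all names hypothetical, one process per pod):

# Hypothetical sketch only: derive the torch.distributed settings on each pod of a
# Kubernetes StatefulSet; the pod/Service names ("vllm-N", "vllm-headless") are assumptions.
import os
import socket

def distributed_env_from_statefulset(world_size: int) -> dict:
    # StatefulSet pods get stable ordinal names like "vllm-0", "vllm-1", ...;
    # the trailing ordinal can double as the global rank when running one process per pod.
    hostname = socket.gethostname()             # e.g. "vllm-3"
    rank = int(hostname.rsplit("-", 1)[-1])
    return {
        "MASTER_ADDR": "vllm-0.vllm-headless",  # pod 0 reached via a headless Service (assumed name)
        "MASTER_PORT": "29500",                 # any port all pods agree on
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
    }

if __name__ == "__main__":
    os.environ.update(distributed_env_from_statefulset(world_size=2))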

youkaichao commented 7 months ago

Yeah, I think the main TODO is that we have to know the process was launched by kube etc., and then only the master node & process starts the API server while the other processes just join the task group.

Currently we always let the main process launch all the other tasks. That's why we rely on Ray, e.g. for launching tasks on another node.

njhill commented 7 months ago

For multi-GPU single node without Ray, we have https://github.com/vllm-project/vllm/pull/3466, which works very well for us and which I hope to get merged soon.

Currently we always let the main process launch all the other tasks. That's why we rely on Ray, e.g. for launching tasks on another node.

Yes, to get to multi-node without Ray, two parts are needed: (1) distributed process orchestration and (2) some IPC mechanism that works between nodes.

https://github.com/vllm-project/vllm/pull/3763 is a step towards (2), and we are thinking torch.distributed (probably with CPU/gloo) could be used for all IPC (i.e. including the message-passing that Ray is currently used for).
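
As a rough illustration of that direction (not the actual vLLM code), torch.distributed with the gloo backend can already pass arbitrary Python objects between ranks:

# Rough sketch of the idea, not vLLM's implementation: a gloo (CPU) process group
# used to pass an arbitrary Python object from rank 0 to every other rank.
# Assumes MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are already set in the environment.
import torch.distributed as dist

dist.init_process_group(backend="gloo")     # env:// rendezvous by default

message = [{"method": "execute_model", "request_id": 0}] if dist.get_rank() == 0 else [None]
dist.broadcast_object_list(message, src=0)  # every rank now holds rank 0's object
print(f"rank {dist.get_rank()} received: {message[0]}")

dist.destroy_process_group()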

For (1) we would need to support an option to launch worker processes independently of the main process. Then Kubernetes could be used with either a pod per worker/gpu or per node. There is a proposal for a new Kube API to make this part easier, but in the meantime it could be achieved with more explicit manual configuration.

youkaichao commented 7 months ago

By the way, this is not tied to Ray or Kubernetes. A more general approach is to free vLLM from launching processes itself: instead, we let others launch the vLLM processes, and only the master process launches the API server.

For example, one option is torchrun, e.g.

# single node, multi-GPU
torchrun --nproc-per-node=n -m vllm.entrypoints.openai.api_server $args

# multi node, on node 0
torchrun --nnodes 2 --nproc-per-node=n --rdzv_backend=c10d --rdzv_endpoint=${node_0_ip}:${port} -m vllm.entrypoints.openai.api_server $args
# multi node, on node 1
torchrun --nnodes 2 --nproc-per-node=n --rdzv_backend=c10d --rdzv_endpoint=${node_0_ip}:${port} -m vllm.entrypoints.openai.api_server $args

This way, torchrun launches multiple processes, and each process executes the module vllm.entrypoints.openai.api_server. Inside the module, all environment variables are already set, and only the process with rank == 0 launches an API server; the rest of the processes directly join the master process for distributed inference.

This method is agnostic to different cluster management tools. As long as we can execute commands on each node, we are good:

The only thing needed from the cluster management tool is to assign a master IP and a master port, and then execute the same command on each node.
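
A hypothetical sketch of the rank-dependent branch such an entrypoint could contain (the helper functions below are placeholder stubs, not real vLLM APIs):

# Illustrative sketch only; serve_api() and worker_loop() are placeholders,
# not real vLLM entrypoints.
import os

def serve_api() -> None:
    print("rank 0: would start the OpenAI-compatible API server here")

def worker_loop() -> None:
    print(f"rank {os.environ.get('RANK')}: would join the distributed group and wait for work")

def main() -> None:
    rank = int(os.environ.get("RANK", "0"))   # exported by torchrun (or by the cluster manager)
    if rank == 0:
        serve_api()                           # only the master process exposes the HTTP endpoint
    else:
        worker_loop()                         # all other ranks act as pure workers

if __name__ == "__main__":
    main()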

njhill commented 7 months ago

By the way, this is not tied to Ray or Kubernetes. A more general approach is to free vLLM from launching processes itself: instead, we let others launch the vLLM processes, and only the master process launches the API server.

@youkaichao yes, that's what I was referring to by "we would need to support an option to launch worker processes independently of the main process".

This way, torchrun launches multiple processes, and each process executes the module vllm.entrypoints.openai.api_server. Inside the module, all environment variables are already set, and only the process with rank == 0 launches an API server; the rest of the processes directly join the master process for distributed inference.

Agree, once we have that support then torchrun could be one option. See also the discussion here: https://github.com/vllm-project/vllm/pull/3691#issuecomment-2030849356.

However, IMHO we should support both modes. At least for single-node, it's also nice to be able to just launch one vLLM process and have it run the other workers in a self-contained way, as it does now.

youkaichao commented 7 months ago

At least for single-node, it's also nice to be able to just launch one vllm process and have it run the other workers in a self-contained way as it does now.

Agree. And that's kind of a UX problem, which can be handled in a unified way: vLLM can detect whether it is running inside torchrun by inspecting environment variables, and when it is not inside torchrun but the tensor parallel size is > 1, it knows to launch the workers itself.

This is not limited to torchrun. If we go with other options for launching worker processes independently of the main process, vLLM can also detect whether it was launched by those options.
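
For example, a detection helper along these lines could work; the exact set of variables checked here is an assumption based on what torchrun exports for each worker:

# Sketch of the detection idea (not vLLM's actual logic): decide whether this process
# was spawned by torchrun by checking the environment variables torchrun exports.
import os

def launched_by_torchrun() -> bool:
    # torchrun (torch.distributed.elastic) sets these for every worker it spawns.
    return all(v in os.environ for v in ("TORCHELASTIC_RUN_ID", "RANK", "WORLD_SIZE"))

def should_self_launch_workers(tensor_parallel_size: int) -> bool:
    # Only spawn workers ourselves when no external launcher has already done so.
    return tensor_parallel_size > 1 and not launched_by_torchrun()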

pravingadakh commented 7 months ago

@Jeffwan Were you able to run vLLM distributed with tensor parallelism in a RayCluster successfully? If so, could you please post the script you used (and the RayCluster configuration)? We have been trying to get vLLM to run distributed with tensor parallelism on Ray but have failed so far.

Jeffwan commented 7 months ago

@youkaichao @simon-mo Seems we'd like to refactor the interface and abstraction first. I will do more testing downstream and keep an eye on #3587 at the same time.

Jeffwan commented 7 months ago

@pravingadakh yeah. Let me file a PR to improve the distributed inference guidance.

ahg-g commented 6 months ago

Just wanted to share that with LWS, Ray becomes an implementation detail of vllm when deploying on k8s: see https://docs.vllm.ai/en/latest/serving/deploying_with_lws.html for how that works.

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!