Open Jeffwan opened 7 months ago
Oops - realized you're talking about multi-node
Deleted my comment
I think this makes sense. Adding @njhill here for context about local multiprocessing and @youkaichao for implementing the nccl wrapper/abstraction for distributed communication. The main work here seems to be setting the right env vars through K8s (StatefulSet?) so that the containers know each other's addresses and their respective group/rank.
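For illustration, a minimal Python sketch of that idea, assuming one pod per node in a StatefulSet with a headless Service fronting the rank-0 pod; the function names and the one-rank-per-pod mapping are my own assumptions, not existing vLLM code:
import os
import socket

def rank_from_pod_ordinal() -> int:
    # StatefulSet pods are named <statefulset-name>-<ordinal>, e.g. "vllm-worker-3",
    # so the trailing ordinal can serve as the (node) rank.
    return int(socket.gethostname().rsplit("-", 1)[-1])

def configure_distributed_env(master_addr: str, master_port: int, world_size: int) -> None:
    # Populate the usual torch.distributed env vars from the pod's identity.
    os.environ.setdefault("RANK", str(rank_from_pod_ordinal()))
    os.environ.setdefault("WORLD_SIZE", str(world_size))
    os.environ.setdefault("MASTER_ADDR", master_addr)   # e.g. "vllm-worker-0.vllm-headless"
    os.environ.setdefault("MASTER_PORT", str(master_port))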
Yeah, I think the main todo is that we have to know when the process is launched by kube etc., and then only the master node & process starts the api server while the other processes just join the task group.
Currently we always let the main process launch all the other tasks. That's why we rely on ray, e.g. for launching tasks on another node.
For multi-GPU single node without Ray, we have https://github.com/vllm-project/vllm/pull/3466, which works very well for us and which I hope will be merged soon.
Currently we always let the main process launch all the other tasks. That's why we rely on ray, e.g. for launching tasks on another node.
Yes, to get to multi-node without Ray, there are two parts needed: (1) distributed process orchestration and (2) some IPC mechanism that will work between nodes.
https://github.com/vllm-project/vllm/pull/3763 is a step towards (2), and we are thinking torch.distributed (probably with CPU/gloo) could be used for all IPC (i.e. including the message-passing that Ray is currently used for).
For (1) we would need to support an option to launch worker processes independently of the main process. Then Kubernetes could be used with either a pod per worker/gpu or per node. There is a proposal for a new Kube API to make this part easier, but in the meantime it could be achieved with more explicit manual configuration.
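As a rough illustration of (2), here is a minimal sketch of CPU/gloo message passing with torch.distributed, assuming RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are already set by whatever handles (1); this shows the general pattern only, not the code in the linked PR:
import torch.distributed as dist

def init_control_plane() -> None:
    # gloo runs on CPU and over TCP between nodes, so it can carry the small
    # control messages (e.g. "run this method with these args") that Ray
    # currently relays between the driver and the workers.
    dist.init_process_group(backend="gloo")  # uses RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT

def driver_broadcast(request: dict) -> dict:
    # Rank 0 provides the object; every rank must enter the same collective.
    obj = [request]
    dist.broadcast_object_list(obj, src=0)
    return obj[0]

def worker_receive() -> dict:
    obj = [None]
    dist.broadcast_object_list(obj, src=0)
    return obj[0]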
By the way, this is not tied to ray or kubernetes. A more general way is to free vllm of launching processes. Instead, we let others launch the vllm processes, and only the master process launches the api server. For example, one option is torchrun, e.g.:
# single node, multi-gpu
torchrun --nproc-per-node=n -m vllm.entrypoints.openai.api_server $args
# multi node, on node 0
torchrun --nnodes 2 --nproc-per-node=n --rdzv_backend=c10d --rdzv_endpoint=${node_0_ip}:${port} -m vllm.entrypoints.openai.api_server $args
# multi node, on node 1
torchrun --nnodes 2 --nproc-per-node=n --rdzv_backend=c10d --rdzv_endpoint=${node_0_ip}:${port} -m vllm.entrypoints.openai.api_server $args
This way, torchrun launches multiple processes, and each process executes the module vllm.entrypoints.openai.api_server. Inside the module, all environment variables are already set, and only the process with rank==0 launches an api server. The rest of the processes directly join the master process for distributed inference.
This method is agnostic to different cluster management tools. As long as we can execute commands on each node, we are good: the only thing needed from the cluster management tool is to assign a master ip and a master port, and then execute the same command on each node.
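A toy sketch of what such an entrypoint could look like, just to show the rank==0 branch; the function bodies are placeholders, not actual vLLM code:
import os

def run_api_server() -> None:
    # Placeholder: rank 0 would start the HTTP server and drive the engine.
    print("rank 0: starting api server")

def join_distributed_workers() -> None:
    # Placeholder: other ranks would just participate in the TP collectives.
    print(f"rank {os.environ.get('RANK')}: joining the distributed workers")

def main() -> None:
    # torchrun exports RANK (plus WORLD_SIZE, MASTER_ADDR, MASTER_PORT) per process.
    if int(os.environ.get("RANK", "0")) == 0:
        run_api_server()
    else:
        join_distributed_workers()

if __name__ == "__main__":
    main()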
By the way, this is not tied to ray or kubernetes. A more general way is to free vllm of launching processes. Instead, we let others launch the vllm processes, and only the master process launches the api server.
@youkaichao yes, that's what I was referring to by "we would need to support an option to launch worker processes independently of the main process".
This way, torchrun launches multiple processes, and each process executes the module vllm.entrypoints.openai.api_server. Inside the module, all environment variables are already set, and only the process with rank==0 launches an api server. The rest of the processes directly join the master process for distributed inference.
Agree, once we have that support then torchrun could be one option. See also the discussion here: https://github.com/vllm-project/vllm/pull/3691#issuecomment-2030849356.
However IMHO we should support both modes. At least for single-node, it's also nice to be able to just launch one vllm process and have it run the other workers in a self-contained way as it does now.
At least for single-node, it's also nice to be able to just launch one vllm process and have it run the other workers in a self-contained way as it does now.
Agree. And that's kind of a UX problem, which can be handled in a unified way. vllm can detect whether it is inside torchrun by inspecting environment variables, and when it is not inside torchrun but tensor parallel size > 1, it knows to launch the workers itself.
This is not limited to torchrun. If we go for other options to launch worker processes independently of the main process, vllm can also detect whether it was launched by those options.
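For example, a sketch of that detection idea (an assumption about how it could be done, not existing vLLM behaviour): torchrun/torchelastic exports RANK, WORLD_SIZE and LOCAL_RANK, so their presence can serve as the signal:
import os

# Sketch only: treat the presence of torchrun's env vars as "launched externally".
_LAUNCHER_VARS = ("RANK", "WORLD_SIZE", "LOCAL_RANK")

def launched_by_external_launcher() -> bool:
    return all(v in os.environ for v in _LAUNCHER_VARS)

def should_self_launch_workers(tensor_parallel_size: int) -> bool:
    # Only spawn workers ourselves when TP > 1 and nobody else launched us.
    return tensor_parallel_size > 1 and not launched_by_external_launcher()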
@Jeffwan Were you able to run vLLM distributed inference in a RayCluster with tensor parallelism successfully? If so, could you please post the script you used (and the RayCluster configuration)? We have been trying to get vLLM to run in a distributed manner with tensor parallelism on Ray but have failed so far.
@youkaichao @simon-mo Seems we'd like to refactor the interface and abstraction first. I will do more testing downstream and keep an eye on #3587 at the same time.
@pravingadakh yeah. Let me file a PR to improve the distributed inference guidance.
Just wanted to share that with LWS, Ray becomes an implementation detail of vllm when deploying on k8s: see https://docs.vllm.ai/en/latest/serving/deploying_with_lws.html for how that works.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
🚀 The feature, motivation and pitch
Currently, distributed inference (TP) in vLLM relies on Ray to orchestrate the GPU workers. I briefly checked the code, and it seems the core distributed communication is provided by torch.distributed with the nccl backend; the actors' communication is not done over Ray's own protocol. In this case, Ray just plays the role of orchestration and resource reservation (placement group). Please correct me if I am wrong.
We do use Ray and KubeRay on Kubernetes, and I've successfully tested vLLM distributed inference on this setup, confirming that it works. However, we have many users/platforms, and we do not want to lock onto Ray, since some teams may not have enough Ray knowledge to cover the operation. My proposal is to provide a simple orchestration layer on top of GPUExecutor for users who are familiar with cloud-native tech and would like to use Kubernetes's capabilities for orchestration (instead of Ray actors) and scheduling (instead of placement groups). Ideally, we would have both Ray and Kubernetes as orchestrators for vLLM, providing our platform users with alternative options for their needs.
Please help check whether this proposal makes sense. I can contribute to this feature.
Alternatives
No response
Additional context
No response