depenglee1707 opened this issue 2 months ago
Does it work without your change?
I guess it could work, but I cannot test it since I only have 2 GPUs and assign the driver 0.1 of a GPU. So does a worker with a non-full GPU matter for NCCL? @youkaichao
Yes, each NCCL process needs to own one full GPU.
Fine, so to run vLLM on a Ray cluster I have to waste a GPU; that's not expected. Any suggestion? @youkaichao
Actually it's not a problem only for the Ray cluster scenario. I mean, on a single node with 2 GPUs, I guess vLLM cannot serve with tensor_parallel_size = 2 either, since the driver process will occupy some GPU.
I don't know your setup with Ray. Our CI works fine on a 2-GPU machine with tensor_parallel_size = 2.
You are right, on a single node it works. I guess I need to dig in and figure out the Ray cluster case. Thanks.
Got the same issue. I solved it by updating accelerate from 0.26.0 to 0.30.0.
So you also rewrote the original code to set num_gpus = self.cache_config.gpu_memory_utilization no matter the world_size? And it works after upgrading accelerate?
???
I guess we hit the same scenario, and it's really a rare case.
I try to use an existing Ray cluster to launch vLLM with model parallelism (world size > 1). In this scenario, in the original design, a Ray deployment (actor) is launched to create a placement group, and the vLLM engine is initialized with the placement group created previously (roughly the pattern sketched below). That solution works at least with vllm>=0.2.0,<0.2.6.
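Roughly, the pattern I mean is something like the following; the actor class, bundle sizes, and model name here are illustrative placeholders, not the exact code I run:

```python
# A minimal sketch of launching vLLM from an existing Ray cluster via a
# placement group; actor name, bundles, and model are hypothetical.
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(address="auto")

# One bundle per tensor-parallel worker.
pg = placement_group([{"GPU": 1, "CPU": 1}] * 2, strategy="PACK")
ray.get(pg.ready())


@ray.remote
class VLLMDeployment:
    def __init__(self):
        # The engine is created inside the actor; with vllm>=0.2.0,<0.2.6 it
        # could reuse the surrounding placement group without the driver
        # process itself holding a GPU.
        from vllm.engine.arg_utils import AsyncEngineArgs
        from vllm.engine.async_llm_engine import AsyncLLMEngine

        args = AsyncEngineArgs(model="facebook/opt-125m", tensor_parallel_size=2)
        self.engine = AsyncLLMEngine.from_engine_args(args)


deployment = VLLMDeployment.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
).remote()
```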
But with vLLM 0.4.x, the driver is required to have GPU capability, and it runs the same method on both the driver and the workers. I really have no idea why it does that; see my question: https://github.com/vllm-project/vllm/issues/4999
I really hope someone can give a clarification...
In case it helps, you can now use tensor parallel without Ray, see https://github.com/vllm-project/vllm/pull/4539.
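If I understand the change correctly, something along these lines should run tensor parallelism with multiprocessing workers instead of Ray; the distributed_executor_backend argument name is my reading of the feature and may be exposed differently in older or newer releases:

```python
# A minimal sketch, assuming a vLLM version that ships PR #4539; the
# distributed_executor_backend="mp" argument may differ between releases.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",
    tensor_parallel_size=2,
    distributed_executor_backend="mp",  # multiprocessing workers, no Ray required
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```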
Actually, in my case I adopt Ray as a unified workload platform and want to run various LLM workloads in a single Ray cluster, for example: LLM inference with plain HF transformers, inference with GGUF-format models, inference with vLLM, etc.
But when I try to integrate with vLLM, I hit this problem. Could you kindly give some suggestions in https://github.com/vllm-project/vllm/issues/4999 or here? Thanks a lot! @njhill
I also encountered this problem. I solved it by compiling NCCL from source and then modifying the path of libnccl.so.2 in the vLLM source code.
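In case it saves someone the source edit: recent vLLM versions appear to read the VLLM_NCCL_SO_PATH environment variable when locating the NCCL library, so pointing it at the custom build might work too (worth verifying for your version):

```python
# A minimal sketch, assuming the installed vLLM honors VLLM_NCCL_SO_PATH when
# loading NCCL; the library path below is a hypothetical example.
import os

os.environ["VLLM_NCCL_SO_PATH"] = "/opt/nccl/build/lib/libnccl.so.2"

from vllm import LLM  # imported after the env var is set

llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
```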
Your current environment
🐛 Describe the bug
It's a little complex in my case: I try to launch vLLM in a Ray cluster. The latest vLLM requires the driver process to have GPU capability, but I do not want to waste a GPU on the driver (one the workers could not use), so I changed the vLLM source code at https://github.com/vllm-project/vllm/blob/a98187cf7227695819e199e2e3ad35be0a9a84f3/vllm/executor/ray_gpu_executor.py#L66-L71 to always use num_gpus = self.cache_config.gpu_memory_utilization, no matter what tensor_parallel_size is. That means the workers and the driver can share one GPU. Unfortunately, the process then hangs forever; to be specific, it hangs in nccl.ncclCommInitRank. The change is sketched below.
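For context, the referenced lines decide how much GPU each Ray worker requests; my edit just forces the fractional branch. This is paraphrased from memory, so the exact code at that commit may differ:

```python
# Original logic in ray_gpu_executor.py (paraphrased; the pinned commit may differ):
if self.parallel_config.tensor_parallel_size == 1:
    # Single-GPU case: the Ray worker only reserves a fraction of a GPU.
    num_gpus = self.cache_config.gpu_memory_utilization
else:
    # Multi-GPU case: each Ray worker reserves a full GPU.
    num_gpus = 1

# My modification: always reserve a fractional GPU, regardless of
# tensor_parallel_size, so the driver and workers can share one device.
num_gpus = self.cache_config.gpu_memory_utilization
```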
Any suggestions? Or any other suggestions on how to launch vLLM on a Ray cluster?