esmeetu opened this issue 7 months ago
When does this problem occur? Is it related to #2152 ?
@youkaichao Nope, it's related to the custom all-reduce feature. After upgrading to nccl==2.19.3, everything is OK. Related issue: https://github.com/NVIDIA/nccl/issues/957 Fix commit: https://github.com/NVIDIA/nccl/commit/4365458757e4107ecbf629b2fd6e0e19a5d237c2 cc @WoosukKwon @hanzhi713
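(For anyone else hitting this: a quick way to sanity-check which NCCL build your installed PyTorch bundles; this snippet is just for verification, not part of the fix.)

```python
import torch

# Prints the NCCL version PyTorch was built against, e.g. (2, 18, 3) or (2, 19, 3).
# The hang discussed above was fixed upstream in NCCL 2.19.3.
print(torch.cuda.nccl.version())
```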
@esmeetu I guess it's NCCL's problem? Let me know if a fix is needed from my side.
@hanzhi713 Yes, I think so. I will test again after vLLM upgrades PyTorch to v2.2.2.
@hanzhi713 Why is your custom all-reduce kernel influenced by NCCL? IIUC, it doesn't use NCCL. 🤔 And do you have any idea why that NCCL bug would cause my issue?
Allreduce with larger sizes (>= 8 MB) and other collectives (like gather) still need NCCL.
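(For readers wondering how the two paths coexist, here is a rough sketch of that size-based dispatch. The function names and the exact 8 MB cutoff are illustrative only, not vLLM's actual implementation.)

```python
import torch
import torch.distributed as dist

# Illustrative threshold: below this, the custom peer-to-peer kernel handles
# all-reduce; above it (and for other collectives), NCCL takes over.
CUSTOM_ALL_REDUCE_MAX_BYTES = 8 * 1024 * 1024

def fused_all_reduce(tensor: torch.Tensor) -> None:
    nbytes = tensor.numel() * tensor.element_size()
    if nbytes <= CUSTOM_ALL_REDUCE_MAX_BYTES and custom_all_reduce_supported():
        custom_all_reduce(tensor)   # hypothetical fast path for small tensors
    else:
        dist.all_reduce(tensor)     # NCCL path for large tensors

def custom_all_reduce_supported() -> bool:
    # Placeholder: in practice this depends on P2P access between GPUs,
    # world size, and whether CUDA graphs are in use.
    return torch.cuda.is_available()

def custom_all_reduce(tensor: torch.Tensor) -> None:
    # Placeholder for the custom CUDA kernel; simply falls back to NCCL here.
    dist.all_reduce(tensor)
```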
Has anyone tried vllm 0.3.3 + torch 2.1.1+cu118 with nccl==2.19.3? By default, vllm 0.3.3 + torch 2.1.1+cu118 installs nccl==2.18.3, which is giving the all_reduce error with multiple nodes.
Has anyone tried vllm 0.3.3 + torch 2.1.1+cu118 with nccl==2.19.3? By default, vllm 0.3.3 + torch 2.1.1+cu118 installs nccl==2.18.3, which is giving the all_reduce error with multiple nodes.
Can you show your environment and error trace?
@Sande33p
- can you run with export NCCL_DEBUG=TRACE again to get more verbose output for debugging?
- if you are using multi-node inference, I suggest you build from source again. [Core] Support multi-node inference (eager and cuda graph) #3686 just fixed some issues with multi-node setup.
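(In case it helps anyone reproducing this: NCCL only reads its debug settings at initialization, so they must be in the environment before vLLM creates any process groups. A minimal sketch; the log-file path and model are placeholders, and NCCL_DEBUG_FILE's %h/%p expansion is per the NCCL docs.)

```python
import os

# Must be set before vLLM / torch.distributed initialize NCCL.
os.environ["NCCL_DEBUG"] = "TRACE"
# Optional: %h expands to the hostname and %p to the pid, so each rank
# writes its own log file instead of interleaving output on stdout.
os.environ["NCCL_DEBUG_FILE"] = "/tmp/nccl-%h-%p.log"

from vllm import LLM

llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
print(llm.generate(["Hello, my name is"])[0].outputs[0].text)
```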
@youkaichao here is the error file with export NCCL_DEBUG=TRACE
error_2.txt
@Sande33p I took a look at your error log, and I find the following lines might be relevant:
x3005c0s37b1n0:24760:24760 [0] NCCL INFO cudaDriverVersion 11080
NCCL version 2.18.3+cuda11.0
x3005c0s37b1n0:24760:24760 [0] misc/strongstream.cc:53 NCCL WARN NCCL cannot be captured in a graph if either it wasn't built with CUDA runtime >= 11.3 or if the installed CUDA driver < R465.
It seems your CUDA version is too old. Can you try upgrading it?
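(To confirm exactly what the warning is complaining about, something like the following prints the installed driver and the CUDA runtime torch was built with. It assumes pynvml / nvidia-ml-py is installed, which is usually the case alongside vLLM.)

```python
import torch
import pynvml

print("torch:", torch.__version__)
print("CUDA runtime torch was built with:", torch.version.cuda)

pynvml.nvmlInit()
# The NCCL warning above requires driver >= R465 for graph capture.
print("NVIDIA driver:", pynvml.nvmlSystemGetDriverVersion())
pynvml.nvmlShutdown()
```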
Has anyone tried vllm 0.3.3 + torch 2.1.1+cu118 with nccl==2.19.3? By default, vllm 0.3.3 + torch 2.1.1+cu118 installs nccl==2.18.3, which is giving the all_reduce error with multiple nodes.
Yes, I faced the same error when I tried vllm 0.4.1+cu11, torch 2.2.1+cu11, nccl==2.19.2, and vllm-nccl-cu11 2.18.1.0.4.0 with multiple GPUs.
When I set the env var export VLLM_NCCL_SO_PATH=/usr/lib/x86_64-linux-gnu/libnccl.so.2
before starting the vLLM application, it works fine. The version of NCCL located at /usr/lib/x86_64-linux-gnu/libnccl.so.2 is 2.16.2.
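(If you prefer setting it from Python rather than the shell, the variable just has to be in the environment before vLLM loads NCCL. A minimal sketch, with the path taken from the comment above; adjust it for your system.)

```python
import os

# Point vLLM at the system NCCL instead of the bundled one.
# Must be set before vLLM initializes its NCCL wrapper.
os.environ["VLLM_NCCL_SO_PATH"] = "/usr/lib/x86_64-linux-gnu/libnccl.so.2"

from vllm import LLM

llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
```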
Same issue here, using the vllm/vllm-openai:latest Docker image. Is it related to the host CUDA version?
NCCL version 2.20.5+cuda12.4
Is it related to the host CUDA version?
Maybe. What's your host driver info?
I can't check that; I'm running the vllm/vllm-openai:latest Docker image inside RunPod.io, and I'm having the same issue as in this topic.
You can follow the issue template at https://github.com/vllm-project/vllm/issues/new/choose to run an environment collection script.
Allreduce with larger sizes (>= 8 MB) and other collectives (like gather) still need NCCL.
@hanzhi713 @youkaichao May I ask, what was the original intention behind vLLM's development of custom allreduce?
@unix1986 You can benchmark it yourself; it is just faster.
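(For a rough end-to-end comparison, something like the sketch below can work. It assumes your vLLM version exposes the disable_custom_all_reduce engine argument, and each configuration should be run in a separate process.)

```python
import sys
import time
from vllm import LLM, SamplingParams

# Usage: python bench.py on|off   (run each mode in its own process)
disable = sys.argv[1] == "off"

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",      # any multi-GPU model you have locally
    tensor_parallel_size=2,
    disable_custom_all_reduce=disable,      # assumed engine arg; check your vLLM version
)

prompts = ["Hello, my name is"] * 64
start = time.perf_counter()
llm.generate(prompts, SamplingParams(max_tokens=128))
print(f"custom all-reduce {sys.argv[1]}: {time.perf_counter() - start:.1f} s")
```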
Your current environment
🐛 Describe the bug
llm_engine works fine; async_llm_engine does not work. Log: