vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: cannot run model when TP>1 (already run debug file) #9369

Open jli943 opened 4 hours ago

jli943 commented 4 hours ago

Your current environment

The output of `python collect_env.py`

```text
Your output of `python collect_env.py` here
```

Model Input Dumps

```python
model = LLM("DeepSeek-Coder-V2-Lite-Base-Autofp8", trust_remote_code=True, max_model_len=4096, tensor_parallel_size=4)
```
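
For completeness, a minimal repro wrapper around this constructor; the prompt and `SamplingParams` values below are illustrative assumptions, only the `LLM(...)` arguments come from this report:

```python
# Repro sketch -- the LLM(...) arguments are taken from the report above;
# the prompt and sampling settings are placeholders.
from vllm import LLM, SamplingParams

model = LLM(
    "DeepSeek-Coder-V2-Lite-Base-Autofp8",
    trust_remote_code=True,
    max_model_len=4096,
    tensor_parallel_size=4,
)
outputs = model.generate(["def fibonacci(n):"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```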

🐛 Describe the bug

```text
$ NCCL_DEBUG=TRACE torchrun --nproc-per-node=4 vLLM_debug.py
kmaker-54-033138205093:44710:44710 [0] NCCL INFO Bootstrap : Using eth0:33.138.205.93<0>
kmaker-54-033138205093:44710:44710 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
kmaker-54-033138205093:44710:44710 [0] NCCL INFO cudaDriverVersion 12010

kmaker-54-033138205093:44710:44710 [0] misc/cudawrap.cc:36 NCCL WARN Cuda failure 3 'initialization error'
NCCL version 2.20.5+cuda12.4
kmaker-54-033138205093:44713:44713 [3] NCCL INFO cudaDriverVersion 12010
kmaker-54-033138205093:44711:44711 [1] NCCL INFO cudaDriverVersion 12010
kmaker-54-033138205093:44712:44712 [2] NCCL INFO cudaDriverVersion 12010

kmaker-54-033138205093:44713:44713 [3] misc/cudawrap.cc:36 NCCL WARN Cuda failure 3 'initialization error'

kmaker-54-033138205093:44711:44711 [1] misc/cudawrap.cc:36 NCCL WARN Cuda failure 3 'initialization error'

kmaker-54-033138205093:44712:44712 [2] misc/cudawrap.cc:36 NCCL WARN Cuda failure 3 'initialization error'
kmaker-54-033138205093:44713:44713 [3] NCCL INFO Bootstrap : Using eth0:33.138.205.93<0>
kmaker-54-033138205093:44711:44711 [1] NCCL INFO Bootstrap : Using eth0:33.138.205.93<0>
kmaker-54-033138205093:44712:44712 [2] NCCL INFO Bootstrap : Using eth0:33.138.205.93<0>
kmaker-54-033138205093:44713:44713 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
kmaker-54-033138205093:44711:44711 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
kmaker-54-033138205093:44712:44712 [2] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
kmaker-54-033138205093:44710:44744 [0] NCCL INFO Failed to open libibverbs.so[.1]
kmaker-54-033138205093:44711:44747 [1] NCCL INFO Failed to open libibverbs.so[.1]
kmaker-54-033138205093:44713:44745 [3] NCCL INFO Failed to open libibverbs.so[.1]
kmaker-54-033138205093:44711:44747 [1] NCCL INFO NET/Socket : Using [0]eth0:33.138.205.93<0>
kmaker-54-033138205093:44713:44745 [3] NCCL INFO NET/Socket : Using [0]eth0:33.138.205.93<0>
kmaker-54-033138205093:44710:44744 [0] NCCL INFO NET/Socket : Using [0]eth0:33.138.205.93<0>
kmaker-54-033138205093:44713:44745 [3] NCCL INFO Using non-device net plugin version 0
kmaker-54-033138205093:44711:44747 [1] NCCL INFO Using non-device net plugin version 0
kmaker-54-033138205093:44710:44744 [0] NCCL INFO Using non-device net plugin version 0
kmaker-54-033138205093:44713:44745 [3] NCCL INFO Using network Socket
kmaker-54-033138205093:44711:44747 [1] NCCL INFO Using network Socket
kmaker-54-033138205093:44710:44744 [0] NCCL INFO Using network Socket
kmaker-54-033138205093:44712:44746 [2] NCCL INFO Failed to open libibverbs.so[.1]
kmaker-54-033138205093:44712:44746 [2] NCCL INFO NET/Socket : Using [0]eth0:33.138.205.93<0>
kmaker-54-033138205093:44712:44746 [2] NCCL INFO Using non-device net plugin version 0
kmaker-54-033138205093:44712:44746 [2] NCCL INFO Using network Socket
kmaker-54-033138205093:44712:44746 [2] NCCL INFO comm 0x8895470 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 189000 commId 0xe045986dcbca5b24 - Init START
kmaker-54-033138205093:44710:44744 [0] NCCL INFO comm 0x7c1fa60 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId d1000 commId 0xe045986dcbca5b24 - Init START
kmaker-54-033138205093:44711:44747 [1] NCCL INFO comm 0x785e950 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId d2000 commId 0xe045986dcbca5b24 - Init START
kmaker-54-033138205093:44713:44745 [3] NCCL INFO comm 0x82ed0b0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 18a000 commId 0xe045986dcbca5b24 - Init START
kmaker-54-033138205093:44710:44744 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff

kmaker-54-033138205093:44710:44744 [0] transport/nvls.cc:246 NCCL WARN Cuda failure 3 'initialization error'
kmaker-54-033138205093:44710:44744 [0] NCCL INFO init.cc:1010 -> 1
kmaker-54-033138205093:44710:44744 [0] NCCL INFO init.cc:1501 -> 1
kmaker-54-033138205093:44710:44744 [0] NCCL INFO group.cc:64 -> 1 [Async thread]
kmaker-54-033138205093:44710:44710 [0] NCCL INFO group.cc:418 -> 1
kmaker-54-033138205093:44710:44710 [0] NCCL INFO init.cc:1876 -> 1
kmaker-54-033138205093:44711:44747 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff

kmaker-54-033138205093:44711:44747 [1] transport/nvls.cc:246 NCCL WARN Cuda failure 3 'initialization error'
kmaker-54-033138205093:44711:44747 [1] NCCL INFO init.cc:1010 -> 1
kmaker-54-033138205093:44711:44747 [1] NCCL INFO init.cc:1501 -> 1
kmaker-54-033138205093:44711:44747 [1] NCCL INFO group.cc:64 -> 1 [Async thread]
kmaker-54-033138205093:44712:44746 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff,00000000
kmaker-54-033138205093:44711:44711 [1] NCCL INFO group.cc:418 -> 1
kmaker-54-033138205093:44711:44711 [1] NCCL INFO init.cc:1876 -> 1

kmaker-54-033138205093:44712:44746 [2] transport/nvls.cc:246 NCCL WARN Cuda failure 3 'initialization error'
kmaker-54-033138205093:44712:44746 [2] NCCL INFO init.cc:1010 -> 1
kmaker-54-033138205093:44712:44746 [2] NCCL INFO init.cc:1501 -> 1
kmaker-54-033138205093:44713:44745 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000,ffffffff,00000000
kmaker-54-033138205093:44712:44746 [2] NCCL INFO group.cc:64 -> 1 [Async thread]

kmaker-54-033138205093:44713:44745 [3] transport/nvls.cc:246 NCCL WARN Cuda failure 3 'initialization error'
kmaker-54-033138205093:44713:44745 [3] NCCL INFO init.cc:1010 -> 1
kmaker-54-033138205093:44713:44745 [3] NCCL INFO init.cc:1501 -> 1
kmaker-54-033138205093:44713:44745 [3] NCCL INFO group.cc:64 -> 1 [Async thread]
kmaker-54-033138205093:44712:44712 [2] NCCL INFO group.cc:418 -> 1
kmaker-54-033138205093:44712:44712 [2] NCCL INFO init.cc:1876 -> 1
kmaker-54-033138205093:44713:44713 [3] NCCL INFO group.cc:418 -> 1
kmaker-54-033138205093:44713:44713 [3] NCCL INFO init.cc:1876 -> 1
kmaker-54-033138205093:44711:44711 [1] NCCL INFO comm 0x785e950 rank 1 nranks 4 cudaDev 1 busId d2000 - Abort COMPLETE
kmaker-54-033138205093:44710:44710 [0] NCCL INFO comm 0x7c1fa60 rank 0 nranks 4 cudaDev 0 busId d1000 - Abort COMPLETE
kmaker-54-033138205093:44713:44713 [3] NCCL INFO comm 0x82ed0b0 rank 3 nranks 4 cudaDev 3 busId 18a000 - Abort COMPLETE
kmaker-54-033138205093:44712:44712 [2] NCCL INFO comm 0x8895470 rank 2 nranks 4 cudaDev 2 busId 189000 - Abort COMPLETE
[rank2]: Traceback (most recent call last):
[rank2]:   File "/ossfs/node_48498863/workspace/vLLM_debug.py", line 8, in <module>
[rank2]:     dist.all_reduce(data, op=dist.ReduceOp.SUM)
[rank2]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank2]:     return func(*args, **kwargs)
[rank2]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank2]:     work = group.allreduce([tensor], opts)
[rank2]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank2]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank2]: Last error:
[rank2]: Cuda failure 3 'initialization error'
[rank3]: Traceback (most recent call last):
[rank3]:   File "/ossfs/node_48498863/workspace/vLLM_debug.py", line 8, in <module>
[rank3]:     dist.all_reduce(data, op=dist.ReduceOp.SUM)
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank3]:     return func(*args, **kwargs)
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank3]:     work = group.allreduce([tensor], opts)
[rank3]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank3]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank3]: Last error:
[rank3]: Cuda failure 3 'initialization error'
[rank1]: Traceback (most recent call last):
[rank1]:   File "/ossfs/node_48498863/workspace/vLLM_debug.py", line 8, in <module>
[rank1]:     dist.all_reduce(data, op=dist.ReduceOp.SUM)
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank1]:     work = group.allreduce([tensor], opts)
[rank1]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank1]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank1]: Last error:
[rank1]: Cuda failure 3 'initialization error'
[rank0]: Traceback (most recent call last):
[rank0]:   File "/ossfs/node_48498863/workspace/vLLM_debug.py", line 8, in <module>
[rank0]:     dist.all_reduce(data, op=dist.ReduceOp.SUM)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank0]:     work = group.allreduce([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank0]: Last error:
[rank0]: Cuda failure 3 'initialization error'
[rank0]:[W1015 17:54:03.968600650 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
E1015 17:54:04.185000 140153307133760 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 44710) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

vLLM_debug.py FAILED

Failures:
[1]:
  time      : 2024-10-15_17:54:04
  host      : kmaker-54-033138205093
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 44711)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-10-15_17:54:04
  host      : kmaker-54-033138205093
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 44712)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-10-15_17:54:04
  host      : kmaker-54-033138205093
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 44713)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time      : 2024-10-15_17:54:04
  host      : kmaker-54-033138205093
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 44710)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
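
For context, the traceback above points at a `dist.all_reduce` call on line 8 of `vLLM_debug.py`, i.e. the script looks like the plain all-reduce sanity check from the vLLM debugging guide rather than anything vLLM-specific. A minimal sketch of such a script (the tensor shape and variable names are assumptions; only the `dist.all_reduce(data, op=dist.ReduceOp.SUM)` call is taken from the traceback):

```python
# nccl_sanity_check.py -- minimal all-reduce smoke test (sketch).
# Run with: NCCL_DEBUG=TRACE torchrun --nproc-per-node=4 nccl_sanity_check.py
import torch
import torch.distributed as dist

# torchrun provides RANK/WORLD_SIZE/MASTER_ADDR via the environment
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

data = torch.ones(1, device="cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)  # the call that raises ncclUnhandledCudaError above
torch.cuda.synchronize()
assert data.item() == dist.get_world_size()
print(f"rank {dist.get_rank()}: all_reduce OK")

dist.destroy_process_group()  # also avoids the ProcessGroupNCCL warning in the log
```

Since this script does not import vLLM at all, its failing with `Cuda failure 3 'initialization error'` suggests the problem sits below vLLM, in the CUDA/NCCL setup of the container or driver.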


zeyang12-jpg commented 3 hours ago

Same issue when I use vLLM v0.6.2/0.6.3: as soon as I set tp>1, the process is killed shortly after start, even when I try torchrun or Ray to run the script.
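
For reference, "try Ray" here presumably means forcing the Ray executor instead of the default multiprocessing backend; a minimal sketch of that (the model name is reused from the report above, `distributed_executor_backend` is the standard engine argument, and the tensor-parallel size is an arbitrary value > 1):

```python
# Sketch: force the Ray executor for tensor parallelism instead of the
# default multiprocessing backend (model name reused from the report above).
from vllm import LLM

llm = LLM(
    "DeepSeek-Coder-V2-Lite-Base-Autofp8",
    trust_remote_code=True,
    tensor_parallel_size=2,  # any value > 1 reproduces the issue per this comment
    distributed_executor_backend="ray",
)
```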

zeyang12-jpg commented 3 hours ago

I followed the comment in https://github.com/vllm-project/vllm/issues/4019, but it didn't work even after I added `--enforce-eager`.
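
For clarity, `--enforce-eager` is the CLI flag for the server; in the offline `LLM` API the equivalent is `enforce_eager=True`. A sketch of what was tried, reusing the arguments from the original report (this did not resolve the error here):

```python
# Sketch: the Python-API equivalent of --enforce-eager, which disables CUDA
# graph capture; arguments otherwise reused from the original report.
from vllm import LLM

llm = LLM(
    "DeepSeek-Coder-V2-Lite-Base-Autofp8",
    trust_remote_code=True,
    max_model_len=4096,
    tensor_parallel_size=4,
    enforce_eager=True,
)
```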