vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Running mistral-large results in an error related to NCCL #7801

Open White-Friday opened 3 months ago

White-Friday commented 3 months ago

Your current environment

The environment is the latest vLLM 0.5.4 Docker image, and the command used to launch the server is:

python3 api_server.py --port 10195 --model /data/models/Mistral-Large-Instruct-2407/ --served-model-name mistral --tensor-parallel-size 8 --max-model-len 4096 --disable-custom-all-reduce
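For reference (not part of the original report), below is a sketch of an equivalent launch through the published vllm/vllm-openai image; the image tag, host port mapping, and volume path are assumptions, while the server flags mirror the command above. The vLLM Docker docs recommend --ipc=host (or a sufficiently large --shm-size), because the tensor-parallel worker processes communicate through shared memory, and missing it is a common source of NCCL "unhandled system error" inside containers.

# Hypothetical equivalent launch; adjust the tag, port mapping, and mounts to your setup.
docker run --gpus all --ipc=host -p 10195:8000 \
  -v /data/models:/data/models \
  vllm/vllm-openai:v0.5.4 \
  --model /data/models/Mistral-Large-Instruct-2407/ \
  --served-model-name mistral \
  --tensor-parallel-size 8 \
  --max-model-len 4096 \
  --disable-custom-all-reduce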

šŸ› Describe the bug

(VllmWorkerProcess pid=232) ERROR 08-23 08:25:25 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method init_device: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details), Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
    output = executor(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 132, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 348, in init_worker_distributed_environment
    ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 965, in ensure_model_parallel_initialized
    initialize_model_parallel(tensor_model_parallel_size,
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 931, in initialize_model_parallel
    _TP = init_model_parallel_group(group_ranks,
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 773, in init_model_parallel_group
    return GroupCoordinator(
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 154, in __init__
    self.pynccl_comm = PyNcclCommunicator(
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in __init__
    self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
    self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
    raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)

The worker processes with pid=235 and pid=236 fail with the identical traceback and the same RuntimeError.
UlrikWKoren commented 3 months ago

^ Exactly the same issue with meta-llama/Meta-Llama-3.1-70B-Instruct, running in Docker on 2x NVIDIA H100 GPUs.

youkaichao commented 3 months ago

You need to follow https://docs.vllm.ai/en/latest/getting_started/debugging.html to provide more information.
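For anyone hitting the same error: the linked debugging guide asks for logs collected with more verbose settings before reproducing the failure. A rough sketch, assuming the environment variables the guide lists at the time of writing (check the page for the current set):

# Enable more verbose logging before reproducing the failure (per the linked guide).
export VLLM_LOGGING_LEVEL=DEBUG   # verbose vLLM logs
export CUDA_LAUNCH_BLOCKING=1     # surface the failing CUDA call at the right stack frame
export NCCL_DEBUG=TRACE           # detailed NCCL init and transport logs
export VLLM_TRACE_FUNCTION=1      # per-function tracing; very slow, debugging only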

github-actions[bot] commented 23 hours ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!