vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. #6574

Closed. jueming0312 closed this issue 1 month ago.

jueming0312 commented 1 month ago

Your current environment

The output of `python collect_env.py`
Collecting environment information...
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.0
Libc version: glibc-2.35

Python version: 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-116-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 4090

Nvidia driver version: 550.90.07
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:         x86_64
CPU op-mode(s):       32-bit, 64-bit
Address sizes:        48 bits physical, 48 bits virtual
Byte Order:           Little Endian
CPU(s):               128
On-line CPU(s) list:  0-15
Off-line CPU(s) list: 16-127
Vendor ID:            AuthenticAMD
Model name:           AMD EPYC 7543 32-Core Processor
CPU family:           25
Model:                1
Thread(s) per core:   2
Core(s) per socket:   32
Socket(s):            2
Stepping:             1
Frequency boost:      enabled
CPU max MHz:          3737.8899
CPU min MHz:          1500.0000
BogoMIPS:             5589.11
Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca
Virtualization:       AMD-V
L1d cache:            2 MiB (64 instances)
L1i cache:            2 MiB (64 instances)
L2 cache:             32 MiB (64 instances)
L3 cache:             512 MiB (16 instances)
NUMA node(s):         2
NUMA node0 CPU(s):    0-31,64-95
NUMA node1 CPU(s):    32-63,96-127

Versions of relevant libraries:
[pip3] flashinfer==0.0.9+cu121torch2.3
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.1
[pip3] torchvision==0.18.1
[pip3] transformers==4.42.4
[pip3] triton==2.3.1
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PIX                             N/A
GPU1    PIX      X                              N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

How would you like to use vllm

I'm running the vllm image in Kubernetes, and this error message appears when loading the internlm/internlm2_5-7b-chat model.
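For context, the failing call in the traceback below is a tensor-parallel all_reduce, which implies the model was sharded across both RTX 4090s. A hypothetical launch command for the container (the exact invocation was not posted; every flag here is an assumption):

```bash
# Hypothetical invocation: model name taken from the report, all other flags assumed.
python -m vllm.entrypoints.openai.api_server \
    --model internlm/internlm2_5-7b-chat \
    --tensor-parallel-size 2 \
    --trust-remote-code
```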

(VllmWorkerProcess pid=87) INFO 07-19 10:58:27 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=87) INFO 07-19 10:58:27 utils.py:737] Found nccl from library libnccl.so.2
INFO 07-19 10:58:27 utils.py:737] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=87) INFO 07-19 10:58:27 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 07-19 10:58:27 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 07-19 10:58:27 custom_all_reduce_utils.py:202] generating GPU P2P access cache in /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 07-19 10:58:32 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=87) INFO 07-19 10:58:32 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json
WARNING 07-19 10:58:32 custom_all_reduce.py:127] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=87) WARNING 07-19 10:58:32 custom_all_reduce.py:127] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=87) INFO 07-19 10:59:03 model_runner.py:266] Loading model weights took 7.2232 GB
INFO 07-19 10:59:03 model_runner.py:266] Loading model weights took 7.2232 GB
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226] ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226] Last error:
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226] Error while creating shared memory segment /dev/shm/nccl-bWHyCi (size 9637888), Traceback (most recent call last):
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]     self.model_runner.profile_run()
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 923, in profile_run
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]     self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1341, in execute_model
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]     hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/internlm2.py", line 270, in forward
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]     hidden_states = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/internlm2.py", line 229, in forward
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]     hidden_states = self.tok_embeddings(input_ids)
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 350, in forward
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]     output = tensor_model_parallel_all_reduce(output_parallel)
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/communication_op.py", line 11, in tensor_model_parallel_all_reduce
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]     return get_tp_group().all_reduce(input_)
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 293, in all_reduce
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]     torch.distributed.all_reduce(input_, group=self.device_group)
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226]     work = group.allreduce([tensor], opts)
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226] torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226] ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226] Last error:
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226] Error while creating shared memory segment /dev/shm/nccl-bWHyCi (size 9637888)
(VllmWorkerProcess pid=87) ERROR 07-19 10:59:03 multiproc_worker_utils.py:226] 
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 282, in <module>
[rank0]:     run_server(args)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 224, in run_server
[rank0]:     if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 444, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 373, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 520, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 263, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 362, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 38, in determine_num_available_blocks
[rank0]:     num_blocks = self._run_workers("determine_num_available_blocks", )
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 135, in _run_workers
[rank0]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 923, in profile_run
[rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1341, in execute_model
[rank0]:     hidden_or_intermediate_states = model_executable(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/internlm2.py", line 270, in forward
[rank0]:     hidden_states = self.model(input_ids, positions, kv_caches,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/internlm2.py", line 229, in forward
[rank0]:     hidden_states = self.tok_embeddings(input_ids)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 350, in forward
[rank0]:     output = tensor_model_parallel_all_reduce(output_parallel)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/communication_op.py", line 11, in tensor_model_parallel_all_reduce
[rank0]:     return get_tp_group().all_reduce(input_)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 293, in all_reduce
[rank0]:     torch.distributed.all_reduce(input_, group=self.device_group)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
[rank0]:     work = group.allreduce([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank0]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
[rank0]: Last error:
[rank0]: Error while creating shared memory segment /dev/shm/nccl-TO0hFk (size 9637888)
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
youkaichao commented 1 month ago

[rank0]: Error while creating shared memory segment /dev/shm/nccl-TO0hFk (size 9637888)

You don't have enough shared memory (shm) in the container. See https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html.
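In practice this means giving the container a larger /dev/shm. With plain Docker the linked page recommends running with `--ipc=host` or a larger `--shm-size`; on Kubernetes an equivalent is a memory-backed emptyDir mounted at /dev/shm. A minimal sketch of the pod spec change (names, image tag, and size are placeholders):

```yaml
# Hypothetical Deployment/Pod fragment; a memory-backed emptyDir mounted at
# /dev/shm enlarges the shared memory available to NCCL inside the container.
spec:
  containers:
  - name: vllm
    image: vllm/vllm-openai:v0.5.2   # placeholder image tag
    volumeMounts:
    - name: dshm
      mountPath: /dev/shm
  volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: 4Gi                 # size generously; NCCL needs several MB per communicator
```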

jueming0312 commented 1 month ago

Hello, I have a question about this issue. Will the engine still use shared memory even if my GPU memory is more than sufficient?

[rank0]: Error while creating shared memory segment /dev/shm/nccl-TO0hFk (size 9637888)

You don't have enough shared memory (shm) in the container. See https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html.

youkaichao commented 1 month ago

Shared memory is commonly used for inter-process communication; it is unrelated to your GPU memory.
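To make the distinction concrete, here is an illustrative Python sketch that reports the two pools separately; the ~9.6 MB nccl-* segment in the error above is allocated from /dev/shm, not from GPU memory:

```python
# Illustrative sketch: /dev/shm (host shared memory) and GPU memory are separate pools.
import shutil
import torch

shm = shutil.disk_usage("/dev/shm")          # tmpfs backing POSIX shared memory
print(f"/dev/shm total: {shm.total / 2**20:.0f} MiB, free: {shm.free / 2**20:.0f} MiB")

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # GPU memory as reported by the CUDA driver
    print(f"GPU 0 total: {total / 2**20:.0f} MiB, free: {free / 2**20:.0f} MiB")

# NCCL's intra-node transport allocates its buffers in /dev/shm, so a container
# with a small /dev/shm can fail even when the GPUs have plenty of free memory.
```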