vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: ncclSystemError when use two gpus #4383

Closed BeanSprouts closed 2 months ago

BeanSprouts commented 2 months ago

Your current environment

The output of `python collect_env.py`

Collecting environment information...
PyTorch version: 2.2.1+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.11.0-40-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.6
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
Stepping: 7
CPU max MHz: 3200.0000
CPU min MHz: 1000.0000
BogoMIPS: 4800.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 640 KiB (20 instances)
L1i cache: 640 KiB (20 instances)
L2 cache: 20 MiB (20 instances)
L3 cache: 27.5 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; TSX disabled

Versions of relevant libraries:
[pip3] numpy==1.26.3
[pip3] nvidia-nccl-cu11==2.19.3
[pip3] torch==2.2.1+cu118
[pip3] torchaudio==2.2.1+cu118
[pip3] torchvision==0.17.1+cu118
[pip3] triton==2.2.0
[pip3] vllm-nccl-cu11==2.18.1.0.4.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.1
vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     0-9,20-29       0               N/A
GPU1    SYS      X      10-19,30-39     1               N/A

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

The command I ran is:

python3 -m vllm.entrypoints.openai.api_server --tensor-parallel-size 2 --gpu-memory-utilization 0.9 --model /data/Qwen/Qwen1.5-7B-Chat/ --tokenizer /data/Qwen/Qwen1.5-7B-Chat/ --max-model-len 4096

And the output is:

INFO 04-26 01:58:56 api_server.py:151] vLLM API server version 0.4.1 INFO 04-26 01:58:56 api_server.py:152] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/data/Qwen/Qwen1.5-7B-Chat/', tokenizer='/data/Qwen/Qwen1.5-7B-Chat/', skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=4096, guided_decoding_backend='outlines', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, model_loader_extra_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None) 2024-04-26 01:58:58,742 WARNING services.py:1996 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 47824896 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM. 2024-04-26 01:58:59,936 INFO worker.py:1749 -- Started a local Ray instance. INFO 04-26 01:59:00 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/data/Qwen/Qwen1.5-7B-Chat/', speculative_config=None, tokenizer='/data/Qwen/Qwen1.5-7B-Chat/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0) Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. INFO 04-26 01:59:06 utils.py:608] Found nccl from library /root/.config/vllm/nccl/cu11/libnccl.so.2.18.1 (RayWorkerWrapper pid=5280) INFO 04-26 01:59:06 utils.py:608] Found nccl from library /root/.config/vllm/nccl/cu11/libnccl.so.2.18.1 INFO 04-26 01:59:06 selector.py:28] Using FlashAttention backend. 
(RayWorkerWrapper pid=5280) INFO 04-26 01:59:06 selector.py:28] Using FlashAttention backend. INFO 04-26 01:59:06 pynccl_utils.py:43] vLLM is using nccl==2.18.1 a46f6c76528f:2777:2777 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0> a46f6c76528f:2777:2777 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory a46f6c76528f:2777:2777 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation a46f6c76528f:2777:2777 [0] NCCL INFO cudaDriverVersion 12020 NCCL version 2.18.1+cuda11.0 (RayWorkerWrapper pid=5280) INFO 04-26 01:59:06 pynccl_utils.py:43] vLLM is using nccl==2.18.1 a46f6c76528f:2777:2777 [0] NCCL INFO Failed to open libibverbs.so[.1] a46f6c76528f:2777:2777 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0> a46f6c76528f:2777:2777 [0] NCCL INFO Using network Socket a46f6c76528f:2777:2777 [0] NCCL INFO Setting affinity for GPU 0 to 3ff003ff a46f6c76528f:2777:2777 [0] NCCL INFO Channel 00/02 : 0 1 a46f6c76528f:2777:2777 [0] NCCL INFO Channel 01/02 : 0 1 a46f6c76528f:2777:2777 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 a46f6c76528f:2777:2777 [0] NCCL INFO P2P Chunksize set to 131072 a46f6c76528f:2777:2777 [0] NCCL INFO Channel 00 : 0[73000] -> 1[d5000] via SHM/direct/direct a46f6c76528f:2777:2777 [0] NCCL INFO Channel 01 : 0[73000] -> 1[d5000] via SHM/direct/direct a46f6c76528f:2777:2777 [0] NCCL INFO Connected all rings a46f6c76528f:2777:2777 [0] NCCL INFO Connected all trees a46f6c76528f:2777:2777 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 a46f6c76528f:2777:2777 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer a46f6c76528f:2777:2777 [0] NCCL INFO comm 0x5648f6cd5b30 rank 0 nranks 2 cudaDev 0 busId 73000 commId 0xfd02faf687db7405 - Init COMPLETE INFO 04-26 01:59:06 utils.py:129] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json WARNING 04-26 01:59:06 custom_all_reduce.py:74] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly. a46f6c76528f:2777:2777 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0> a46f6c76528f:2777:2777 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation a46f6c76528f:2777:2777 [0] NCCL INFO cudaDriverVersion 12020 NCCL version 2.19.3+cuda11.0 (RayWorkerWrapper pid=5280) INFO 04-26 01:59:06 utils.py:129] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json (RayWorkerWrapper pid=5280) WARNING 04-26 01:59:06 custom_all_reduce.py:74] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly. 
a46f6c76528f:2777:5429 [0] NCCL INFO Failed to open libibverbs.so[.1] a46f6c76528f:2777:5429 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0> a46f6c76528f:2777:5429 [0] NCCL INFO Using non-device net plugin version 0 a46f6c76528f:2777:5429 [0] NCCL INFO Using network Socket a46f6c76528f:2777:5429 [0] NCCL INFO comm 0x5648f9641550 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 73000 commId 0x21e81c737f151ebb - Init START a46f6c76528f:2777:5429 [0] NCCL INFO Setting affinity for GPU 0 to 3ff003ff a46f6c76528f:2777:5429 [0] NCCL INFO Channel 00/02 : 0 1 a46f6c76528f:2777:5429 [0] NCCL INFO Channel 01/02 : 0 1 a46f6c76528f:2777:5429 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 a46f6c76528f:2777:5429 [0] NCCL INFO P2P Chunksize set to 131072

a46f6c76528f:2777:5429 [0] misc/shmutils.cc:72 NCCL WARN Error: failed to extend /dev/shm/nccl-v2jS20 to 9637892 bytes

a46f6c76528f:2777:5429 [0] misc/shmutils.cc:113 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-v2jS20 (size 9637888) a46f6c76528f:2777:5429 [0] NCCL INFO transport/shm.cc:114 -> 2 a46f6c76528f:2777:5429 [0] NCCL INFO transport.cc:33 -> 2 a46f6c76528f:2777:5429 [0] NCCL INFO transport.cc:97 -> 2 a46f6c76528f:2777:5429 [0] NCCL INFO init.cc:1117 -> 2 a46f6c76528f:2777:5429 [0] NCCL INFO init.cc:1396 -> 2 a46f6c76528f:2777:5429 [0] NCCL INFO group.cc:64 -> 2 [Async thread] a46f6c76528f:2777:2777 [0] NCCL INFO group.cc:418 -> 2 a46f6c76528f:2777:2777 [0] NCCL INFO group.cc:95 -> 2 ERROR 04-26 01:59:07 worker_base.py:157] Error executing method init_device. This might cause deadlock in distributed execution. ERROR 04-26 01:59:07 worker_base.py:157] Traceback (most recent call last): ERROR 04-26 01:59:07 worker_base.py:157] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 149, in execute_method ERROR 04-26 01:59:07 worker_base.py:157] return executor(*args, kwargs) ERROR 04-26 01:59:07 worker_base.py:157] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 110, in init_device ERROR 04-26 01:59:07 worker_base.py:157] init_worker_distributed_environment(self.parallel_config, self.rank, ERROR 04-26 01:59:07 worker_base.py:157] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 313, in init_worker_distributed_environment ERROR 04-26 01:59:07 worker_base.py:157] torch.distributed.all_reduce(torch.zeros(1).cuda()) ERROR 04-26 01:59:07 worker_base.py:157] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 72, in wrapper ERROR 04-26 01:59:07 worker_base.py:157] return func(*args, *kwargs) ERROR 04-26 01:59:07 worker_base.py:157] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1992, in all_reduce ERROR 04-26 01:59:07 worker_base.py:157] work = group.allreduce([tensor], opts) ERROR 04-26 01:59:07 worker_base.py:157] torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3 ERROR 04-26 01:59:07 worker_base.py:157] ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
ERROR 04-26 01:59:07 worker_base.py:157] Last error: ERROR 04-26 01:59:07 worker_base.py:157] Error while creating shared memory segment /dev/shm/nccl-v2jS20 (size 9637888) Traceback (most recent call last): File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "/usr/lib/python3.10/runpy.py", line 86, in _run_code exec(code, run_globals) File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 159, in engine = AsyncLLMEngine.from_engine_args( File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 361, in from_engine_args engine = cls( File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 319, in init self.engine = self._init_engine(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 437, in _init_engine return engine_class(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 148, in init self.model_executor = executor_class( File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 382, in init super().init(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in init self._init_executor() File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 45, in _init_executor self._init_workers_ray(placement_group) File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 181, in _init_workers_ray self._run_workers("init_device") File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 318, in _run_workers driver_worker_output = self.driver_worker.execute_method( File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 158, in execute_method raise e File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 149, in execute_method return executor(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 110, in init_device init_worker_distributed_environment(self.parallel_config, self.rank, File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 313, in init_worker_distributed_environment torch.distributed.all_reduce(torch.zeros(1).cuda()) File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 72, in wrapper return func(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1992, in all_reduce work = group.allreduce([tensor], opts) torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3 ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. Last error: Error while creating shared memory segment /dev/shm/nccl-v2jS20 (size 9637888) (RayWorkerWrapper pid=5280) ERROR 04-26 01:59:07 worker_base.py:157] Error executing method init_device. This might cause deadlock in distributed execution. 
(RayWorkerWrapper pid=5280) ERROR 04-26 01:59:07 worker_base.py:157] Traceback (most recent call last): (RayWorkerWrapper pid=5280) ERROR 04-26 01:59:07 worker_base.py:157] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 149, in execute_method (RayWorkerWrapper pid=5280) ERROR 04-26 01:59:07 worker_base.py:157] return executor(*args, *kwargs) (RayWorkerWrapper pid=5280) ERROR 04-26 01:59:07 worker_base.py:157] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 110, in init_device (RayWorkerWrapper pid=5280) ERROR 04-26 01:59:07 worker_base.py:157] init_worker_distributed_environment(self.parallel_config, self.rank, (RayWorkerWrapper pid=5280) ERROR 04-26 01:59:07 worker_base.py:157] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 313, in init_worker_distributed_environment (RayWorkerWrapper pid=5280) ERROR 04-26 01:59:07 worker_base.py:157] torch.distributed.all_reduce(torch.zeros(1).cuda()) (RayWorkerWrapper pid=5280) ERROR 04-26 01:59:07 worker_base.py:157] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 72, in wrapper (RayWorkerWrapper pid=5280) ERROR 04-26 01:59:07 worker_base.py:157] return func(args, kwargs) (RayWorkerWrapper pid=5280) ERROR 04-26 01:59:07 worker_base.py:157] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1992, in all_reduce (RayWorkerWrapper pid=5280) ERROR 04-26 01:59:07 worker_base.py:157] work = group.allreduce([tensor], opts) (RayWorkerWrapper pid=5280) ERROR 04-26 01:59:07 worker_base.py:157] torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3 (RayWorkerWrapper pid=5280) ERROR 04-26 01:59:07 worker_base.py:157] ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. (RayWorkerWrapper pid=5280) ERROR 04-26 01:59:07 worker_base.py:157] Last error: (RayWorkerWrapper pid=5280) ERROR 04-26 01:59:07 worker_base.py:157] Error while creating shared memory segment /dev/shm/nccl-KlAZn3 (size 9637888) a46f6c76528f:2777:2777 [0] NCCL INFO comm 0x5648f6cd5b30 rank 0 nranks 2 cudaDev 0 busId 73000 - Destroy COMPLETE

youkaichao commented 2 months ago

2024-04-26 01:58:58,742 WARNING services.py:1996 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 47824896 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.

ERROR 04-26 01:59:07 worker_base.py:157] ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. ERROR 04-26 01:59:07 worker_base.py:157] Last error: ERROR 04-26 01:59:07 worker_base.py:157] Error while creating shared memory segment /dev/shm/nccl-v2jS20 (size 9637888)

It looks like you need to give the process more shared memory (/dev/shm). This is not a vLLM bug.
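
If you are running inside Docker, the Ray warning above already points at the fix: check how much space /dev/shm has and relaunch the container with a larger --shm-size. A minimal sketch (the image name is a placeholder; the serving flags just mirror the command from the report):

    # Inside the container: see how much shared memory is available
    df -h /dev/shm

    # Relaunch the container with a larger /dev/shm, as the Ray warning suggests
    docker run --gpus all --shm-size=10.24gb -v /data:/data your-vllm-image \
        python3 -m vllm.entrypoints.openai.api_server \
        --tensor-parallel-size 2 --gpu-memory-utilization 0.9 \
        --model /data/Qwen/Qwen1.5-7B-Chat/ --tokenizer /data/Qwen/Qwen1.5-7B-Chat/ \
        --max-model-len 4096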

jockeyyan commented 3 hours ago

I hit the same issue; the error seems to be reported because there is not enough shared memory to create the Ray instance. Adding the following lines to docker-compose.yml fixed it for me:

    build:
      context: .
      shm_size: '10.24gb'
      dockerfile: Dockerfile
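
One caveat: in the Compose file format, shm_size under build: sets the /dev/shm size used while the image is being built, whereas the NCCL error above depends on the /dev/shm of the running container, which is usually set with a service-level shm_size (the equivalent of passing --shm-size to docker run). A minimal sketch, with hypothetical service and image names:

    services:
      vllm:                       # hypothetical service name
        image: your-vllm-image    # placeholder for the image you actually run
        shm_size: '10.24gb'       # /dev/shm size for the running container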