vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: stuck at "generating GPU P2P access cache in /home/luban/.cache/vllm/gpu_p2p_access_cache_for_0,1.json" #8735

Open immusferr opened 2 months ago

immusferr commented 2 months ago

Your current environment

The output of `python collect_env.py` ```text python collect_env.py Collecting environment information... 2024-09-23 17:57:46.577274: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`. 2024-09-23 17:57:46.594737: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered 2024-09-23 17:57:46.616458: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered 2024-09-23 17:57:46.622847: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered 2024-09-23 17:57:46.638311: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. 2024-09-23 17:57:47.734082: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT PyTorch version: 2.4.0+cu121 Is debug build: False CUDA used to build PyTorch: 12.1 ROCM used to build PyTorch: N/A OS: Ubuntu 20.04.6 LTS (x86_64) GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 Clang version: Could not collect CMake version: Could not collect Libc version: glibc-2.31 Python version: 3.11.10 | packaged by conda-forge | (main, Sep 10 2024, 11:01:28) [GCC 13.3.0] (64-bit runtime) Python platform: Linux-4.18.0-193.6.3.el8_2.v1.4.x86_64-x86_64-with-glibc2.31 Is CUDA available: True CUDA runtime version: 12.1.105 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: GPU 0: NVIDIA RTX A6000 GPU 1: NVIDIA RTX A6000 Nvidia driver version: 535.129.03 cuDNN version: Probably one of the following: /usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0 HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 46 bits physical, 57 bits virtual CPU(s): 128 On-line CPU(s) list: 0-127 Thread(s) per core: 2 Core(s) per socket: 32 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 106 Model name: Intel(R) Xeon(R) Platinum 8352Y CPU @ 2.20GHz Stepping: 6 CPU MHz: 2200.000 CPU max MHz: 3400.0000 CPU min MHz: 800.0000 BogoMIPS: 4400.00 Virtualization: VT-x L1d cache: 3 MiB L1i cache: 2 MiB L2 cache: 80 MiB L3 cache: 96 MiB NUMA node0 CPU(s): 0-31,64-95 NUMA node1 CPU(s): 32-63,96-127 Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; 
usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid md_clear pconfig flush_l1d arch_capabilities Versions of relevant libraries: [pip3] galore-torch==1.0 [pip3] mypy-extensions==1.0.0 [pip3] numpy==1.26.4 [pip3] nvidia-cublas-cu12==12.1.3.1 [pip3] nvidia-cuda-cupti-cu12==12.1.105 [pip3] nvidia-cuda-nvrtc-cu12==12.1.105 [pip3] nvidia-cuda-runtime-cu12==12.1.105 [pip3] nvidia-cudnn-cu12==9.1.0.70 [pip3] nvidia-cufft-cu12==11.0.2.54 [pip3] nvidia-curand-cu12==10.3.2.106 [pip3] nvidia-cusolver-cu12==11.4.5.107 [pip3] nvidia-cusparse-cu12==12.1.0.106 [pip3] nvidia-ml-py==12.560.30 [pip3] nvidia-nccl-cu12==2.20.5 [pip3] nvidia-nvjitlink-cu12==12.6.68 [pip3] nvidia-nvtx-cu12==12.1.105 [pip3] optree==0.12.1 [pip3] pyzmq==26.2.0 [pip3] torch==2.4.0 [pip3] torchao==0.5.0 [pip3] torchaudio==2.4.0 [pip3] torchtext==0.18.0 [pip3] torchvision==0.19.0 [pip3] transformers==4.45.0.dev0 [pip3] triton==3.0.0 [pip3] zmq==0.0.0 [conda] galore-torch 1.0 pypi_0 pypi [conda] numpy 1.26.4 pypi_0 pypi [conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi [conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi [conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi [conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi [conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi [conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi [conda] nvidia-ml-py 12.560.30 pypi_0 pypi [conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi [conda] nvidia-nvjitlink-cu12 12.6.68 pypi_0 pypi [conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi [conda] optree 0.12.1 pypi_0 pypi [conda] pyzmq 26.2.0 pypi_0 pypi [conda] torch 2.4.0 pypi_0 pypi [conda] torchao 0.5.0 pypi_0 pypi [conda] torchaudio 2.4.0 pypi_0 pypi [conda] torchtext 0.18.0 pypi_0 pypi [conda] torchvision 0.19.0 pypi_0 pypi [conda] transformers 4.45.0.dev0 pypi_0 pypi [conda] triton 3.0.0 pypi_0 pypi [conda] zmq 0.0.0 pypi_0 pypi ROCM Version: Could not collect Neuron SDK Version: N/A vLLM Version: 0.5.5@09c7792610ada9f88bbf87d32b472dd44bf23cc2 vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled GPU Topology: GPU0 GPU1 NIC0 NIC1 CPU Affinity NUMA Affinity GPU NUMA ID GPU0 X SYS SYS SYS 0-31,64-95 0 N/A GPU1 SYS X SYS SYS 32-63,96-127 1 N/A NIC0 SYS SYS X PIX NIC1 SYS SYS PIX X Legend: X = Self SYS = 
Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks NIC Legend: NIC0: mlx5_0 NIC1: mlx5_1 ```

Model Input Dumps

python vllm_test.py 2024-09-23 17:49:44.893334: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0. 2024-09-23 17:49:44.910873: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered 2024-09-23 17:49:44.932328: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered 2024-09-23 17:49:44.938677: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered 2024-09-23 17:49:44.954346: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. 2024-09-23 17:49:46.068503: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT WARNING 09-23 17:49:48 utils.py:721] Gemma 2 uses sliding window attention for every odd layer, which is currently not supported by vLLM. Disabling sliding window and capping the max length to the sliding window size (4096). INFO 09-23 17:49:48 config.py:813] Defaulting to use mp for distributed inference INFO 09-23 17:49:48 llm_engine.py:184] Initializing an LLM engine (v0.5.5) with config: model='/ofs/llm-for-dptd/modelscope/LLM-Research/gemma-2-27b-it', speculative_config=None, tokenizer='/ofs/llm-for-dptd/modelscope/LLM-Research/gemma-2-27b-it', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/ofs/llm-for-dptd/modelscope/LLM-Research/gemma-2-27b-it, use_v2_block_manager=False, enable_prefix_caching=False) WARNING 09-23 17:49:50 multiproc_gpu_executor.py:59] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed. INFO 09-23 17:49:50 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager (VllmWorkerProcess pid=2348) INFO 09-23 17:49:50 selector.py:142] Using Flashinfer backend. INFO 09-23 17:49:50 selector.py:142] Using Flashinfer backend. (VllmWorkerProcess pid=2348) INFO 09-23 17:49:50 multiproc_worker_utils.py:215] Worker ready; awaiting tasks [W923 17:49:50.014460463 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:37597 (errno: 97 - Address family not supported by protocol). 
[W923 17:49:51.076798643 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:37597 (errno: 97 - Address family not supported by protocol). INFO 09-23 17:49:51 utils.py:975] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=2348) INFO 09-23 17:49:51 utils.py:975] Found nccl from library libnccl.so.2 INFO 09-23 17:49:51 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=2348) INFO 09-23 17:49:51 pynccl.py:63] vLLM is using nccl==2.20.5 INFO 09-23 17:49:51 custom_all_reduce_utils.py:203] generating GPU P2P access cache in /home/luban/.cache/vllm/gpu_p2p_access_cache_for_0,1.json

^CINFO 09-23 17:54:16 multiproc_worker_utils.py:136] Terminating local vLLM worker processes

^Crank0: Traceback (most recent call last): rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/site-packages/vllm/distributed/device_communicators/custom_all_reduce_utils.py", line 220, in gpu_p2p_access_check

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/subprocess.py", line 502, in check_returncode rank0: raise CalledProcessError(self.returncode, self.args, self.stdout, rank0: subprocess.CalledProcessError: Command '['/home/data/miniconda/envs/llm/bin/python', '/home/data/miniconda/envs/llm/lib/python3.11/site-packages/vllm/distributed/device_communicators/custom_all_reduce_utils.py']' died with <Signals.SIGINT: 2>.

rank0: The above exception was the direct cause of the following exception:

rank0: Traceback (most recent call last): rank0: File "/ofs/llm-for-dptd/modelscope/vllm_test.py", line 10, in rank0: model = LLM(model="/ofs/llm-for-dptd/modelscope/LLM-Research/gemma-2-27b-it",

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 175, in init rank0: self.llm_engine = LLMEngine.from_engine_args(

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 473, in from_engine_args rank0: engine = cls(

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 270, in init rank0: self.model_executor = executor_class(

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in init rank0: super().init(*args, **kwargs) rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 46, in init

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 137, in _init_executor

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers rank0: driver_worker_output = driver_worker_method(*args, **kwargs)

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/site-packages/vllm/worker/worker.py", line 175, in init_device rank0: init_worker_distributed_environment(self.parallel_config, self.rank, rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/site-packages/vllm/worker/worker.py", line 450, in init_worker_distributed_environment

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 965, in ensure_model_parallel_initialized

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 931, in initialize_model_parallel rank0: _TP = init_model_parallel_group(group_ranks,

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 773, in init_model_parallel_group rank0: return GroupCoordinator(

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 164, in init rank0: self.ca_comm = CustomAllreduce(

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 130, in init rank0: if not _can_p2p(rank, world_size):

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 31, in _can_p2p rank0: if not gpu_p2p_access_check(rank, i):

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/site-packages/vllm/distributed/device_communicators/custom_all_reduce_utils.py", line 223, in gpu_p2p_access_check rank0: raise RuntimeError( rank0: RuntimeError: Error happened when batch testing peer-to-peer access from (0, 0, 1, 1) to (0, 1, 0, 1): rank0: 2024-09-23 17:49:54.998211: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0. rank0: 2024-09-23 17:49:55.013891: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered rank0: 2024-09-23 17:49:55.032927: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered rank0: 2024-09-23 17:49:55.038641: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered rank0: 2024-09-23 17:49:55.054505: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. rank0: To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. rank0: 2024-09-23 17:49:56.155623: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT rank0: 2024-09-23 17:50:02.535955: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0. rank0: 2024-09-23 17:50:02.536583: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0. 
rank0: 2024-09-23 17:50:02.551631: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered rank0: 2024-09-23 17:50:02.551680: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered rank0: 2024-09-23 17:50:02.571366: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered rank0: 2024-09-23 17:50:02.571366: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered rank0: 2024-09-23 17:50:02.577084: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered rank0: 2024-09-23 17:50:02.577130: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered rank0: 2024-09-23 17:50:02.591725: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. rank0: To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. rank0: 2024-09-23 17:50:02.591727: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. rank0: To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. rank0: 2024-09-23 17:50:03.665481: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT rank0: 2024-09-23 17:50:03.791268: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT rank0: Process SpawnProcess-1: rank0: Traceback (most recent call last): rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/multiprocessing/process.py", line 108, in run rank0: self._target(*self._args, **self._kwargs) rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/site-packages/vllm/distributed/device_communicators/custom_all_reduce_utils.py", line 37, in producer rank0: handle = lib.cudaIpcGetMemHandle(pointer)

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/site-packages/vllm/distributed/device_communicators/cuda_wrapper.py", line 162, in cudaIpcGetMemHandle

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/site-packages/vllm/distributed/device_communicators/cuda_wrapper.py", line 127, in CUDART_CHECK rank0: raise RuntimeError(f"CUDART error: {error_str}") rank0: RuntimeError: CUDART error: invalid argument rank0: Process SpawnProcess-2: rank0: Traceback (most recent call last): rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/site-packages/vllm/distributed/device_communicators/custom_all_reduce_utils.py", line 245, in rank0: result = can_actually_p2p(batch_src, batch_tgt)

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/site-packages/vllm/distributed/device_communicators/custom_all_reduce_utils.py", line 147, in can_actually_p2p

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/multiprocessing/process.py", line 149, in join rank0: res = self._popen.wait(timeout)

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/multiprocessing/popen_fork.py", line 43, in wait rank0: return self.poll(os.WNOHANG if timeout == 0.0 else 0)

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/multiprocessing/popen_fork.py", line 27, in poll rank0: pid, sts = os.waitpid(self.pid, flag)

rank0: Traceback (most recent call last): rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/multiprocessing/process.py", line 108, in run rank0: self._target(*self._args, **self._kwargs) rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/site-packages/vllm/distributed/device_communicators/custom_all_reduce_utils.py", line 67, in consumer rank0: handle = producer_queue.get()

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/multiprocessing/queues.py", line 103, in get rank0: res = self._recv_bytes()

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/multiprocessing/connection.py", line 216, in recv_bytes rank0: buf = self._recv_bytes(maxlength)

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/multiprocessing/connection.py", line 430, in _recv_bytes rank0: buf = self._recv(4)

rank0: File "/home/data/miniconda/envs/llm/lib/python3.11/multiprocessing/connection.py", line 395, in _recv rank0: chunk = read(handle, remaining)

^C

The code I'm using:

```python
from vllm import LLM
import os

os.environ['VLLM_ATTENTION_BACKEND'] = 'FLASHINFER'

model = LLM(model="/ofs/llm-for-dptd/modelscope/LLM-Research/gemma-2-27b-it",
            dtype="auto",
            trust_remote_code=True,
            tokenizer_mode="auto",
            tensor_parallel_size=2)
```

🐛 Describe the bug

It just stays stuck here for a few hours...
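For context, the step that hangs is vLLM probing whether the two GPUs can reach each other's memory. A minimal sketch, using only PyTorch and only checking the capability the driver reports (not the actual data transfer that vLLM's cache generation performs), is:

```python
import torch

# Print the peer-to-peer capability the CUDA driver advertises between GPUs.
# Note: vLLM's cache generation goes further and actually moves data, so a
# "True" here does not guarantee the full P2P check passes.
for src in range(torch.cuda.device_count()):
    for dst in range(torch.cuda.device_count()):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU {src} -> GPU {dst}: peer access reported = {ok}")
```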


immusferr commented 2 months ago

@youkaichao pls

youkaichao commented 2 months ago

Please update to the latest vLLM version.

You can also pass `disable_custom_all_reduce=True` to bypass the check; your GPUs probably don't have P2P capability.
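For reference, a minimal sketch of that workaround (model path and parallel size taken from the script in the issue):

```python
from vllm import LLM

model = LLM(
    model="/ofs/llm-for-dptd/modelscope/LLM-Research/gemma-2-27b-it",  # path from the issue
    tensor_parallel_size=2,
    # Skip the custom all-reduce kernel and its P2P probe; vLLM then falls
    # back to NCCL for cross-GPU communication.
    disable_custom_all_reduce=True,
)
```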

ruleGreen commented 1 month ago

Same problem here.

In some environments it works fine, for example with:

```
export VLLM_WORKER_MULTIPROC_METHOD=spawn
pip install torchvision==0.19.0
pip install torch==2.4.0
pip install deepspeed==0.14.4
pip install vllm==0.6.1.post1
```

However, when I try to use multiple workers/nodes, the same problem happens. Could you provide some solutions? It seems I can't install the latest 0.6.2 due to a network issue. I also tried passing `disable_custom_all_reduce=True`, and it just exits while loading the model. @youkaichao

youkaichao commented 1 month ago

> I tried to pass `disable_custom_all_reduce=True`, and it just exits when loading the model

You need to debug this first. It might not be an issue in vLLM.
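As a sketch of one way to start that debugging, the vLLM debugging guide suggests turning up logging before constructing the engine; whether these settings reveal the problem in this particular case is an assumption:

```python
import os

# Set these before importing vllm, then build the LLM exactly as before.
os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"   # detailed vLLM logs
os.environ["NCCL_DEBUG"] = "TRACE"           # NCCL communication tracing
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"     # surface CUDA errors at the failing call
```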

berserkr commented 3 weeks ago

I have a node with 8x H100 and am seeing the same issue.

youkaichao commented 3 weeks ago

@berserkr please run the test script at https://docs.vllm.ai/en/latest/getting_started/debugging.html#incorrect-hardware-driver first
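For readers who don't follow the link, the check there amounts to a small `torch.distributed` all-reduce over NCCL. A rough sketch (not the exact script from the docs), run with `torchrun --nproc-per-node=<num_gpus> sanity_check.py`:

```python
# sanity_check.py -- a rough NCCL/hardware sanity test, modeled on the
# debugging doc linked above (not the exact script).
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

data = torch.ones(128, device="cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
assert data.mean().item() == dist.get_world_size(), "NCCL all-reduce returned the wrong value"
print(f"rank {dist.get_rank()}: NCCL all-reduce OK")
dist.destroy_process_group()
```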

yxchng commented 2 days ago

Has anyone solved this?