vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: all_reduce assert result == 0, File "torch/cuda/graphs.py", line 88, in capture_end super().capture_end(), RuntimeError: CUDA error: operation failed due to a previous error during capture #4432

Open lmx760581375 opened 4 months ago

lmx760581375 commented 4 months ago

Your current environment

Collecting environment information...
PyTorch version: 2.1.2+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Centos 7 (Final) (x86_64)
GCC version: (GCC) 7.3.0
Clang version: Could not collect
CMake version: version 3.26.1
Libc version: glibc-2.17

Python version: 3.8.12 (default, Nov 11 2021, 20:11:20)  [GCC 7.3.0] (64-bit runtime)
Python platform: Linux-4.14.105-1-tlinux3-0013-x86_64-with-glibc2.2.5
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB
GPU 4: Tesla V100-SXM2-32GB
GPU 5: Tesla V100-SXM2-32GB
GPU 6: Tesla V100-SXM2-32GB
GPU 7: Tesla V100-SXM2-32GB

Nvidia driver version: 450.156.00
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.8.0.5
/usr/lib64/libcudnn_adv_infer.so.8.0.5
/usr/lib64/libcudnn_adv_train.so.8.0.5
/usr/lib64/libcudnn_cnn_infer.so.8.0.5
/usr/lib64/libcudnn_cnn_train.so.8.0.5
/usr/lib64/libcudnn_ops_infer.so.8.0.5
/usr/lib64/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                96
On-line CPU(s) list:   0-95
Thread(s) per core:    2
Core(s) per socket:    24
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz
Stepping:              7
CPU MHz:               3099.587
CPU max MHz:           2501.0000
CPU min MHz:           1000.0000
BogoMIPS:              5000.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              36608K
NUMA node0 CPU(s):     0-23,48-71
NUMA node1 CPU(s):     24-47,72-95
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] nvidia-nccl-cu11==2.19.3
[pip3] pytorchvideo==0.1.5
[pip3] torch==2.1.2+cu118
[pip3] torchaudio==0.9.0
[pip3] torchdata==0.6.0
[pip3] torchvision==0.15.2
[pip3] triton==2.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  mlx5_2  mlx5_3  mlx5_4  mlx5_5  mlx5_6  mlx5_7  mlx5_8  mlx5_9  mlx5_10 mlx5_11 mlx5_12 mlx5_13 mlx5_14 mlx5_15 mlx5_16   mlx5_17 CPU Affinity    NUMA Affinity
GPU0     X      NV1     NV2     NV1     SYS     SYS     SYS     NV2     NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE      NODE    NODE    0-23,48-71      0
GPU1    NV1      X      NV1     NV2     SYS     SYS     NV2     SYS     NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE      NODE    NODE    0-23,48-71      0
GPU2    NV2     NV1      X      NV2     SYS     NV1     SYS     SYS     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX       PIX     PIX     0-23,48-71      0
GPU3    NV1     NV2     NV2      X      NV1     SYS     SYS     SYS     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX       PIX     PIX     0-23,48-71      0
GPU4    SYS     SYS     SYS     NV1      X      NV2     NV2     NV1     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS       SYS     SYS     24-47,72-95     1
GPU5    SYS     SYS     NV1     SYS     NV2      X      NV1     NV2     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS       SYS     SYS     24-47,72-95     1
GPU6    SYS     NV2     SYS     SYS     NV2     NV1      X      NV1     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS       SYS     SYS     24-47,72-95     1
GPU7    NV2     SYS     SYS     SYS     NV1     NV2     NV1      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS       SYS     SYS     24-47,72-95     1
mlx5_0  NODE    NODE    PIX     PIX     SYS     SYS     SYS     SYS      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX       PIX     PIX
mlx5_1  NODE    NODE    PIX     PIX     SYS     SYS     SYS     SYS     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX       PIX     PIX
mlx5_2  NODE    NODE    PIX     PIX     SYS     SYS     SYS     SYS     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX       PIX     PIX
mlx5_3  NODE    NODE    PIX     PIX     SYS     SYS     SYS     SYS     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX       PIX     PIX
mlx5_4  NODE    NODE    PIX     PIX     SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX       PIX     PIX
mlx5_5  NODE    NODE    PIX     PIX     SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX       PIX     PIX
mlx5_6  NODE    NODE    PIX     PIX     SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX       PIX     PIX
mlx5_7  NODE    NODE    PIX     PIX     SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX       PIX     PIX
mlx5_8  NODE    NODE    PIX     PIX     SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX       PIX     PIX
mlx5_9  NODE    NODE    PIX     PIX     SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX       PIX     PIX
mlx5_10 NODE    NODE    PIX     PIX     SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX       PIX     PIX
mlx5_11 NODE    NODE    PIX     PIX     SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX       PIX     PIX
mlx5_12 NODE    NODE    PIX     PIX     SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX       PIX     PIX
mlx5_13 NODE    NODE    PIX     PIX     SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX       PIX     PIX
mlx5_14 NODE    NODE    PIX     PIX     SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX       PIX     PIX
mlx5_15 NODE    NODE    PIX     PIX     SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X        PIX     PIX
mlx5_16 NODE    NODE    PIX     PIX     SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX        X      PIX
mlx5_17 NODE    NODE    PIX     PIX     SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX       PIX      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

When I run starcoder2, the following error comes out:

2024-04-28 20:49:41,941 INFO worker.py:1752 -- Started a local Ray instance. INFO 04-28 20:49:43 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model='/apdcephfs_cq10/share_1567347/share_info/llm_models/starcoder2-15b', tokenizer='/apdcephfs_cq10/share_1567347/share_info/llm_models/starcoder2-15b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0) /usr/local/python/lib/python3.8/site-packages/vllm/executor/ray_gpu_executor.py:87: UserWarning: Failed to get the IP address, using 0.0.0.0 by default.The value can be set by the environment variable HOST_IP. driver_ip = get_ip() (RayWorkerVllm pid=62031) /usr/local/python/lib/python3.8/site-packages/vllm/engine/ray_utils.py:48: UserWarning: Failed to get the IP address, using 0.0.0.0 by default.The value can be set by the environment variable HOST_IP. (RayWorkerVllm pid=62031) return get_ip() INFO 04-28 20:49:50 selector.py:40] Cannot use FlashAttention backend for Volta and Turing GPUs. INFO 04-28 20:49:50 selector.py:25] Using XFormers backend. (RayWorkerVllm pid=62111) INFO 04-28 20:49:52 selector.py:40] Cannot use FlashAttention backend for Volta and Turing GPUs. (RayWorkerVllm pid=62111) INFO 04-28 20:49:52 selector.py:25] Using XFormers backend. INFO 04-28 20:49:52 pynccl_utils.py:45] vLLM is using nccl==2.10.3 (RayWorkerVllm pid=62111) INFO 04-28 20:49:52 pynccl_utils.py:45] vLLM is using nccl==2.10.3 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO Bootstrap : Using eth1:9.91.2.209<0> ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1. ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO NET/Socket : Using [0]eth1:9.91.2.209<0> ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO Using network Socket ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO Bootstrap : Using eth1:9.91.2.209<0> ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO cudaDriverVersion 11080 NCCL version 2.18.6+cuda11.8 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1. 
ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO NET/Socket : Using [0]eth1:9.91.2.209<0> ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO Using network Socket ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO comm 0x53f905b0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1a000 commId 0x605ed9e94f174b - Init START ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO Channel 00/02 : 0 1 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO Channel 01/02 : 0 1 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO P2P Chunksize set to 131072 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO Connected all rings ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO Connected all trees ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO comm 0x53f905b0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1a000 commId 0x605ed9e94f174b - Init COMPLETE NCCL version 2.10.3+cuda11.0 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO Channel 00/02 : 0 1 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO Channel 01/02 : 0 1 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO Channel 00 : 0[1a000] -> 1[1b000] via direct shared memory ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO Channel 01 : 0[1a000] -> 1[1b000] via direct shared memory ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO Connected all rings ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO Connected all trees ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO comm 0x5483be20 rank 0 nranks 2 cudaDev 0 busId 1a000 - Init COMPLETE (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO Bootstrap : Using eth1:9.91.2.209<0> (RayWorkerVllm pid=62111) 
ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1. (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO NET/Socket : Using [0]eth1:9.91.2.209<0> (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO Using network Socket (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO Channel 00 : 1[1b000] -> 0[1a000] via direct shared memory (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO Channel 01 : 1[1b000] -> 0[1a000] via direct shared memory (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO Connected all rings (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO Connected all trees (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO Launch mode Parallel (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO comm 0xbd04190 rank 1 nranks 2 cudaDev 1 busId 1b000 - Init COMPLETE INFO 04-28 20:50:02 model_runner.py:104] Loading model weights took 14.8672 GB ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO Using network Socket (RayWorkerVllm pid=62111) INFO 04-28 20:50:12 model_runner.py:104] Loading model weights took 14.8672 GB ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO comm 0x98440b0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1a000 commId 0x59afa392e79b7504 - Init START ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO Channel 00/02 : 0 1 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO Channel 01/02 : 0 1 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO P2P Chunksize set to 131072 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO Connected all rings ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO Connected all trees ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO threadThresholds 8/8/64 | 
16/8/64 | 512 | 512 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO comm 0x98440b0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1a000 commId 0x59afa392e79b7504 - Init COMPLETE INFO 04-28 20:50:20 ray_gpu_executor.py:240] # GPU blocks: 15176, # CPU blocks: 6553 INFO 04-28 20:50:23 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. INFO 04-28 20:50:23 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage. (RayWorkerVllm pid=62111) INFO 04-28 20:50:23 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. (RayWorkerVllm pid=62111) INFO 04-28 20:50:23 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.

ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] enqueue.cc:267 NCCL WARN Cuda failure 'dependency created on uncaptured work in another stream'
ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO enqueue.cc:1045 -> 1
Traceback (most recent call last):
  File "/usr/local/python/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 921, in capture
    hidden_states = self.model(
  File "/usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/python/lib/python3.8/site-packages/vllm/model_executor/models/starcoder2.py", line 260, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/python/lib/python3.8/site-packages/vllm/model_executor/models/starcoder2.py", line 219, in forward
    hidden_states = self.embed_tokens(input_ids)
  File "/usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/python/lib/python3.8/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 107, in forward
    output = tensor_model_parallel_all_reduce(output_parallel)
  File "/usr/local/python/lib/python3.8/site-packages/vllm/model_executor/parallel_utils/communication_op.py", line 35, in tensor_model_parallel_all_reduce
    pynccl_utils.all_reduce(input_)
  File "/usr/local/python/lib/python3.8/site-packages/vllm/model_executor/parallel_utils/pynccl_utils.py", line 55, in all_reduce
    comm.all_reduce(input_, op)
  File "/usr/local/python/lib/python3.8/site-packages/vllm/model_executor/parallel_utils/pynccl.py", line 258, in all_reduce
    assert result == 0
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "lua_test_file_gen_vllm.py", line 221, in <module>
    main()
  File "lua_test_file_gen_vllm.py", line 105, in main
    llm = LLM(model=args.model, tensor_parallel_size=args.num_gpus, dtype="float16", gpu_memory_utilization=0.9)
  File "/usr/local/python/lib/python3.8/site-packages/vllm/entrypoints/llm.py", line 112, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/usr/local/python/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 196, in from_engine_args
    engine = cls(
  File "/usr/local/python/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 110, in __init__
    self.model_executor = executor_class(model_config, cache_config,
  File "/usr/local/python/lib/python3.8/site-packages/vllm/executor/ray_gpu_executor.py", line 65, in __init__
    self._init_cache()
  File "/usr/local/python/lib/python3.8/site-packages/vllm/executor/ray_gpu_executor.py", line 253, in _init_cache
    self._run_workers("warm_up_model")
  File "/usr/local/python/lib/python3.8/site-packages/vllm/executor/ray_gpu_executor.py", line 324, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/usr/local/python/lib/python3.8/site-packages/vllm/worker/worker.py", line 167, in warm_up_model
    self.model_runner.capture_model(self.gpu_cache)
  File "/usr/local/python/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/python/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 854, in capture_model
    graph_runner.capture(
  File "/usr/local/python/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 921, in capture
    hidden_states = self.model(
  File "/usr/local/python/lib/python3.8/site-packages/torch/cuda/graphs.py", line 197, in __exit__
    self.cuda_graph.capture_end()
  File "/usr/local/python/lib/python3.8/site-packages/torch/cuda/graphs.py", line 88, in capture_end
    super().capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

(RayWorkerVllm pid=62111) (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] enqueue.cc:267 NCCL WARN Cuda failure 'dependency created on uncaptured work in another stream' (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO enqueue.cc:1045 -> 1 (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] Error executing method warm_up_model. This might cause deadlock in distributed execution. (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] Traceback (most recent call last): (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 921, in capture (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] hidden_states = self.model( (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] return self._call_impl(*args, kwargs) (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] return forward_call(*args, *kwargs) (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/vllm/model_executor/models/starcoder2.py", line 260, in forward (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] hidden_states = self.model(input_ids, positions, kv_caches, (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] return self._call_impl(args, kwargs) (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] return forward_call(*args, kwargs) (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/vllm/model_executor/models/starcoder2.py", line 219, in forward (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] hidden_states = self.embed_tokens(input_ids) (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] return self._call_impl(*args, *kwargs) (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] return forward_call(args, kwargs) (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 107, in forward (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] output = tensor_model_parallel_all_reduce(output_parallel) (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File 
"/usr/local/python/lib/python3.8/site-packages/vllm/model_executor/parallel_utils/communication_op.py", line 35, in tensor_model_parallel_all_reduce (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] pynccl_utils.allreduce(input) (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/vllm/model_executor/parallel_utils/pynccl_utils.py", line 55, in all_reduce (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] comm.allreduce(input, op) (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/vllm/model_executor/parallel_utils/pynccl.py", line 258, in all_reduce (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] assert result == 0 (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] AssertionError (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] During handling of the above exception, another exception occurred: (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] Traceback (most recent call last): (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/vllm/engine/ray_utils.py", line 37, in execute_method (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] return executor(*args, *kwargs) (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/vllm/worker/worker.py", line 167, in warm_up_model (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] self.model_runner.capture_model(self.gpu_cache) (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] return func(args, **kwargs) (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 854, in capture_model (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] graph_runner.capture( (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 921, in capture (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] hidden_states = self.model( (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/torch/cuda/graphs.py", line 197, in exit (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] self.cuda_graph.capture_end() (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/torch/cuda/graphs.py", line 88, in capture_end (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] super().capture_end() (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] RuntimeError: CUDA error: operation failed due to a previous error during capture (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] For debugging consider passing CUDA_LAUNCH_BLOCKING=1. 
(RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions. (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] (RayWorkerVllm pid=62111) /usr/local/python/lib/python3.8/site-packages/vllm/engine/ray_utils.py:48: UserWarning: Failed to get the IP address, using 0.0.0.0 by default.The value can be set by the environment variable HOST_IP. (RayWorkerVllm pid=62111) return get_ip()
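
For reference, a minimal sketch of the launch that runs into this (the model path is a placeholder; the arguments mirror the LLM(...) call in the traceback above):

from vllm import LLM

# Crashes on this environment during the CUDA graph warm-up
# (warm_up_model -> capture_model) when tensor_parallel_size > 1.
llm = LLM(
    model="/path/to/starcoder2-15b",   # placeholder path
    tensor_parallel_size=2,
    dtype="float16",
    gpu_memory_utilization=0.9,
)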

lmx760581375 commented 4 months ago

I found that it only occurs when tensor_parallel_size > 1; the synchronization between the workers is what fails, which corresponds to the comm.all_reduce(input, op) call.
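
As the warm-up log above already suggests, disabling CUDA graph capture sidesteps this code path. A workaround sketch (it avoids the crash but does not fix the underlying NCCL mismatch; the model path is a placeholder):

from vllm import LLM

# enforce_eager=True makes vLLM skip CUDA graph capture, so the pynccl
# all_reduce is never issued while a graph is being captured.
llm = LLM(
    model="/path/to/starcoder2-15b",   # placeholder path
    tensor_parallel_size=2,
    dtype="float16",
    enforce_eager=True,
)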

NingNanXin commented 4 months ago

Did you solve it? Same environment and same error. I set the tensor_parallel_size=2

lmx760581375 commented 4 months ago

Did you solve it? Same environment and same error. I set the tensor_parallel_size=2

Yeah, I found that this is related to the NCCL version: you have to use an NCCL build that matches your CUDA version. My guess is that the NCCL picked up by torch cannot find the actual functions under /usr/local/nccl, which makes the CUDA graph computation fail. You can download and reinstall the matching NCCL version from NVIDIA's official website. I reinstalled an NCCL compiled for CUDA 11.8 and it worked.
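
A quick way to check whether the builds line up (a sketch; the /usr paths are just common install locations and may differ on your system):

import glob
import torch

print("torch CUDA build:", torch.version.cuda)          # e.g. '11.8'
print("torch NCCL build:", torch.cuda.nccl.version())   # e.g. (2, 18, 6)
print("host NCCL libs  :",
      glob.glob("/usr/lib*/libnccl.so*")
      + glob.glob("/usr/local/**/libnccl.so*", recursive=True))

If the CUDA version the host NCCL was built against does not match torch's CUDA build, reinstalling the matching NCCL as described above is the fix that worked here.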

NingNanXin commented 4 months ago

Thanks for your reply. My host NCCL is 2.15+cu118, the torch NCCL version is 2.18.6 (from torch.cuda.nccl.version()), vLLM is 0.4.0, and it shows "vLLM is using nccl==2.70.8". Did you reinstall an NCCL compiled for CUDA 11.8 and set it via an environment variable? Thanks for sharing your experience.

lmx760581375 commented 4 months ago

When you start vLLM, you will see two NCCL versions appear in the log: one for torch, and another that seems to come from the pynccl wrapper but in fact ends up calling the NCCL installed on your host. If the log prints an NCCL version that is actually used but does not match your host's version, there is probably a problem with your environment variables, and you may have multiple NCCL installations on your host.
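
To see which libnccl actually gets loaded at runtime, one option is to query it directly through the standard ncclGetVersion() C API (a sketch; libnccl.so.2 is the usual soname, and the file the loader resolves may or may not be the one vLLM's pynccl wrapper ends up using):

import ctypes

# Load whichever libnccl.so.2 the dynamic loader resolves first and ask it
# which version it is.
lib = ctypes.CDLL("libnccl.so.2")
ver = ctypes.c_int()
ret = lib.ncclGetVersion(ctypes.byref(ver))
assert ret == 0, "ncclGetVersion failed"
print("loaded NCCL reports:", ver.value)   # e.g. 21803 for 2.18.3 on recent builds

Comparing that number with torch.cuda.nccl.version() and with the version printed in the vLLM log makes it easier to spot a stray NCCL installation.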

NingNanXin commented 4 months ago

When you start vLLM, you will see two NCCL versions appear in the log: one for torch, and another that seems to come from the pynccl wrapper but in fact ends up calling the NCCL installed on your host. If the log prints an NCCL version that is actually used but does not match your host's version, there is probably a problem with your environment variables, and you may have multiple NCCL installations on your host.

Thanks! There are 4 CUDA versions on my host, which caused torch to pick up the wrong NCCL version.