vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: The vLLM service takes two hours to start because of NCCL #5405

Closed: zhaotyer closed this issue 4 months ago

zhaotyer commented 4 months ago

Your current environment

The output of `python collect_env.py`
Collecting environment information...
PyTorch version: 2.2.1+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.29.0
Libc version: glibc-2.31

Python version: 3.8.10 (default, Nov 22 2023, 10:22:35)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.242-1.el7.elrepo.x86_64-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB

Nvidia driver version: 535.104.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.7.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 57 bits virtual
CPU(s):                          64
On-line CPU(s) list:             0-63
Thread(s) per core:              1
Core(s) per socket:              32
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           106
Model name:                      Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
Stepping:                        6
Frequency boost:                 enabled
CPU MHz:                         800.061
CPU max MHz:                     2601.0000
CPU min MHz:                     800.0000
BogoMIPS:                        5200.00
Virtualization:                  VT-x
L1d cache:                       3 MiB
L1i cache:                       2 MiB
L2 cache:                        80 MiB
L3 cache:                        96 MiB
NUMA node0 CPU(s):               0-31
NUMA node1 CPU(s):               32-63
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local wbnoinvd dtherm ida arat pln pts avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear pconfig flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] nvidia-nccl-cu11==2.19.3
[pip3] onnx==1.15.0
[pip3] paddle2onnx==1.1.0
[pip3] sentence-transformers==2.2.2
[pip3] torch==2.2.1+cu118
[pip3] torchaudio==2.2.1+cu118
[pip3] torchtext==0.5.0
[pip3] torchvision==0.17.1+cu118
[pip3] transformers==4.40.0
[pip3] transformers-stream-generator==0.0.4
[pip3] triton==2.2.0
[pip3] tritonclient==2.19.0
[pip3] vllm-nccl-cu11==2.18.1.0.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS     0-31    0               N/A
GPU1    NV12     X      NV12    NV12    PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS     0-31    0               N/A
GPU2    NV12    NV12     X      NV12    NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS     0-31    0               N/A
GPU3    NV12    NV12    NV12     X      NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS     0-31    0               N/A
NIC0    PXB     PXB     NODE    NODE     X      PIX     NODE    NODE    SYS     SYS     SYS     SYS
NIC1    PXB     PXB     NODE    NODE    PIX      X      NODE    NODE    SYS     SYS     SYS     SYS
NIC2    NODE    NODE    PXB     PXB     NODE    NODE     X      PIX     SYS     SYS     SYS     SYS
NIC3    NODE    NODE    PXB     PXB     NODE    NODE    PIX      X      SYS     SYS     SYS     SYS
NIC4    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX     NODE    NODE
NIC5    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X      NODE    NODE
NIC6    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE     X      PIX
NIC7    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    PIX      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7

🐛 Describe the bug

The error message is:
INFO 06-07 10:19:27 model.py:266] begin to init ModelHandler
INFO 06-07 10:19:28 config.py:407] Custom all-reduce kernels are temporarily disabled due to stability issues. We will re-enable them once the issues are resolved.
2024-06-07 10:19:30,701 INFO worker.py:1724 -- Started a local Ray instance.
INFO 06-07 10:19:32 llm_engine.py:79] Initializing an LLM engine with config: model='/models/atom/1/local_model/base_model', tokenizer='/models/atom/1/local_model/base_model', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=4, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO Bootstrap : Using eth0:10.224.5.210<0>
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.15.5+cuda11.8
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO P2P plugin IBext
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO NET/IB : No device found.
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO NET/IB : No device found.
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO NET/Socket : Using [0]eth0:10.224.5.210<0>
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO Using network Socket
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO Channel 00/02 :    0   1   2   3
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO Channel 01/02 :    0   1   2   3
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO Channel 00 : 0[27000] -> 1[2a000] via SHM/direct/direct
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO Channel 01 : 0[27000] -> 1[2a000] via SHM/direct/direct
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO Connected all rings
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO Connected all trees
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO comm 0x560e1a4a0240 rank 0 nranks 4 cudaDev 0 busId 27000 - Init COMPLETE
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO Bootstrap : Using eth0:10.224.5.210<0>
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO NET/Plugin: Loaded net plugin IBext (v5)
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:291 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.18.6+cuda11.8
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:4483 [0] NCCL INFO NET/IB : No device found.
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:4483 [0] NCCL INFO NET/Socket : Using [0]eth0:10.224.5.210<0>
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:4483 [0] NCCL INFO Using network Socket
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:4483 [0] NCCL INFO comm 0x560e1ab19ab0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 27000 commId 0xfdb9a1387c96bf27 - Init START
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:4483 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:4483 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:4483 [0] NCCL INFO Channel 00/02 :    0   1   2   3
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:4483 [0] NCCL INFO Channel 01/02 :    0   1   2   3
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:4483 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:4483 [0] NCCL INFO P2P Chunksize set to 131072
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:4483 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:4483 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:4483 [0] NCCL INFO Connected all rings
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:4483 [0] NCCL INFO Connected all trees
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:4483 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:4483 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:4483 [0] NCCL INFO NCCL_LAUNCH_MODE set by environment to GROUP
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:4483 [0] NCCL INFO comm 0x560e1ab19ab0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 27000 commId 0xfdb9a1387c96bf27 - Init COMPLETE
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:5595 [0] NCCL INFO Using network Socket
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:5595 [0] NCCL INFO comm 0x560e24f599a0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 27000 commId 0xebcd10afa0bbcee4 - Init START
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO cudaDriverVersion 12020
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO Bootstrap : Using eth0:10.224.5.210<0>
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO P2P plugin IBext
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO NET/IB : No device found.
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO NET/IB : No device found.
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO NET/Socket : Using [0]eth0:10.224.5.210<0>
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO Using network Socket
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO Channel 00 : 1[2a000] -> 2[51000] via SHM/direct/direct
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO Channel 01 : 1[2a000] -> 2[51000] via SHM/direct/direct
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO Connected all rings
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO Channel 00 : 1[2a000] -> 0[27000] via SHM/direct/direct
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO Channel 01 : 1[2a000] -> 0[27000] via SHM/direct/direct
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO Connected all trees
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO comm 0xcd60720 rank 1 nranks 4 cudaDev 1 busId 2a000 - Init COMPLETE
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO cudaDriverVersion 12020
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO Bootstrap : Using eth0:10.224.5.210<0>
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO NET/Plugin: Loaded net plugin IBext (v5)
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4095 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4484 [1] NCCL INFO NET/IB : No device found.
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4484 [1] NCCL INFO NET/Socket : Using [0]eth0:10.224.5.210<0>
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4484 [1] NCCL INFO Using network Socket
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4484 [1] NCCL INFO comm 0xd3dd370 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 2a000 commId 0xfdb9a1387c96bf27 - Init START
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4484 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4484 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4484 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4484 [1] NCCL INFO P2P Chunksize set to 131072
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4484 [1] NCCL INFO Channel 00 : 1[1] -> 2[2] via SHM/direct/direct
(RayWorkerVllm pid=4095) mwh-deployme
(RayWorkerVllm pid=4298) mwh-depl
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4484 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4484 [1] NCCL INFO NCCL_LAUNCH_MODE set by environment to GROUP
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:4484 [1] NCCL INFO comm 0xd3dd370 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 2a000 commId 0xfdb9a1387c96bf27 - Init COMPLETE
(RayWorkerVllm pid=4298) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4298:4298 [3] NCCL INFO cudaDriverVersion 12020 [repeated 4x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(RayWorkerVllm pid=4298) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4298:4298 [3] NCCL INFO Bootstrap : Using eth0:10.224.5.210<0> [repeated 4x across cluster]
(RayWorkerVllm pid=4298) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4298:4298 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. [repeated 8x across cluster]
(RayWorkerVllm pid=4298) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4298:4298 [3] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) [repeated 2x across cluster]
(RayWorkerVllm pid=4298) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4298:4298 [3] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) [repeated 4x across cluster]
(RayWorkerVllm pid=4298) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4298:4298 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so [repeated 2x across cluster]
(RayWorkerVllm pid=4298) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4298:4298 [3] NCCL INFO P2P plugin IBext [repeated 2x across cluster]
(RayWorkerVllm pid=4298) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4298:4486 [3] NCCL INFO NET/IB : No device found. [repeated 6x across cluster]
(RayWorkerVllm pid=4298) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4298:4486 [3] NCCL INFO NET/Socket : Using [0]eth0:10.224.5.210<0> [repeated 4x across cluster]
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:5596 [1] NCCL INFO Using network Socket [repeated 5x across cluster]
(RayWorkerVllm pid=4298) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4298:4486 [3] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC [repeated 4x across cluster]
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:5595 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:5595 [0] NCCL INFO Channel 00/02 :    0   1   2   3
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:5595 [0] NCCL INFO Channel 01/02 :    0   1   2   3
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:5595 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:5595 [0] NCCL INFO P2P Chunksize set to 131072
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:5595 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:5595 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:5595 [0] NCCL INFO Connected all rings
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:5595 [0] NCCL INFO Connected all trees
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:5595 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:5595 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
mwh-deployment-cp2nvtm93l1ljsuft5kg-0:291:5595 [0] NCCL INFO comm 0x560e24f599a0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 27000 commId 0xebcd10afa0bbcee4 - Init COMPLETE
(RayWorkerVllm pid=4298) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4298:4486 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff [repeated 4x across cluster]
(RayWorkerVllm pid=4298) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4298:4486 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 [repeated 4x across cluster]
(RayWorkerVllm pid=4229) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4229:4485 [2] NCCL INFO Channel 01 : 2[2] -> 1[1] via SHM/direct/direct [repeated 19x across cluster]
(RayWorkerVllm pid=4229) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4229:4485 [2] NCCL INFO Connected all rings [repeated 5x across cluster]
(RayWorkerVllm pid=4229) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4229:4485 [2] NCCL INFO Connected all trees [repeated 5x across cluster]
(RayWorkerVllm pid=4229) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4229:4485 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512 [repeated 5x across cluster]
(RayWorkerVllm pid=4298) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4298:4298 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer [repeated 2x across cluster]
(RayWorkerVllm pid=4298) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4298:4298 [3] NCCL INFO comm 0xdce9190 rank 3 nranks 4 cudaDev 3 busId 57000 - Init COMPLETE [repeated 2x across cluster]
(RayWorkerVllm pid=4298) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4298:4298 [3] NCCL INFO NET/Plugin: Loaded net plugin IBext (v5) [repeated 2x across cluster]
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:5596 [1] NCCL INFO comm 0x17dc7d60 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 2a000 commId 0xebcd10afa0bbcee4 - Init START [repeated 3x across cluster]
(RayWorkerVllm pid=4298) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4298:4486 [3] NCCL INFO P2P Chunksize set to 131072 [repeated 2x across cluster]
(RayWorkerVllm pid=4229) mwh-deployme
(RayWorkerVllm pid=4229) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4229:4485 [2] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer [repeated 2x across cluster]
(RayWorkerVllm pid=4229) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4229:4485 [2] NCCL INFO NCCL_LAUNCH_MODE set by environment to GROUP [repeated 2x across cluster]
(RayWorkerVllm pid=4229) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4229:4485 [2] NCCL INFO comm 0xe03a7d0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 51000 commId 0xfdb9a1387c96bf27 - Init COMPLETE [repeated 2x across cluster]
INFO 06-07 12:08:16 llm_engine.py:337] # GPU blocks: 3477, # CPU blocks: 409
INFO 06-07 12:08:20 model_runner.py:666] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-07 12:08:20 model_runner.py:670] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerVllm pid=4095) INFO 06-07 12:08:20 model_runner.py:666] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerVllm pid=4095) INFO 06-07 12:08:20 model_runner.py:670] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerVllm pid=4298) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4298:5598 [3] NCCL INFO Using network Socket [repeated 2x across cluster]
(RayWorkerVllm pid=4298) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4298:5598 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff [repeated 3x across cluster]
(RayWorkerVllm pid=4298) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4298:5598 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 [repeated 3x across cluster]
(RayWorkerVllm pid=4298) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4298:5609 [3] NCCL INFO Channel 01 : 3[3] -> 0[0] via SHM/direct/direct [repeated 18x across cluster]
(RayWorkerVllm pid=4298) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4298:5598 [3] NCCL INFO Connected all rings [repeated 3x across cluster]
(RayWorkerVllm pid=4298) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4298:5598 [3] NCCL INFO Connected all trees [repeated 3x across cluster]
(RayWorkerVllm pid=4298) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4298:5598 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512 [repeated 3x across cluster]
(RayWorkerVllm pid=4298) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4298:5598 [3] NCCL INFO comm 0x18da7680 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 57000 commId 0xebcd10afa0bbcee4 - Init START [repeated 2x across cluster]
(RayWorkerVllm pid=4298) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4298:5598 [3] NCCL INFO P2P Chunksize set to 131072 [repeated 3x across cluster]
(RayWorkerVllm pid=4298) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4298:5598 [3] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer [repeated 3x across cluster]
(RayWorkerVllm pid=4095) mwh-deployment-cp2nvtm93l1ljsuft5kg-0:4095:5596 [1] NCCL INFO comm 0x17dc7d60 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 2a000 commId 0xebcd10afa0bbcee4 - Init COMPLETE [repeated 3x across cluster]
INFO 06-07 12:08:34 model_runner.py:738] Graph capturing finished in 14 secs.
INFO 06-07 12:08:34 model.py:394] vllm load model finished
INFO 06-07 12:08:34 model.py:296] The model loading process takes time: 6547.34 s
INFO 06-07 12:08:34 model.py:1179] Replace eos token: <|im_end|>
INFO 06-07 12:08:34 model.py:1284] Preprocess python backend inited
ENV:
NCCL_P2P_DISABLE=1
NCCL_DEBUG=TRACE
VLLM_TRACE_FUNCTION=1

I tested with both vllm 0.3.1 and 0.4.1, and the service startup blocks in NCCL. I hope you can find out the reason; I don't know much about NCCL.

youkaichao commented 4 months ago

VLLM_TRACE_FUNCTION should not be used unless you are debugging hang/crash.

zhaotyer commented 4 months ago

VLLM_TRACE_FUNCTION should not be used unless you are debugging hang/crash.

I turned it on because the service kept hanging when it started.

youkaichao commented 4 months ago

Then what is the last function Python executes? This should give you a hint on why it hangs.

zhaotyer commented 4 months ago

Then what is the last function Python executes? This should give you a hint on why it hangs.

(RayWorkerWrapper pid=4292) INFO 06-11 05:00:27 utils.py:608] Found nccl from library /usr/lib/x86_64-linux-gnu/libnccl.so.2 [repeated 2x across cluster]
INFO 06-11 05:00:59 selector.py:28] Using FlashAttention backend.
(RayWorkerWrapper pid=4222) INFO 06-11 05:01:01 pynccl_utils.py:43] vLLM is using nccl==2.15.5
(RayWorkerWrapper pid=4086) INFO 06-11 05:00:55 selector.py:28] Using FlashAttention backend. [repeated 2x across cluster]
INFO 06-11 05:01:01 pynccl_utils.py:43] vLLM is using nccl==2.15.5

It blocks after printing some NCCL logs

youkaichao commented 4 months ago

You can try the latest version. I don't remember exactly when VLLM_TRACE_FUNCTION is enabled. When it is enabled, you should notice a logging message showing the trace file (which can be quite large).

zhaotyer commented 4 months ago

You can try the latest version. I don't remember exactly when VLLM_TRACE_FUNCTION is enabled. When it is enabled, you should notice a logging message showing the trace file (which can be quite large).

With vllm 0.4.1, VLLM_TRACE_FUNCTION is enabled:

(RayWorkerWrapper pid=4095) WARNING 06-11 06:20:42 logger.py:125] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
(RayWorkerWrapper pid=4095) INFO 06-11 06:20:42 logger.py:129] Trace frame log is saved to /tmp/vllm/vllm-instance-8c06bc620fd54e22a4644b512a767c97/VLLM_TRACE_FUNCTION_for_process_4095_thread_140422150293312_at_2024-06-11_06:20:42.214157.log

tail -f VLLM_TRACE_FUNCTION_for_process_4095_thread_140422150293312_at_2024-06-11_06:20:42.214157.log

2024-06-11 06:33:24.763706 Call to __getitem__ in /usr/local/lib/python3.8/dist-packages/torch/storage.py:308 from safetensors_weights_iterator in /usr/local/lib/python3.8/dist-packages/vllm/model_executor/model_loader/weight_utils.py:271
2024-06-11 06:33:24.763831 Return from __getitem__ in /usr/local/lib/python3.8/dist-packages/torch/storage.py:311 to safetensors_weights_iterator in /usr/local/lib/python3.8/dist-packages/vllm/model_executor/model_loader/weight_utils.py:271
2024-06-11 06:33:24.764090 Return from safetensors_weights_iterator in /usr/local/lib/python3.8/dist-packages/vllm/model_executor/model_loader/weight_utils.py:272 to load_weights in /usr/local/lib/python3.8/dist-packages/vllm/model_executor/models/qwen2.py:343
2024-06-11 06:33:24.764189 Call to __getattribute__ in /usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py:260 from load_weights in /usr/local/lib/python3.8/dist-packages/vllm/model_executor/models/qwen2.py:346
2024-06-11 06:33:24.764243 Return from __getattribute__ in /usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py:263 to load_weights in /usr/local/lib/python3.8/dist-packages/vllm/model_executor/models/qwen2.py:346
2024-06-11 06:33:24.764362 Call to weight_loader in /usr/local/lib/python3.8/dist-packages/vllm/model_executor/layers/linear.py:596 from load_weights in /usr/local/lib/python3.8/dist-packages/vllm/model_executor/models/qwen2.py:366
2024-06-11 06:33:24.764422 Call to get_tensor_model_parallel_rank in /usr/local/lib/python3.8/dist-packages/vllm/distributed/parallel_state.py:222 from weight_loader in /usr/local/lib/python3.8/dist-packages/vllm/model_executor/layers/linear.py:597
2024-06-11 06:33:24.764470 Call to get_tensor_model_parallel_group in /usr/local/lib/python3.8/dist-packages/vllm/distributed/parallel_state.py:196 from get_tensor_model_parallel_rank in /usr/local/lib/python3.8/dist-packages/vllm/distributed/parallel_state.py:224
2024-06-11 06:33:24.764518 Return from get_tensor_model_parallel_group in /usr/local/lib/python3.8/dist-packages/vllm/distributed/parallel_state.py:200 to get_tensor_model_parallel_rank in /usr/local/lib/python3.8/dist-packages/vllm/distributed/parallel_state.py:224
2024-06-11 06:33:24.764546 Call to get_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:1512 from get_tensor_model_parallel_rank in /usr/local/lib/python3.8/dist-packages/vllm/distributed/parallel_state.py:224
2024-06-11 06:33:24.764589 Call to _rank_not_in_group in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:747 from get_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:1529
2024-06-11 06:33:24.764617 Return from _rank_not_in_group in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:751 to get_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:1529
2024-06-11 06:33:24.764664 Call to _get_default_group in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:974 from get_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:1532
2024-06-11 06:33:24.764706 Call to is_initialized in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:948 from _get_default_group in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:976
2024-06-11 06:33:24.764732 Call to WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:583 from is_initialized in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:950
2024-06-11 06:33:24.764792 Call to default_pg in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:453 from WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:585
2024-06-11 06:33:24.764835 Return from default_pg in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:461 to WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:585
2024-06-11 06:33:24.764858 Return from WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:585 to is_initialized in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:950
2024-06-11 06:33:24.764923 Return from is_initialized in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:950 to _get_default_group in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:976
2024-06-11 06:33:24.764972 Call to WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:583 from _get_default_group in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:981
2024-06-11 06:33:24.765019 Call to default_pg in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:453 from WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:585
2024-06-11 06:33:24.765043 Return from default_pg in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:461 to WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:585
2024-06-11 06:33:24.765082 Return from WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:585 to _get_default_group in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:981
2024-06-11 06:33:24.765104 Return from _get_default_group in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:981 to get_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:1532
2024-06-11 06:33:24.765147 Call to WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:583 from get_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:1533
2024-06-11 06:33:24.765188 Call to default_pg in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:453 from WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:585
2024-06-11 06:33:24.765212 Return from default_pg in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:461 to WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:585
2024-06-11 06:33:24.765250 Return from WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:585 to get_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:1533
2024-06-11 06:33:24.765319 Call to get_group_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:762 from get_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:1536
2024-06-11 06:33:24.765346 Call to WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:583 from get_group_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:777
2024-06-11 06:33:24.765391 Call to default_pg in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:453 from WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:585
2024-06-11 06:33:24.765442 Return from default_pg in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:461 to WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:585
2024-06-11 06:33:24.765466 Return from WORLD in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:585 to get_group_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:777
2024-06-11 06:33:24.765509 Call to pg_group_ranks in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:490 from get_group_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:779
2024-06-11 06:33:24.765560 Return from pg_group_ranks in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:498 to get_group_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:779
2024-06-11 06:33:24.765586 Call to pg_group_ranks in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:490 from get_group_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:781
2024-06-11 06:33:24.765629 Return from pg_group_ranks in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:498 to get_group_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:781
2024-06-11 06:33:24.765689 Return from get_group_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:785 to get_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:1536
2024-06-11 06:33:24.765733 Return from get_rank in /usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:1536 to get_tensor_model_parallel_rank in /usr/local/lib/python3.8/dist-packages/vllm/distributed/parallel_state.py:224
2024-06-11 06:33:24.765758 Return from get_tensor_model_parallel_rank in /usr/local/lib/python3.8/dist-packages/vllm/distributed/parallel_state.py:224 to weight_loader in /usr/local/lib/python3.8/dist-packages/vllm/model_executor/layers/linear.py:597

youkaichao commented 4 months ago

weight_loader in /usr/local/lib/python3.8/dist-packages/vllm/model_executor/layers/linear.py:597

which model do you serve? what's the size? is it downloaded or not? it seems your code is still loading the model.

zhaotyer commented 4 months ago

The model weights are actually loaded onto all four cards (see the attached screenshot).

youkaichao commented 4 months ago

what's your model size? it is possible that only parts of the model are loaded, and you need to wait for it to finish loading weights.

zhaotyer commented 4 months ago

weight_loader in /usr/local/lib/python3.8/dist-packages/vllm/model_executor/layers/linear.py:597

which model do you serve? what's the size? is it downloaded or not? it seems your code is still loading the model.

The model I use is Qwen1.5-72B-Chat; it has been downloaded locally.

youkaichao commented 4 months ago

72B can indeed take a long time to load. It is also possible that your disk read is slow.

zhaotyer commented 4 months ago

72B can indeed take a long time to load. It is also possible that your disk read is slow.

Loading this model with Hugging Face transformers only takes 4-5 minutes, so it's probably not the hard drive.

youkaichao commented 4 months ago

how do you load it using transformers?

zhaotyer commented 4 months ago

how do you load it using transformers?

The code is:

def huggingface_init(self, base_model_config: dict = {}, lora_model_config: dict = {}):
    import torch
    from transformers import AutoTokenizer, AutoModel, TextIteratorStreamer, AutoModelForCausalLM
    from vllm.config import _get_and_verify_max_len
    self._tokenizer = AutoTokenizer.from_pretrained(self.base_model_path, trust_remote_code=True)
    self._model = AutoModelForCausalLM.from_pretrained(self.base_model_path, **base_model_config)
    # When a LoRA adapter extends the tokenizer vocab, load self._tokenizer from the LoRA path
    # and resize the base model's token embeddings:
    # self._tokenizer = AutoTokenizer.from_pretrained(self.lora_model_path, trust_remote_code=True)
    # self._model.resize_token_embeddings(len(self._tokenizer))
    self._model_max_length = _get_and_verify_max_len(self._model.config, None)
    # Iterate over and load the PEFT adapters
    for index, peft_path in enumerate(self.peft_folders):
        peft_name = os.path.basename(peft_path)
        logger.info(f"load peft model,name:{peft_name}")
        if peft_name in self.sub_models:
            error_info = f"peft:{peft_name} has been loaded, loaded model is:{self.sub_models}"
            logger.error(error_info)
            raise Exception(error_info)
        self._model.load_adapter(peft_path, adapter_name=peft_name)
        self.sub_models[peft_name] = {"path": peft_path, "index": next(self.counter)}
        self._model.set_adapter(peft_name)
        if index == 0:
            self._is_lora_flag = True
        if index == len(self.peft_folders) - 1:
            self.default_adapter = peft_name

    logger.info("huggingface load model finished")

def vllm_init(self, base_model_config: dict = {}):
    from vllm.engine.arg_utils import AsyncEngineArgs
    from vllm.engine.async_llm_engine import AsyncLLMEngine
    from transformers import PreTrainedTokenizerBase
    self.lora_model_path = os.path.join(self.model_path, "local_model", "peft_model")
    self.vllm_model_path = self.base_model_path
    # Iterate over the PEFT adapters
    for index, peft_path in enumerate(self.peft_folders):
        self._is_lora_flag = True
        peft_name = os.path.basename(peft_path)
        self.sub_models[peft_name] = {"path":peft_path, "index":next(self.counter)}
        if index == len(self.peft_folders)-1:
            self.default_adapter = peft_name

    parser = argparse.ArgumentParser()
    parser = AsyncEngineArgs.add_cli_args(parser)
    args = parser.parse_args()
    args.model = self.vllm_model_path
    # Adjust according to model and GPU memory size
    args.gpu_memory_utilization = env_manager.gpu_memory_utilization
    cuda_env = env_manager.cuda_visible_devices
    if cuda_env is None:
        from torch.cuda import device_count
        args.tensor_parallel_size = device_count()
    else:
        args.tensor_parallel_size = len(cuda_env.split(",")) if cuda_env else 1
    args.trust_remote_code = base_model_config.get("trust_remote_code", False)
    # args.dtype = 'auto'
    args.enforce_eager = env_manager.enforce_eager
    args.max_log_len = 50
    args.enable_lora = self._is_lora_flag

    engine_args = AsyncEngineArgs.from_cli_args(args)
    self._model = AsyncLLMEngine.from_engine_args(engine_args)
    if isinstance(self._model.engine.tokenizer, PreTrainedTokenizerBase):
        self._tokenizer = self._model.engine.tokenizer
    else:
        self._tokenizer = self._model.engine.tokenizer.tokenizer
    engine_model_config = self._model.engine.get_model_config()
    self._model_max_length = engine_model_config.max_model_len

    # Counter to keep track of ongoing request counts
    self.ongoing_request_count = 0
    self._loop = asyncio.get_event_loop()
    self._loop_thread = Thread(
        target=self.engine_loop, args=(self._loop,)
    )
    self._shutdown_event = asyncio.Event()
    self._lock = asyncio.Lock()
    self._request_id_dict = {}
    self._loop_thread.start()
    logger.info("vllm load model finished")

youkaichao commented 4 months ago

does transformers load your model from disk to cpu or to gpu?

zhaotyer commented 4 months ago

does transformers load your model from disk to cpu or to gpu?

GPU. The configuration is exactly the same as for vllm.

youkaichao commented 4 months ago

if i remember correctly, transformers does not support tensor parallel. how can you hold the model with 72B parameters 😕

zhaotyer commented 4 months ago

if i remember correctly, transformers does not support tensor parallel. how can you hold the model with 72B parameters 😕

Setting device_map='auto' when calling AutoModelForCausalLM.from_pretrained(self.base_model_path, **base_model_config) lets it automatically split the model across the four cards via pipeline parallelism.
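
For reference, this is the usual Hugging Face device_map="auto" pattern being described here; a minimal sketch, with a placeholder model path:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_path = "/models/Qwen1.5-72B-Chat"  # placeholder path for illustration
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        device_map="auto",  # with accelerate installed, layers are spread across all visible GPUs
        trust_remote_code=True,
    )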

youkaichao commented 4 months ago

I believe the current vLLM implementation will load the model tensor_parallel times, so it is expected to take much longer. If you want to shorten this, you can take a look at the documentation https://docs.vllm.ai/en/latest/getting_started/examples/save_sharded_state.html

zhaotyer commented 4 months ago

I believe the current vLLM implementation will load the model tensor_parallel times, so it is expected to take much longer. If you want to shorten this, you can take a look at the documentation https://docs.vllm.ai/en/latest/getting_started/examples/save_sharded_state.html

It should have nothing to do with this. Now the weight of each shard is only 4G.

youkaichao commented 4 months ago

Now the weight of each shard is only 4G.

what is this 4G?

You have a 72B model, with 144GB disk file. You use tensor_parallel_size=4, which means 4 process will try to load the model together. In total you need to load 436GB data from disk to memory.

zhaotyer commented 4 months ago

Now the weight of each shard is only 4G.

what is this 4G?

You have a 72B model, with 144GB disk file. You use tensor_parallel_size=4, which means 4 process will try to load the model together. In total you need to load 436GB data from disk to memory.

(screenshot of the model weight files)

youkaichao commented 4 months ago

well, you have 38 files, each file has about 4 GB, in total summing up to about 144GB.

You use tensor_parallel_size=4, which means 4 process will try to load the model together. In total you need to load 436GB data from disk to memory.

I would say, slow loading in your case is expected. the documentation https://docs.vllm.ai/en/latest/getting_started/examples/save_sharded_state.html can help to shard the weight according to tensor parallel, so that later you only need to load the corresponding part of weight.
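
For reference, a condensed sketch of what the linked example does, assuming a vLLM version where load_format="sharded_state" is available (the exact, version-specific script is in the linked documentation):

    from vllm import LLM

    # One-time conversion: load the model once with the target tensor_parallel_size
    # and dump one shard per rank (paths here are placeholders).
    llm = LLM(model="/models/Qwen1.5-72B-Chat", tensor_parallel_size=4)
    llm.llm_engine.model_executor.save_sharded_state(path="/models/qwen-72b-tp4-sharded")

    # Later startups read only each rank's own shard instead of the full checkpoint,
    # cutting total disk reads by roughly a factor of tensor_parallel_size.
    llm = LLM(model="/models/qwen-72b-tp4-sharded",
              load_format="sharded_state",
              tensor_parallel_size=4)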

zhaotyer commented 4 months ago

I would say, slow loading in your case is expected. the documentation https://docs.vllm.ai/en/latest/getting_started/examples/save_sharded_state.html can help to shard the weight according to tensor parallel, so that later you only need to load the corresponding part of weight.

? In fact, vllm currently loads the weights in slices; it will not load 436G at the same time as you mentioned. It occupies at most 16G (4*4G) at a time, which has nothing to do with how the weights are split into blocks.

youkaichao commented 4 months ago

well, it will not load 436G at the same time, but in the end it has to load 436GB from disk in total... if your disk read speed is 200MB/s, then you need 2000s to just read from disk.
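
As a back-of-the-envelope check (the 144 GB checkpoint size is taken from the comments above; the bandwidth figures are illustrative):

    # ~144 GB of checkpoint files, read once per tensor-parallel worker.
    checkpoint_gb = 144
    tp_size = 4
    total_mb = checkpoint_gb * tp_size * 1024

    print(total_mb / 200, "s at 200 MB/s")  # ~2950 s (~49 min)
    print(total_mb / 100, "s at 100 MB/s")  # ~5900 s (~1.6 h)

At ~100 MB/s this lands in the same ballpark as the 6547 s load time reported in the log above.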

zhaotyer commented 4 months ago

well, it will not load 436G at the same time, but in the end it has to load 436GB from disk in total... if your disk read speed is 200MB/s, then you need 2000s to just read from disk.

The JuiceFS bandwidth we are using now is indeed only 100M/s, but I don't know why the GPU memory usage reaches 35316MiB on all 4 cards within 1 minute.

zhaotyer commented 4 months ago

torch.empty is used first to allocate the empty weight, then the checkpoint is loaded into the CPU and copied from the CPU into the previously allocated empty weight.
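
For reference, a minimal illustration of that pattern (not vLLM's actual code):

    import torch

    # 1) Allocate an uninitialized weight tensor directly on the GPU.
    gpu_weight = torch.empty(4096, 4096, dtype=torch.bfloat16, device="cuda")

    # 2) Read the checkpoint tensor into CPU memory (a stand-in tensor is used here).
    cpu_weight = torch.randn(4096, 4096, dtype=torch.bfloat16)

    # 3) Copy it from the CPU into the pre-allocated GPU tensor.
    gpu_weight.copy_(cpu_weight)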