vllm-project / vllm


[Bug]: Model Launch Hangs with 16+ Ranks in vLLM #5170

Open wushidonguc opened 1 month ago

wushidonguc commented 1 month ago

Your current environment

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.2.0-1018-aws-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A10G
GPU 1: NVIDIA A10G
GPU 2: NVIDIA A10G
GPU 3: NVIDIA A10G
GPU 4: NVIDIA A10G
GPU 5: NVIDIA A10G
GPU 6: NVIDIA A10G
GPU 7: NVIDIA A10G

Nvidia driver version: 535.161.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7R32
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
Stepping: 0
BogoMIPS: 5599.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 3 MiB (96 instances)
L1i cache: 3 MiB (96 instances)
L2 cache: 48 MiB (96 instances)
L3 cache: 384 MiB (24 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-47,96-143
NUMA node1 CPU(s): 48-95,144-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] triton==2.3.0
[pip3] vllm-nccl-cu12==2.18.1.0.4.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0   GPU1   GPU2   GPU3   GPU4   GPU5   GPU6   GPU7   CPU Affinity   NUMA Affinity   GPU NUMA ID
GPU0   X     PHB    PHB    PHB    PHB    PHB    PHB    PHB    0-191          0-1             N/A
GPU1   PHB    X     PHB    PHB    PHB    PHB    PHB    PHB    0-191          0-1             N/A
GPU2   PHB   PHB     X     PHB    PHB    PHB    PHB    PHB    0-191          0-1             N/A
GPU3   PHB   PHB    PHB     X     PHB    PHB    PHB    PHB    0-191          0-1             N/A
GPU4   PHB   PHB    PHB    PHB     X     PHB    PHB    PHB    0-191          0-1             N/A
GPU5   PHB   PHB    PHB    PHB    PHB     X     PHB    PHB    0-191          0-1             N/A
GPU6   PHB   PHB    PHB    PHB    PHB    PHB     X     PHB    0-191          0-1             N/A
GPU7   PHB   PHB    PHB    PHB    PHB    PHB    PHB     X     0-191          0-1             N/A

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

When launching a model with vLLM using 16 ranks (processes) across 2 nodes, the launch hangs indefinitely and never completes. Launching with 8 ranks across 2 nodes, however, works as expected. This blocks running large models that need 16 or more ranks to operate efficiently on multi-node clusters.

Steps to Reproduce:
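
Both steps below assume a two-node Ray cluster that was brought up roughly as follows. This is a minimal sketch: the port and the `<head-ip>` placeholder are illustrative, not taken from the actual setup.

```bash
# On the head node (hypothetical port; adjust to the actual setup):
ray start --head --port=6379

# On the second node, pointing at the head node's address:
ray start --address='<head-ip>:6379'

# Verify the cluster from either node:
ray status
```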

  1. 8 Ranks (Successful): On a Ray cluster with 8 GPUs across two AWS g5.12xlarge machines, confirm the cluster status:
    
    ubuntu@ubuntu:~ $ ray status
    ======== Autoscaler status: 2024-05-31 21:18:32.605238 ========
    Node status
    ---------------------------------------------------------------
    Active:
    1 node_1a93b656e273149f144a0ef03fad7c56dfdb0eb2005b289eee44914e
    1 node_bdda5bb6d33b5b29813769d6c4238d0e764aa1851f620abb31d4d3e0
    Pending:
    (no pending nodes)
    Recent failures:
    (no failures)

    Resources
    ---------------------------------------------------------------
    Usage:
     0.0/96.0 CPU
     0.0/8.0 GPU
     0B/235.77GiB memory
     0B/105.04GiB object_store_memory

    Demands:
     (no resource demands)

Launch the model with 8 ranks:

    python -m vllm.entrypoints.openai.api_server \
        --enforce-eager \
        --tensor-parallel-size 8 \
        --swap-space 16 \
        --gpu-memory-utilization=0.9 \
        --model meta-llama/Meta-Llama-3-70B \
        --disable-custom-all-reduce \
        --disable-log-requests


  2. 16 Ranks (Hangs): On a Ray cluster with 16 GPUs across two AWS g5.48xlarge machines, confirm the cluster status:

    ubuntu@ubuntu:~ $ ray status
    ======== Autoscaler status: 2024-05-31 20:52:39.099177 ========
    Node status
    ---------------------------------------------------------------
    Active:
     1 node_5bd8b16e07dc169bc0f7dcc901f47246c4f78a58775ff48675f298db
     1 node_023579a04c1324bb4e2cdb146f8be2a530b9e7b0d449f57d3e2e63bd
    Pending:
     (no pending nodes)
    Recent failures:
     (no failures)

    Resources
    ---------------------------------------------------------------
    Usage:
     0.0/384.0 CPU
     0.0/16.0 GPU
     0B/1.00TiB memory
     0B/372.53GiB object_store_memory

    Demands:
     (no resource demands)

Launch the model with 16 ranks:

    python -m vllm.entrypoints.openai.api_server \
        --enforce-eager \
        --tensor-parallel-size 16 \
        --swap-space 16 \
        --gpu-memory-utilization=0.9 \
        --model meta-llama/Meta-Llama-3-70B \
        --disable-custom-all-reduce \
        --disable-log-requests



**Expected Behavior:** The model should successfully launch and load across the specified 16 ranks.

**Impact:** This bug prevents launching and running large models that require a high number of ranks on multi-node clusters.

youkaichao commented 1 month ago

Please set the environment variable export VLLM_LOGGING_LEVEL=DEBUG to turn on more logging to help debug potential issues.

If you experience crashes or hangs, it is helpful to run vLLM with export VLLM_TRACE_FUNCTION=1. All function calls in vLLM will then be recorded; inspect these log files to tell which function crashes or hangs.

From issue templates.
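
For this issue, that would look roughly like the following for the hanging 16-rank launch (a sketch that only combines the two variables above with the command from the report):

```bash
# Enable verbose logging and per-rank function tracing, as suggested above,
# then re-run the hanging 16-rank configuration.
export VLLM_LOGGING_LEVEL=DEBUG
export VLLM_TRACE_FUNCTION=1

python -m vllm.entrypoints.openai.api_server \
    --enforce-eager \
    --tensor-parallel-size 16 \
    --swap-space 16 \
    --gpu-memory-utilization=0.9 \
    --model meta-llama/Meta-Llama-3-70B \
    --disable-custom-all-reduce \
    --disable-log-requests
```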

wushidonguc commented 1 month ago

I tried the above suggestions, and it seems this call causes the hang:

    2024-06-03 16:53:30.748291 Call to synchronize in /home/ubuntu/shared/vllm-main/main/lib/python3.10/site-packages/torch/cuda/streams.py:93 from __init__ in /home/ubuntu/shared/vllm-main/vllm/distributed/device_communicators/pynccl.py:101

After commenting out that line, the launch hangs again further on:

    2024-06-03 17:03:26.030671 Call to _compute_cos_sin_cache in /home/ubuntu/shared/vllm-main/vllm/model_executor/layers/rotary_embedding.py:85 from __init__ in /home/ubuntu/shared/vllm-main/vllm/model_executor/layers/rotary_embedding.py:66
    2024-06-03 17:03:26.030696 Call to _compute_inv_freq in /home/ubuntu/shared/vllm-main/vllm/model_executor/layers/rotary_embedding.py:70 from _compute_cos_sin_cache in /home/ubuntu/shared/vllm-main/vllm/model_executor/layers/rotary_embedding.py:87
    2024-06-03 17:03:26.030797 Call to __torch_function__ in /home/ubuntu/shared/vllm-main/main/lib/python3.10/site-packages/torch/utils/_device.py:74 from _compute_inv_freq in /home/ubuntu/shared/vllm-main/vllm/model_executor/layers/rotary_embedding.py:81
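
As a related sanity check (not something run in this thread), a bare 16-rank all-reduce outside of vLLM can help rule out cross-node NCCL or networking problems. The sketch below is hypothetical: the script name, rendezvous port, and `<head-ip>` placeholder are assumptions, and torchrun comes with the installed PyTorch 2.3.0.

```bash
# Hypothetical standalone check: 2 nodes x 8 GPUs = 16 ranks, no vLLM involved.
cat > nccl_check.py <<'EOF'
import os
import torch
import torch.distributed as dist

# torchrun sets RANK / WORLD_SIZE / LOCAL_RANK in the environment.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
x = torch.ones(1, device="cuda")
dist.all_reduce(x)  # a hang here points at NCCL/networking rather than vLLM
torch.cuda.synchronize()
print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce -> {x.item()}")
dist.destroy_process_group()
EOF

# On the head node (node_rank 0), then on the second node with --node_rank=1:
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 \
    --master_addr=<head-ip> --master_port=29500 nccl_check.py
```

If this completes on both nodes (each of the 16 ranks should print 16.0), the hang is more likely in vLLM's distributed setup than in the cluster fabric.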