vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Potential Hardware Failure when running vllm #7728

Closed: NicolasDrapier closed this issue 2 months ago

NicolasDrapier commented 2 months ago

Your current environment

PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: openSUSE Tumbleweed (x86_64)
GCC version: (SUSE Linux) 13.2.1 20240206 [revision 67ac78caf31f7cb3202177e6428a46d829b70f23]
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.39

Python version: 3.11.9 (main, Apr 08 2024, 06:18:15) [GCC] (64-bit runtime)
Python platform: Linux-6.8.5-1-default-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA L40S
GPU 1: NVIDIA L40S
GPU 2: NVIDIA L40S
GPU 3: NVIDIA L40S
GPU 4: NVIDIA L40S
GPU 5: NVIDIA L40S
GPU 6: NVIDIA L40S
GPU 7: NVIDIA L40S

Nvidia driver version: 550.67
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        52 bits physical, 57 bits virtual
Byte Order:                           Little Endian
CPU(s):                               48
On-line CPU(s) list:                  0-47
Vendor ID:                            AuthenticAMD
BIOS Vendor ID:                       Advanced Micro Devices, Inc.
Model name:                           AMD EPYC 9254 24-Core Processor
BIOS Model name:                      AMD EPYC 9254 24-Core Processor                 Unknown CPU @ 2.9GHz
BIOS CPU family:                      107
CPU family:                           25
Model:                                17
Thread(s) per core:                   1
Core(s) per socket:                   24
Socket(s):                            2
Stepping:                             1
Frequency boost:                      enabled
CPU(s) scaling MHz:                   40%
CPU max MHz:                          4151.7568
CPU min MHz:                          1500.0000
BogoMIPS:                             5793.30
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d debug_swap
Virtualization:                       AMD-V
L1d cache:                            1.5 MiB (48 instances)
L1i cache:                            1.5 MiB (48 instances)
L2 cache:                             48 MiB (48 instances)
L3 cache:                             256 MiB (8 instances)
NUMA node(s):                         2
NUMA node0 CPU(s):                    0-23
NUMA node1 CPU(s):                    24-47
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Mitigation; Safe RET
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] flashinfer==0.1.1+cu121torch2.3
[pip3] mypy-extensions==1.0.0
[pip3] mypy-protobuf==3.6.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.535.133
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.20
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pytorch-ranger==0.1.1
[pip3] pyzmq==26.0.0
[pip3] torch==2.4.0
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==2.4.0
[pip3] torchmetrics==1.3.2
[pip3] torchvision==0.19.0
[pip3] transformers==4.43.1
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.4@4db5176d9758b720b05460c50ace3c01026eb158
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PIX     SYS     SYS     SYS     SYS     SYS     SYS     0-23    0               N/A
GPU1    PIX      X      SYS     SYS     SYS     SYS     SYS     SYS     0-23    0               N/A
GPU2    SYS     SYS      X      PIX     SYS     SYS     SYS     SYS     0-23    0               N/A
GPU3    SYS     SYS     PIX      X      SYS     SYS     SYS     SYS     0-23    0               N/A
GPU4    SYS     SYS     SYS     SYS      X      PIX     SYS     SYS     24-47   1               N/A
GPU5    SYS     SYS     SYS     SYS     PIX      X      SYS     SYS     24-47   1               N/A
GPU6    SYS     SYS     SYS     SYS     SYS     SYS      X      PIX     24-47   1               N/A
GPU7    SYS     SYS     SYS     SYS     SYS     SYS     PIX      X      24-47   1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
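For reference, a topology matrix and legend like the one above can be regenerated at any time with the standard driver tool (a minimal sketch; run it on the host or in any container that has the driver mounted):

# print the GPU interconnect topology matrix and its legend
nvidia-smi topo -m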

How would you like to use vllm

Description:

I've noticed some strange behavior when using vllm. After running a few requests on the server, my hardware sometimes crashes completely: all four physical power supplies shut off.

I followed the hardware testing procedures as outlined in this guide, and all tests passed successfully.

# docker exec 37b808c71bf7 /usr/bin/dcgmi diag --run 4 --fail-early
Successfully ran diagnostic for group.

+--------------------------------+-----------------------------------------+
|           Diagnostic           |     Result                              |
+--------------------------------+-----------------------------------------+
|           Metadata             |                                         |
| DCGM Version                   | 3.3.7                                   |
| Driver Version Detected        | 550.67                                  |
| GPU Device IDs Detected        | 26b9,26b9,26b9,26b9,26b9,26b9,26b9,26b9 |
+--------------------------------+-----------------------------------------+
|           Deployment           |                                         |
| DenyList                       | Pass                                    |
| NVML Library                   | Pass                                    |
| CUDA Main Library              | Pass                                    |
| Permissions and OS Blocks      | Pass                                    |
| Persistence Mode               | Pass                                    |
| Environment Variables          | Pass                                    |
| Page Retirement/Row Remap      | Pass                                    |
| Graphics Processes             | Pass                                    |
| Inforom                        | Pass                                    |
+--------------------------------+-----------------------------------------+
|           Integration          |                                         |
| PCIe                           | Pass - All                              |
+--------------------------------+-----------------------------------------+
|           Hardware             |                                         |
| GPU Memory                     | Pass - All                              |
| Diagnostic                     | Pass - All                              |
| Pulse Test                     | Pass - All                              |
+--------------------------------+-----------------------------------------+
|           Stress               |                                         |
| Targeted Stress                | Pass - All                              |
| Targeted Power                 | Pass - All                              |
| Memory Bandwidth               | Pass - All                              |
| Memtest                        | Pass - All                              |
| EUV Test                       | Skip - All                              |
+--------------------------------+-----------------------------------------+
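Since the diagnostics pass but the failures happen under real load, one complementary check is to sample live power draw and temperature while reproducing the workload, so any spike just before a supply trips is captured. A minimal sketch, assuming the same DCGM container as above and that field IDs 155 (power usage, W) and 150 (GPU temperature, C) are valid on this DCGM version:

# stream per-GPU power draw and temperature once per second
docker exec 37b808c71bf7 /usr/bin/dcgmi dmon -e 155,150 -d 1000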

In other instances, not all four power supplies shut off, but one or two of them still physically turn off. I tried running the same workload with text-generation-inference, and while I did not encounter a full server crash, some of the power supplies still tripped.

The command:

docker run --rm -it --runtime nvidia \
-e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
-e NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
--gpus all \
-v /data/vllm/huggingface:/root/vllm/huggingface \
-v /data/models/mistral/mistral-large-instruct-2407:/root/data/mistral-large-instruct-2407 \
-p 8090:8000 \
--ipc=host \
--name vllm-mistral \
vllm/vllm-openai:v0.5.4 \
--model /root/data/mistral-large-instruct-2407 \
--served-model-name general \
--load-format safetensors \
--distributed-executor-backend ray \
--quantization fp8 \
--disable-custom-all-reduce \
--tensor-parallel-size 8 \
--trust-remote-code \
--max-num-seqs 256 \
--kv-cache-dtype fp8 \
--enable-chunked-prefill=False
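One way to test whether peak power draw is the trigger is to lower the per-GPU power limit before repeating the run; if the trips stop, total board power is the likely culprit. A minimal sketch using the standard driver tool (250 W is an arbitrary value below the L40S default; check nvidia-smi -q -d POWER for the supported range, and note that changing limits requires root):

# enable persistence mode and cap the power limit on all GPUs
sudo nvidia-smi -pm 1
sudo nvidia-smi -pl 250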

My Questions:

youkaichao commented 2 months ago

For hardware-related problems, the general answer is to ask your admin and vendor.

NicolasDrapier commented 2 months ago

Thank you @youkaichao for your answer.

As I mentioned in my initial post, I'm confident that the hardware isn't the issue: all stress tests ran for over 10 hours each without any problems (using gpu-burn and DCGM).
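For context, a minimal sketch of that kind of long stress run, assuming the commonly used wilicc/gpu-burn tool (the 10-hour duration here mirrors what is described above):

# build gpu-burn and stress all GPUs for ~10 hours (36000 seconds)
git clone https://github.com/wilicc/gpu-burn
cd gpu-burn && make
./gpu_burn 36000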

I’m simply asking whether a potentially corrupted file could be causing errors with vLLM, which might then trigger a kernel panic or something similar (I’m unable to pinpoint the exact trigger, and I’m not entirely certain that a kernel panic is actually the cause).

I suggest this because I’ve encountered this issue with mistralai/Mistral-Large-Instruct-2407 and meta-llama/Meta-Llama-3.1-70B-Instruct, but the problem does not occur with other models I use, such as Qwen/CodeQwen1.5-7B-Chat and microsoft/Phi-3-mini-128k-instruct.

Have there been any similar cases observed? Is there a protocol or method that I could follow to help identify what might be causing this issue?

youkaichao commented 2 months ago

I haven't seen similar cases, but you might be interested in https://docs.vllm.ai/en/latest/getting_started/debugging.html; it describes some tools you can play around with to find more clues.
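For example, a few of the environment variables described on that page can be set before launching the server (or passed to docker run with -e) to surface more information around the moment of the crash. A minimal sketch; the exact variable list may differ by vLLM version, so check the page for the current set:

# more verbose vLLM and NCCL logging, plus synchronous CUDA kernel launches
export VLLM_LOGGING_LEVEL=DEBUG
export CUDA_LAUNCH_BLOCKING=1
export NCCL_DEBUG=TRACE
export VLLM_TRACE_FUNCTION=1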

NicolasDrapier commented 2 months ago

After enabling much more detailed server logging, it turns out that the power supplies are undersized.
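When undersized power supplies are suspected, the BMC event log is usually where the trips show up, and a rough power budget makes the mismatch visible. A minimal sketch, assuming ipmitool is installed; the numbers are only illustrative (eight L40S at roughly 350 W each is about 2.8 kW for the GPUs alone, before CPUs, drives, and fans):

# list system event log entries and power-supply sensor readings via the BMC
ipmitool sel elist
ipmitool sdr type "Power Supply"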

Thank you for your help @youkaichao