vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: command r server hangs randomly with no error #7913

Closed · nivibilla closed this issue 2 months ago

nivibilla commented 2 months ago

Your current environment

The output of `python collect_env.py`:

```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35
Python version: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-1065-aws-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA L4
GPU 1: NVIDIA L4
GPU 2: NVIDIA L4
GPU 3: NVIDIA L4
GPU 4: NVIDIA L4
GPU 5: NVIDIA L4
GPU 6: NVIDIA L4
GPU 7: NVIDIA L4
Nvidia driver version: 535.161.07
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7R13 Processor
CPU family: 25
Model: 1
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
Stepping: 1
BogoMIPS: 5299.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 3 MiB (96 instances)
L1i cache: 3 MiB (96 instances)
L2 cache: 48 MiB (96 instances)
L3 cache: 384 MiB (12 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-47,96-143
NUMA node1 CPU(s): 48-95,144-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.5
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.555.43
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.5.82
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] optree==0.12.1
[pip3] pyzmq==23.2.0
[pip3] sentence-transformers==2.7.0
[pip3] torch==2.4.0
[pip3] torcheval==0.0.7
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.5@09c7792610ada9f88bbf87d32b472dd44bf23cc2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity   NUMA Affinity  GPU NUMA ID
GPU0   X    NODE  NODE  NODE  SYS   SYS   SYS   SYS   0-47,96-143    0              N/A
GPU1  NODE   X    NODE  NODE  SYS   SYS   SYS   SYS   0-47,96-143    0              N/A
GPU2  NODE  NODE   X    NODE  SYS   SYS   SYS   SYS   0-47,96-143    0              N/A
GPU3  NODE  NODE  NODE   X    SYS   SYS   SYS   SYS   0-47,96-143    0              N/A
GPU4  SYS   SYS   SYS   SYS    X    NODE  NODE  NODE  48-95,144-191  1              N/A
GPU5  SYS   SYS   SYS   SYS   NODE   X    NODE  NODE  48-95,144-191  1              N/A
GPU6  SYS   SYS   SYS   SYS   NODE  NODE   X    NODE  48-95,144-191  1              N/A
GPU7  SYS   SYS   SYS   SYS   NODE  NODE  NODE   X    48-95,144-191  1              N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

🐛 Describe the bug

The server runs fine for the first 100 or so requests and then hangs at random, with no error message:

INFO 08-27 13:53:13 async_llm_engine.py:208] Added request chat-a9ac2e0376b44cda9aeb5f4635d2fa12.
INFO 08-27 13:53:13 async_llm_engine.py:208] Added request chat-d13358ef80e04122a1171f9b7419dad4.
INFO 08-27 13:53:13 async_llm_engine.py:208] Added request chat-fb808f8709124805ad06c470340557b1.
INFO 08-27 13:53:13 async_llm_engine.py:208] Added request chat-30c54579ff5441ddb0a3b0166c19f828.
INFO 08-27 13:53:13 async_llm_engine.py:208] Added request chat-43d885b1665d46e4a43c3d86b48cd737.
INFO 08-27 13:53:13 async_llm_engine.py:208] Added request chat-4ab0462d1c4d41e7abab908eee2b2355.
INFO 08-27 13:53:13 async_llm_engine.py:208] Added request chat-7951c7896aee40d3a1d4d97f8545d3a0.
INFO 08-27 13:53:13 async_llm_engine.py:208] Added request chat-29b49e4d8e0549d0acfb416b1f29873a.
INFO 08-27 13:53:14 metrics.py:351] Avg prompt throughput: 9293.3 tokens/s, Avg generation throughput: 30.9 tokens/s, Running: 21 reqs, Swapped: 0 reqs, Pending: 11 reqs, GPU KV cache usage: 3.3%, CPU KV cache usage: 0.0%.
INFO 08-27 13:53:14 metrics.py:367] Prefix cache hit rate: GPU: 91.31%, CPU: 0.00%
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-2b667d636e4a4fb7a62c259b15b1a87f.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-98b529a937b84b52a18e1a9bb2b7da24.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-956c82b1642046c689227962a69fb271.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-e4a09779573c420c989435a04ef04c28.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-5a4a21c3c5ef4107afa62baaf2b1acba.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-32dc68ad3cc540ad8610bf3c3d867140.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-24754c22d44f4373bdd08e8e3598182c.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-2a16a58d56314a30a06ee75b8a16cb1d.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-635b2aa24e274c65b901a381181a1bc7.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-a83ff738e3ac44a695b4fafa3226b0a8.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-06cf60bd33d644ab82367c2e67c15709.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-98f9e60e027d4546bed56424f4c7ca89.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-f97177ca26e04e7bbc614d1bc6703780.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-c89ef7b4ef734d548a2ab5f3263eb78b.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-70d470118acc4a60ad7426878ac42513.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-cb5d01623ec941b1bace93f993dffea1.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-f7951b91a5734a7f991b6131818cef8a.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-5d4989dd21644d6e8a259fcf6953d179.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-e3987bc76aaa4aaa9303607bd04de854.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-ed730dad1e724cee84dc47076bda0e2f.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-9c50a56a6a7f47ea863b349e10b3a20c.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-a9ac2e0376b44cda9aeb5f4635d2fa12.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-d13358ef80e04122a1171f9b7419dad4.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-fb808f8709124805ad06c470340557b1.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-30c54579ff5441ddb0a3b0166c19f828.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-43d885b1665d46e4a43c3d86b48cd737.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-4ab0462d1c4d41e7abab908eee2b2355.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-7951c7896aee40d3a1d4d97f8545d3a0.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-29b49e4d8e0549d0acfb416b1f29873a.
INFO:     10.168.91.232:43680 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43694 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43712 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43736 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43746 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43754 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43770 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43772 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43778 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43794 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43796 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43806 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43812 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43816 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43828 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43834 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43844 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43858 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43862 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43890 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43892 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43900 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43910 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43922 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43928 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43936 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43942 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43958 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-ff4b802b30b04f8a91c44a30e0f46af8.
INFO 08-27 13:53:15 async_llm_engine.py:176] Finished request chat-467706a6aa83458fac38f4b5c5ac7767.
INFO:     10.168.91.232:43722 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.168.91.232:43876 - "POST /v1/chat/completions HTTP/1.1" 200 OK
2024-08-27 13:53:17.570305: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-27 13:53:17.646091: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-27 13:53:17.670085: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-27 13:53:17.675796: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-27 13:53:17.681875: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-27 13:53:17.712359: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-27 13:53:17.713148: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-27 13:53:17.713188: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


nivibilla commented 2 months ago

Not sure what the issue was, but using Ray as the multiprocessing backend seems to fix it. I've seen this issue across models, where I get the 'background loop has errored' or 'there appears to be 1 leaked ....' errors when not using Ray. @njhill tagging you in case it's useful.
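For anyone hitting the same hang: the workaround is just the executor-backend flag. A minimal launch sketch, where the model name, port, and tensor-parallel size are placeholders rather than the exact command from this issue:

```shell
# Workaround sketch: force vLLM to use Ray instead of the default
# multiprocessing backend for the tensor-parallel workers.
# <your-model> and the sizes below are placeholders.
python -m vllm.entrypoints.openai.api_server \
    --model <your-model> \
    --tensor-parallel-size 8 \
    --port 8000 \
    --distributed-executor-backend ray
```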

youkaichao commented 2 months ago

2024-08-27 13:53:17.570305: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.

Why does the log mention TensorFlow?

nivibilla commented 2 months ago

@youkaichao not sure, some issue with Databricks. I've seen this happen every time PyTorch is loaded, which raises the question of why PyTorch is loaded multiple times while the server is running.
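For what it's worth, the repeated banners are consistent with how spawned worker processes behave: a freshly spawned child interpreter does not inherit the parent's imported modules, so any library imported during worker startup runs its import-time logging again in every worker. A minimal sketch (not vLLM code) demonstrating that a fresh interpreter starts with a clean module table:

```python
import subprocess
import sys

# Sketch (not vLLM code): a freshly started interpreter, like a spawned
# worker process, does NOT inherit the parent's imported modules. Heavy
# libraries (torch, or apparently tensorflow on this Databricks image)
# therefore get imported -- and print their import-time banners -- once
# per worker.
child_output = subprocess.run(
    [sys.executable, "-c", "import sys; print('torch' in sys.modules)"],
    capture_output=True,
    text=True,
).stdout.strip()

print(child_output)  # -> False: the child begins with no torch loaded
```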

youkaichao commented 2 months ago

Forwarding to the Databricks folks.