vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Pending but Avg generation throughput: 0.0 tokens/s #5267

Open hitsz-zxw opened 1 month ago

hitsz-zxw commented 1 month ago

Your current environment

PyTorch version: 2.1.2+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.0
Libc version: glibc-2.35

Python version: 3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-94-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version: 525.147.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7542 32-Core Processor
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 2
Stepping: 0
Frequency boost: enabled
CPU max MHz: 2900.0000
CPU min MHz: 1500.0000
BogoMIPS: 5788.97
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization: AMD-V
L1d cache: 2 MiB (64 instances)
L1i cache: 2 MiB (64 instances)
L2 cache: 32 MiB (64 instances)
L3 cache: 256 MiB (16 instances)
NUMA node(s): 8
NUMA node0 CPU(s): 0-7,64-71
NUMA node1 CPU(s): 8-15,72-79
NUMA node2 CPU(s): 16-23,80-87
NUMA node3 CPU(s): 24-31,88-95
NUMA node4 CPU(s): 32-39,96-103
NUMA node5 CPU(s): 40-47,104-111
NUMA node6 CPU(s): 48-55,112-119
NUMA node7 CPU(s): 56-63,120-127
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu11==2.14.3
[pip3] nvidia-nccl-cu12==2.18.1
[pip3] torch==2.1.2+cu118
[pip3] torchaudio==2.1.2+cu118
[pip3] torchvision==0.16.2+cu118
[pip3] transformers==4.39.3
[pip3] transformers-stream-generator==0.0.5
[pip3] triton==2.1.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-nccl-cu11 2.14.3 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.18.1 pypi_0 pypi
[conda] torch 2.1.2+cu118 pypi_0 pypi
[conda] torchaudio 2.1.2+cu118 pypi_0 pypi
[conda] torchvision 0.16.2+cu118 pypi_0 pypi
[conda] transformers 4.39.3 pypi_0 pypi
[conda] transformers-stream-generator 0.0.5 pypi_0 pypi
[conda] triton 2.1.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  CPU Affinity    NUMA Affinity
GPU0  X     NV12  NV12  NV12  NV12  NV12  NV12  NV12  SYS   24-31,88-95     3
GPU1  NV12  X     NV12  NV12  NV12  NV12  NV12  NV12  SYS   24-31,88-95     3
GPU2  NV12  NV12  X     NV12  NV12  NV12  NV12  NV12  PXB   8-15,72-79      1
GPU3  NV12  NV12  NV12  X     NV12  NV12  NV12  NV12  PXB   8-15,72-79      1
GPU4  NV12  NV12  NV12  NV12  X     NV12  NV12  NV12  SYS   56-63,120-127   7
GPU5  NV12  NV12  NV12  NV12  NV12  X     NV12  NV12  SYS   56-63,120-127   7
GPU6  NV12  NV12  NV12  NV12  NV12  NV12  X     NV12  SYS   40-47,104-111   5
GPU7  NV12  NV12  NV12  NV12  NV12  NV12  NV12  X     SYS   40-47,104-111   5
NIC0  SYS   SYS   PXB   PXB   SYS   SYS   SYS   SYS   X

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_bond_0

๐Ÿ› Describe the bug

I deployed my offline model using the API server described in the documentation. When I call the function below, the server receives the requests but never returns any content, and the call eventually times out. I would like to know why.

import json

import requests

# api_url points at the completions endpoint of the deployed server (defined elsewhere).

def generate_text(prompt):
    print(prompt)

    data = {
        "model": "******",           # served model name (redacted)
        "prompt": prompt,
        "max_tokens": 1000,
        "temperature": 0.8,
        "repetition_penalty": 1.1,
        "top_k": -1,
        "top_p": 0.8,
        "n": 3,                      # ask for three completions per prompt
    }
    headers = {"Content-Type": "application/json"}
    response = requests.post(api_url, headers=headers, json=data, stream=True, timeout=60)
    # Parse the three returned completions from the "choices" list.
    response_data = json.loads(response.content)
    response_content1 = response_data["choices"][0]["text"]
    response_content2 = response_data["choices"][1]["text"]
    response_content3 = response_data["choices"][2]["text"]
    print(response_content1)
    print(response_content2)
    print(response_content3)
    return response_content1, response_content2, response_content3
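
For reference, this is roughly how the function is driven. The endpoint address here is a hypothetical placeholder, not the actual deployment URL; the timeout handling simply surfaces the failure mode described above.

# Hypothetical endpoint; replace with the address the server was actually started on.
api_url = "http://localhost:8000/v1/completions"

if __name__ == "__main__":
    try:
        a, b, c = generate_text("Write a one-sentence summary of vLLM.")
    except requests.exceptions.Timeout:
        # The request is accepted (it shows up as Pending in the server log),
        # but no tokens ever come back, so the 60 s client timeout fires.
        print("Request timed out after 60 s")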

The following is the log content:

INFO 06-05 01:14:53 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 39 reqs, GPU KV cache usage: 92.8%, CPU KV cache usage: 0.0%
INFO 06-05 01:15:00 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 39 reqs, GPU KV cache usage: 92.8%, CPU KV cache usage: 0.0%
INFO 06-05 01:16:00 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 39 reqs, GPU KV cache usage: 92.8%, CPU KV cache usage: 0.0%
INFO 06-05 01:17:00 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 39 reqs, GPU KV cache usage: 92.8%, CPU KV cache usage: 0.0%
INFO 06-05 01:18:00 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 39 reqs, GPU KV cache usage: 92.8%, CPU KV cache usage: 0.0%
INFO 06-05 01:19:00 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 39 reqs, GPU KV cache usage: 92.8%, CPU KV cache usage: 0.0%
INFO 06-05 01:20:00 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 39 reqs, GPU KV cache usage: 92.8%, CPU KV cache usage: 0.0%
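
One way to watch this stuck state from the client side is to poll the server's Prometheus metrics. The sketch below assumes the OpenAI-compatible server is exposing a /metrics endpoint at a hypothetical local address, and that this vLLM version uses the vllm:num_requests_waiting and vllm:gpu_cache_usage_perc metric names; both the address and the metric names are assumptions, not taken from this issue.

import time
import requests

# Hypothetical server address; availability of /metrics and exact metric names
# may differ depending on the vLLM version and how the server was launched.
METRICS_URL = "http://localhost:8000/metrics"

def watch_queue(interval_s=10):
    """Periodically print the pending-request count and GPU KV cache usage."""
    while True:
        text = requests.get(METRICS_URL, timeout=5).text
        for line in text.splitlines():
            if line.startswith(("vllm:num_requests_waiting", "vllm:gpu_cache_usage_perc")):
                print(line)
        time.sleep(interval_s)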

Suvralipi commented 1 month ago

I am getting a similar error; the model is hosted using KServe + vLLM. It worked for opt-125m, but after switching to a 10B model it produces the output below.

INFO 06-07 12:24:29 async_llm_engine.py:117] Received request 4191e1b08e1a48a890dd1d07e55f10ae: prompt: 'Triton 추론 서버란 무엇입니까?', sampling params: SamplingParams(n=2, best_of=2, presence_penalty=0.0, frequency_penalty=0.0, temperature=0.0, top_p=1.0, top_k=-1, use_beam_search=True, stop=[], ignore_eos=False, max_tokens=500, logprobs=None), prompt token ids: None.
INFO 06-07 12:24:29 llm_engine.py:394] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 06-07 12:24:34 llm_engine.py:394] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 108.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%
INFO 06-07 12:24:38 async_llm_engine.py:171] Finished request 4191e1b08e1a48a890dd1d07e55f10ae.
INFO: 127.0.0.6:0 - "POST /generate HTTP/1.1" 200 OK

Used the /generate API as per the documentation:

token = os.environ['AUTH_TOKEN']
headers = {"Authorization": f"Bearer {token}"}
pload = {
    "prompt": prompt,
    "n": 2,
    "use_beam_search": True,
    "temperature": 0.0,
    "max_tokens": 500,
    "stream": False,
}
response = requests.post(URL, headers=headers, json=pload)
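
For completeness, a self-contained version of that call with response handling is sketched below. The URL is a hypothetical placeholder for the KServe inference service, and the parsing assumes the response shape of the plain vLLM demo api_server's /generate endpoint ({"text": [...]} when stream is false); a KServe wrapper may return a different shape.

import os
import requests

# Hypothetical values; the real URL and AUTH_TOKEN depend on the KServe deployment.
URL = "http://<inference-service-host>/generate"
prompt = "Triton 추론 서버란 무엇입니까?"

token = os.environ["AUTH_TOKEN"]
headers = {"Authorization": f"Bearer {token}"}
pload = {
    "prompt": prompt,
    "n": 2,
    "use_beam_search": True,
    "temperature": 0.0,
    "max_tokens": 500,
    "stream": False,
}
response = requests.post(URL, headers=headers, json=pload, timeout=120)
response.raise_for_status()
# Assumed response shape: one entry per returned sequence under the "text" key.
for i, text in enumerate(response.json().get("text", [])):
    print(f"--- candidate {i} ---")
    print(text)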