[Performance]: vllm量化和非量化的性能对比

xllrun commented 3 months ago

Proposal to improve performance

No response

Report of performance regression

No response

Misc discussion on performance

使用qwen32b模型，输入tokens 3000，输出tokens 50左右，在单张A800并发量在11，最大响应时间达到10s，使用awq将qwen32b模型量化为4bit,输入tokens 3000，输出tokens 50左右，无论是显卡利用率设置为0.5还是0.95在单张A800并发量也是在11，最大响应时间达到10s，想知道这是为什么，是什么参数设置不对？理论上量化后显存占用小了，并发数会增加啊

Your current environment (if you think it is necessary)

Collecting environment information... PyTorch version: 2.2.2+cpu Is debug build: False CUDA used to build PyTorch: Could not collect ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64) GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 Clang version: Could not collect CMake version: version 3.16.3 Libc version: glibc-2.31

Python version: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime) Python platform: Linux-5.15.0-1050-azure-x86_64-with-glibc2.31 Is CUDA available: False CUDA runtime version: 12.2.140 CUDA_MODULE_LOADING set to: N/A GPU models and configuration: GPU 0: NVIDIA A100 80GB PCIe Nvidia driver version: 550.54.15 cuDNN version: Could not collect HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True

CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 48 bits physical, 48 bits virtual CPU(s): 24 On-line CPU(s) list: 0-23 Thread(s) per core: 1 Core(s) per socket: 24 Socket(s): 1 NUMA node(s): 1 Vendor ID: AuthenticAMD CPU family: 25 Model: 1 Model name: AMD EPYC 7V13 64-Core Processor Stepping: 1 CPU MHz: 2445.438 BogoMIPS: 4890.87 Hypervisor vendor: Microsoft Virtualization type: full L1d cache: 768 KiB L1i cache: 768 KiB L2 cache: 12 MiB L3 cache: 96 MiB NUMA node0 CPU(s): 0-23 Vulnerability Gather data sampling: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode Vulnerability Spec store bypass: Vulnerable Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm

Versions of relevant libraries: [pip3] numpy==1.26.4 [pip3] torch==2.2.2+cpu [pip3] transformers==4.44.0 [conda] numpy 1.26.4 pypi_0 pypi [conda] torch 2.2.2+cpu pypi_0 pypi [conda] transformers 4.44.0 pypi_0 pypi ROCM Version: Could not collect Neuron SDK Version: N/A vLLM Version: N/A vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled GPU Topology: GPU0 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID GPU0 X SYS 0-23 0 N/A NIC0 SYS X

Legend:

X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_an0

jyyang621 commented 3 months ago

能问一下怎么启的量化模型吗

jyyang621 commented 3 months ago

能问一下怎么启的量化模型吗

xllrun commented 3 months ago

能问一下怎么启的量化模型吗

python -m vllm.entrypoints.openai.api_server --model "**" --served-model-name "" --gpu_memory_utilization 0.5 --quantization awq

TragedyN commented 3 months ago

并发量也和max_num_batched_tokens设置有关

github-actions[bot] commented 3 days ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

vllm-project / vllm