vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Performance] TTFT regression from v0.5.4 to 0.6.2 #8918

Open rickyyx opened 2 hours ago

rickyyx commented 2 hours ago

Your current environment

The output of `python collect_env.py`:

```text
Collecting environment information...
WARNING 09-27 15:24:15 _custom_ops.py:15] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
/home/ray/default/vllm/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash: No module named 'vllm.commit_id'
  from vllm.version import __version__ as VLLM_VERSION
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.3
Libc version: glibc-2.35

Python version: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.5.0-1022-aws-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
GPU 2: NVIDIA A100-SXM4-40GB
GPU 3: NVIDIA A100-SXM4-40GB
GPU 4: NVIDIA A100-SXM4-40GB
GPU 5: NVIDIA A100-SXM4-40GB
GPU 6: NVIDIA A100-SXM4-40GB
GPU 7: NVIDIA A100-SXM4-40GB

Nvidia driver version: 535.183.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             96
On-line CPU(s) list:                0-95
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
CPU family:                         6
Model:                              85
Thread(s) per core:                 2
Core(s) per socket:                 24
Socket(s):                          2
Stepping:                           7
BogoMIPS:                           5999.99
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          1.5 MiB (48 instances)
L1i cache:                          1.5 MiB (48 instances)
L2 cache:                           48 MiB (48 instances)
L3 cache:                           71.5 MiB (2 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-23,48-71
NUMA node1 CPU(s):                  24-47,72-95
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit:        KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                 Mitigation; PTE Inversion
Vulnerability Mds:                  Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:             Vulnerable
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Vulnerable
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Retpoline
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.1
[pip3] triton==3.0.0
[conda] numpy                     1.26.4          pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5          pypi_0    pypi
[conda] pyzmq                     26.2.0          pypi_0    pypi
[conda] torch                     2.4.0           pypi_0    pypi
[conda] torchvision               0.19.0          pypi_0    pypi
[conda] transformers              4.45.1          pypi_0    pypi
[conda] triton                    3.0.0           pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.4
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    0-23,48-71      0               N/A
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    0-23,48-71      0               N/A
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    0-23,48-71      0               N/A
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    0-23,48-71      0               N/A
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    24-47,72-95     1               N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    24-47,72-95     1               N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    24-47,72-95     1               N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      24-47,72-95     1               N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

Model Input Dumps

V0.6.2 engine args:

```text
Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='meta-llama/Llama-2-7b-chat-hf', speculative_config=None, tokenizer='meta-llama/Llama-2-7b-chat-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Llama-2-7b-chat-hf, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
```

V0.5.4 engine args:

```text
Initializing an LLM engine (v0.5.4) with config: model='meta-llama/Llama-2-7b-chat-hf', speculative_config=None, tokenizer='meta-llama/Llama-2-7b-chat-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=meta-llama/Llama-2-7b-chat-hf, use_v2_block_manager=False, enable_prefix_caching=False)
```
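The only differences visible between the two config lines are fields that are new in v0.6.x (e.g. num_scheduler_steps, multi_step_stream_outputs, use_async_output_proc, use_cached_outputs). A quick way to diff the two lines is sketched below; the log file names are placeholders for wherever the pasted lines are saved.

```python
# Minimal sketch (not part of the benchmark scripts): diff the two
# "Initializing an LLM engine ... with config:" lines pasted above.
def parse_config(line: str) -> dict:
    """Parse 'key=value, key=value, ...' after 'config:', ignoring commas inside parentheses."""
    body = line.split("config:", 1)[1]
    parts, depth, cur = [], 0, ""
    for ch in body:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(cur)
            cur = ""
        else:
            cur += ch
    parts.append(cur)
    pairs = (p.strip().split("=", 1) for p in parts if "=" in p)
    return {k: v for k, v in pairs}

# Placeholder file names: each file holds one of the pasted engine-args lines.
old = parse_config(open("engine_args_v054.log").read())
new = parse_config(open("engine_args_v062.log").read())

for key in sorted(old.keys() | new.keys()):
    if old.get(key) != new.get(key):
        print(f"{key}: {old.get(key, '<absent>')} -> {new.get(key, '<absent>')}")
```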

🐛 Describe the bug

TLDR

We are seeing a TTFT regression when upgrading from v0.5.4 to v0.6.2 on a low-QPS / small-batch-size workload: roughly 15% to 30% higher TTFT on multiple GPUs (A10G, A100) with a small model such as llama2-7b-chat-hf.

We did not run further tests on other models or hardware, and the benchmarks were mostly run with default args:

With benchmark_latency.py

We proxy TTFT without an OpenAI server by running benchmark_latency.py with output_len=1.

Example command:

python benchmark_latency.py   \
--model meta-llama/Llama-2-7b-chat-hf   \
--tensor-parallel-size 1   \
--input-len 512   --output-len 1  \
--batch-size 1   \
--num-iters-warmup 30   \
--num-iters 100
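
For reference, the same idea can be approximated with the offline vLLM Python API; a rough sketch is below. The prompt is illustrative and only roughly matches the 512-token input, and the wall-clock numbers include Python overhead that benchmark_latency.py does not have, so treat it as a sanity check rather than a replacement.

```python
# Rough offline TTFT proxy (same idea as benchmark_latency.py with output_len=1):
# time a generate() call that produces exactly one output token.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", tensor_parallel_size=1)
params = SamplingParams(max_tokens=1, ignore_eos=True)

# Illustrative prompt; repetition count is a guess at roughly ~512 tokens.
prompts = ["Hello world! " * 170]

# Warmup iterations, then timed iterations (mirroring the command above).
for _ in range(30):
    llm.generate(prompts, params)

latencies = []
for _ in range(100):
    start = time.perf_counter()
    llm.generate(prompts, params)
    latencies.append(time.perf_counter() - start)

print(f"avg latency (s): {sum(latencies) / len(latencies):.4f}")
```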

On A10G (avg latency)

On A100:

With benchmark_serving.py

We also benchmarked the OpenAI server with ShareGPT at a fixed seed and QPS=1. (We report stats for 30 requests, but we also ran with more requests and the metrics show rather low variance.)

Server Command:

vllm serve meta-llama/Llama-2-7b-chat-hf --swap-space 16 --disable-log-requests

Client Command:

python benchmarks/benchmark_serving.py --model meta-llama/Llama-2-7b-chat-hf --dataset-path /tmp/ShareGPT_V3_unfiltered_cleaned_split.json --request-rate 1 --num-prompts 30 --seed 10
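
As an extra sanity check independent of benchmark_serving.py, server-side TTFT can be spot-checked by timing the first streamed chunk from the OpenAI-compatible endpoint. A minimal sketch, assuming the server runs at the default localhost:8000:

```python
# Client-side TTFT spot check against the OpenAI-compatible server started above.
# Assumes the default host/port (localhost:8000); adjust if the server runs elsewhere.
import time
import requests

payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "prompt": "Summarize the benefits of unit testing.",
    "max_tokens": 64,
    "stream": True,
}

start = time.perf_counter()
with requests.post("http://localhost:8000/v1/completions", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:  # first non-empty SSE line corresponds to the first generated token
            ttft = time.perf_counter() - start
            print(f"TTFT: {ttft * 1000:.2f} ms")
            break
```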

On v0.5.4 (✅)

============ Serving Benchmark Result ============
Successful requests:                     30        
Benchmark duration (s):                  31.64     
Total input tokens:                      9643      
Total generated tokens:                  4572      
Request throughput (req/s):              0.95      
Input token throughput (tok/s):          304.76    
Output token throughput (tok/s):         144.49    
---------------Time to First Token----------------
Mean TTFT (ms):                          28.01     
Median TTFT (ms):                        27.01     
P99 TTFT (ms):                           78.94     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.90     
Median TPOT (ms):                        13.80     
P99 TPOT (ms):                           17.90     
---------------Inter-token Latency----------------
Mean ITL (ms):                           14.10     
Median ITL (ms):                         13.54     
P99 ITL (ms):                            36.50     
==================================================

On v0.6.2 (⚠️)

============ Serving Benchmark Result ============
Successful requests:                     30        
Benchmark duration (s):                  31.12     
Total input tokens:                      9643      
Total generated tokens:                  4572      
Request throughput (req/s):              0.96      
Output token throughput (tok/s):         146.89    
Total Token throughput (tok/s):          456.71    
---------------Time to First Token----------------
Mean TTFT (ms):                          29.86     
Median TTFT (ms):                        31.47     
P99 TTFT (ms):                           68.44     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.80     
Median TPOT (ms):                        12.91     
P99 TPOT (ms):                           13.88     
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.07     
Median ITL (ms):                         12.66     
P99 ITL (ms):                            31.47     
==================================================
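
Quick arithmetic on the two tables above (derived from the reported numbers, not an additional measurement): median TTFT goes up by roughly 16% while TPOT actually improves.

```python
# Relative deltas computed directly from the two serving benchmark tables above.
def pct_change(old, new):
    return (new - old) / old * 100

print(f"mean TTFT:   {pct_change(28.01, 29.86):+.1f}%")   # ~ +6.6%
print(f"median TTFT: {pct_change(27.01, 31.47):+.1f}%")   # ~ +16.5%
print(f"P99 TTFT:    {pct_change(78.94, 68.44):+.1f}%")   # ~ -13.3%
print(f"mean TPOT:   {pct_change(13.90, 12.80):+.1f}%")   # ~ -7.9% (TPOT improves)
```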


rickyyx commented 2 hours ago

cc @comaniac @simon-mo @KuntaiDu. Let me know if there is any more info you need; I'm happy to look into this as well.

comaniac commented 2 hours ago

@KuntaiDu v0.5.4 was released on Aug 5th. Do we have TTFT data from around that time so that we can compare directly against the nightly benchmark results here?