vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: benchmark_serving.py generates different numbers of tokens at different runs #8531

Open LiuXiaoxuanPKU opened 6 days ago

LiuXiaoxuanPKU commented 6 days ago

Your current environment

4xH100.

Model Input Dumps

No response

🐛 Describe the bug

When benchmarking the performance of vLLM with benchmark_serving.py, it generates a different number of tokens on different runs.

Code to launch vllm server

vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
    --disable-log-requests \
    --tensor-parallel-size 4

Code to run the benchmark

python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --request-rate 1 \
    --num-prompts 200 \
    --save-result

If I run the benchmark_serving.py script twice, the number of generated tokens is different for the two runs. The output of the first run:

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  203.41
Total input tokens:                      42659
Total generated tokens:                  **38614**
Request throughput (req/s):              0.98
Output token throughput (tok/s):         189.84
Total Token throughput (tok/s):          399.56
---------------Time to First Token----------------
Mean TTFT (ms):                          62.95
Median TTFT (ms):                        64.68
P99 TTFT (ms):                           141.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          20.10
Median TPOT (ms):                        19.93
P99 TPOT (ms):                           24.28
---------------Inter-token Latency----------------
Mean ITL (ms):                           19.98
Median ITL (ms):                         19.60
P99 ITL (ms):                            44.31
==================================================

Total generated tokens: 38614
The output of the second run:

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  203.40
Total input tokens:                      42659
Total generated tokens:                  **38536**
Request throughput (req/s):              0.98
Output token throughput (tok/s):         189.46
Total Token throughput (tok/s):          399.20
---------------Time to First Token----------------
Mean TTFT (ms):                          60.23
Median TTFT (ms):                        64.19
P99 TTFT (ms):                           127.43
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          20.01
Median TPOT (ms):                        19.87
P99 TPOT (ms):                           22.67
---------------Inter-token Latency----------------
Mean ITL (ms):                           19.93
Median ITL (ms):                         19.57
P99 ITL (ms):                            43.91
==================================================

Total generated tokens: 38536. The discrepancy persists even if I relaunch the server before the second run.


ywang96 commented 6 days ago

Can you save the JSON results and check which individual requests have a different number of output tokens? We should be able to inspect the generated text too.
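A minimal sketch of that comparison, assuming both runs were saved with `--save-result` and that the saved JSON contains a per-request `output_lens` list (the field name used by benchmark_serving.py's result dump at the time of writing; check the keys in your actual files, as the schema may differ):

```python
import json
import sys


def diff_requests(result_a: dict, result_b: dict) -> list[int]:
    """Return indices of requests whose output token counts differ between two runs."""
    lens_a = result_a["output_lens"]
    lens_b = result_b["output_lens"]
    return [i for i, (a, b) in enumerate(zip(lens_a, lens_b)) if a != b]


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage: python diff_runs.py run1.json run2.json
    with open(sys.argv[1]) as f:
        run1 = json.load(f)
    with open(sys.argv[2]) as f:
        run2 = json.load(f)
    for i in diff_requests(run1, run2):
        # If the JSON also stores generated text per request (e.g. a
        # "generated_texts" list -- a hypothetical field name here),
        # print it as well to see where the outputs diverge.
        print(f"request {i}: {run1['output_lens'][i]} vs {run2['output_lens'][i]} tokens")
```

Requests are issued in the same order on both runs (same dataset, same seed for prompt sampling), so matching by index should line up the same prompts.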