Your current environment

4xH100.

Model Input Dumps

No response

🐛 Describe the bug

When benchmarking the performance of vLLM with benchmark_serving.py, it generates a different number of tokens on different runs.

Code to launch vllm server

Code to run the benchmark

If I run the benchmark_serving.py script twice, the number of generated tokens differs between the two runs.

The output of the first run:
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 203.41
Total input tokens: 42659
Total generated tokens: **38614**
Request throughput (req/s): 0.98
Output token throughput (tok/s): 189.84
Total Token throughput (tok/s): 399.56
---------------Time to First Token----------------
Mean TTFT (ms): 62.95
Median TTFT (ms): 64.68
P99 TTFT (ms): 141.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 20.10
Median TPOT (ms): 19.93
P99 TPOT (ms): 24.28
---------------Inter-token Latency----------------
Mean ITL (ms): 19.98
Median ITL (ms): 19.60
P99 ITL (ms): 44.31
==================================================
The output of the second run:
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 203.40
Total input tokens: 42659
Total generated tokens: **38536**
Request throughput (req/s): 0.98
Output token throughput (tok/s): 189.46
Total Token throughput (tok/s): 399.20
---------------Time to First Token----------------
Mean TTFT (ms): 60.23
Median TTFT (ms): 64.19
P99 TTFT (ms): 127.43
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 20.01
Median TPOT (ms): 19.87
P99 TPOT (ms): 22.67
---------------Inter-token Latency----------------
Mean ITL (ms): 19.93
Median ITL (ms): 19.57
P99 ITL (ms): 43.91
==================================================
Even if I relaunch the server before the second run, the nondeterminism persists.
Before submitting a new issue...
[X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Can you save the JSON results and see which individual requests have a different number of output tokens? We should be able to inspect the generated text too.
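For reference, a minimal sketch of such a per-request comparison, assuming both benchmark runs were launched with --save-result so each writes a result JSON, and assuming the file exposes per-request fields named output_lens and generated_texts (the exact keys vary by vLLM version, so adjust as needed):

```python
import json
import sys


def load(path):
    """Load a result JSON written by benchmark_serving.py with --save-result."""
    with open(path) as f:
        return json.load(f)


def diff_runs(path_a, path_b):
    """Print per-request differences in output length between two runs."""
    a, b = load(path_a), load(path_b)
    # NOTE: "output_lens" and "generated_texts" are assumptions about the
    # detailed result format; adjust the keys to whatever your JSON contains.
    lens_a, lens_b = a["output_lens"], b["output_lens"]
    texts_a = a.get("generated_texts", [None] * len(lens_a))
    texts_b = b.get("generated_texts", [None] * len(lens_b))

    for i, (la, lb) in enumerate(zip(lens_a, lens_b)):
        if la != lb:
            print(f"request {i}: {la} vs {lb} output tokens")
            if texts_a[i] is not None and texts_b[i] is not None:
                print(f"  run 1 text: {texts_a[i]!r}")
                print(f"  run 2 text: {texts_b[i]!r}")


if __name__ == "__main__":
    diff_runs(sys.argv[1], sys.argv[2])
```

Running it as, say, `python diff_runs.py run1.json run2.json` (the script name here is just illustrative) would flag exactly which requests diverged between the two runs and show their generated text for inspection.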