Your current environment

4xH100.

Model Input Dumps

No response

🐛 Describe the bug

When benchmarking the performance of vLLM with benchmark_serving.py, it generates a different number of tokens on different runs.

Code to launch vllm server

Code to run the benchmark

If I run the benchmark_serving.py script twice, the number of generated tokens differs between the two runs.

The output of the first run:
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 203.41
Total input tokens: 42659
Total generated tokens: **38614**
Request throughput (req/s): 0.98
Output token throughput (tok/s): 189.84
Total Token throughput (tok/s): 399.56
---------------Time to First Token----------------
Mean TTFT (ms): 62.95
Median TTFT (ms): 64.68
P99 TTFT (ms): 141.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 20.10
Median TPOT (ms): 19.93
P99 TPOT (ms): 24.28
---------------Inter-token Latency----------------
Mean ITL (ms): 19.98
Median ITL (ms): 19.60
P99 ITL (ms): 44.31
==================================================
The output of the second run:
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 203.40
Total input tokens: 42659
Total generated tokens: **38536**
Request throughput (req/s): 0.98
Output token throughput (tok/s): 189.46
Total Token throughput (tok/s): 399.20
---------------Time to First Token----------------
Mean TTFT (ms): 60.23
Median TTFT (ms): 64.19
P99 TTFT (ms): 127.43
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 20.01
Median TPOT (ms): 19.87
P99 TPOT (ms): 22.67
---------------Inter-token Latency----------------
Mean ITL (ms): 19.93
Median ITL (ms): 19.57
P99 ITL (ms): 43.91
==================================================
Even if I relaunch the server before the second run, the nondeterminism persists.
Before submitting a new issue...
[X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Can you save the JSON results and see which individual requests have a different number of output tokens? We should be able to inspect the generated text too.
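For reference, a minimal sketch of such a per-request comparison, assuming both benchmark runs were launched with --save-result so each writes a result JSON, and assuming the file exposes per-request fields named output_lens and generated_texts (the exact keys vary by vLLM version, so adjust as needed):

```python
import json
import sys


def load(path):
    """Load a result JSON written by benchmark_serving.py with --save-result."""
    with open(path) as f:
        return json.load(f)


def diff_runs(path_a, path_b):
    """Print per-request differences in output length between two runs."""
    a, b = load(path_a), load(path_b)
    # NOTE: "output_lens" and "generated_texts" are assumptions about the
    # detailed result format; adjust the keys to whatever your JSON contains.
    lens_a, lens_b = a["output_lens"], b["output_lens"]
    texts_a = a.get("generated_texts", [None] * len(lens_a))
    texts_b = b.get("generated_texts", [None] * len(lens_b))

    for i, (la, lb) in enumerate(zip(lens_a, lens_b)):
        if la != lb:
            print(f"request {i}: {la} vs {lb} output tokens")
            if texts_a[i] is not None and texts_b[i] is not None:
                print(f"  run 1 text: {texts_a[i]!r}")
                print(f"  run 2 text: {texts_b[i]!r}")


if __name__ == "__main__":
    diff_runs(sys.argv[1], sys.argv[2])
```

Running it as, say, `python diff_runs.py run1.json run2.json` (the script name here is just illustrative) would flag exactly which requests diverged between the two runs and show their generated text for inspection.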