double-vin opened this issue 1 week ago
This is because `output_len` maps to `max_tokens` in `async_request_openai_completions` from https://github.com/vllm-project/vllm/blob/main/benchmarks/backend_request_func.py, which makes it an upper bound rather than a guarantee. If `min_tokens` or `ignore_eos` were added to the payload, then the model would generate exactly the number of tokens requested.
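For reference, a minimal sketch of what such a payload could look like (the helper name is hypothetical; `min_tokens` and `ignore_eos` are vLLM-specific extensions to the OpenAI-compatible completions API and may not be accepted by other backends):

```python
def build_completions_payload(model: str, prompt: str, output_len: int) -> dict:
    """Sketch of a completions payload where output_len is exact, not just an upper bound."""
    return {
        "model": model,
        "prompt": prompt,
        "temperature": 0.0,
        "max_tokens": output_len,   # upper bound on generated tokens
        "min_tokens": output_len,   # lower bound (vLLM extension)
        "ignore_eos": True,         # keep generating past EOS (vLLM extension)
        "stream": True,
    }
```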
@hmellor I wasn't aware of this, thanks for highlighting it. I think it would be best if we also enforced `output_len` via `min_tokens`, so behaviour is consistent across deployments.
Sounds good, it probably makes sense to do this for all the backends for consistency.
To clarify: back when we worked on this serving benchmark, `min_tokens` wasn't (and still isn't) an option for several backends, and in reality downstream tasks are more likely to specify `max_tokens` in their payload (at least that's what I've observed in practice). Perhaps we could open up the option to let users specify their own sampling params, but this does mean we need to keep track of what's available for each backend internally, which adds some maintenance overhead.
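One possible shape for that option, as a rough sketch only (the flag, the allow-list contents, and the function name are all hypothetical, not existing benchmark features): accept a JSON string of extra sampling parameters and filter it against a per-backend allow-list before merging it into the payload.

```python
import json

# Hypothetical allow-list: extra sampling params each backend is assumed to accept.
SUPPORTED_EXTRA_PARAMS = {
    "vllm": {"min_tokens", "ignore_eos", "top_k", "repetition_penalty"},
    "openai": {"frequency_penalty", "presence_penalty"},
}

def merge_extra_sampling_params(payload: dict, backend: str, extra_json: str) -> dict:
    """Merge user-supplied sampling params, dropping ones the backend can't handle."""
    extra = json.loads(extra_json)
    allowed = SUPPORTED_EXTRA_PARAMS.get(backend, set())
    unsupported = set(extra) - allowed
    if unsupported:
        print(f"Warning: ignoring params unsupported by {backend}: {sorted(unsupported)}")
    payload.update({k: v for k, v in extra.items() if k in allowed})
    return payload
```

This is exactly where the maintenance overhead comes in: the allow-list has to be kept in sync with each backend's API.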
> `min_tokens` wasn't (and still isn't) an option for multiple backends

Do they not support `ignore_eos`?
> in reality downstream tasks are more likely to specify `max_tokens` in their payload

Agreed, but for benchmarking, does it not make more sense to generate the number of tokens specified by `output_len`? Either that, or rename it to `max_output_len` to better represent what it actually does?
When I use the Qwen model, I still cannot control the total generated tokens by adding `min_tokens` to the payload.

Example command: `python benchmark_serving.py --model /models/Qwen1.5-7B-Chat --dataset-name random --trust-remote-code --num-prompts 1 --random-input-len 1024 --random-output-len 1024`

Output:

============ Serving Benchmark Result ============
Successful requests: 1
Benchmark duration (s): 28.40
Total input tokens: 1024
Total generated tokens: 23
Could you share exactly what change you made to add `min_tokens` to the payload? And could you try adding `ignore_eos` instead?
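As a quick sanity check outside the benchmark script, something like the following could show whether the server honours `ignore_eos` (this assumes a vLLM OpenAI-compatible server already running locally on port 8000 with the Qwen model; the model path is taken from the command above):

```python
import requests

# Send a single completions request with ignore_eos (vLLM extension) and
# check how many tokens the server reports it actually generated.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "/models/Qwen1.5-7B-Chat",
        "prompt": "Hello",
        "max_tokens": 1024,
        "ignore_eos": True,  # keep generating past EOS
    },
)
print(resp.json()["usage"]["completion_tokens"])  # expected to report 1024
```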
> When I use the Qwen model, I still cannot control the total generated tokens by adding `min_tokens` to the payload. [...] Total generated tokens: 23
I also encountered the same issue. I found that the server did indeed generate `--random-output-len` output tokens, but decoding some of those tokens produced empty strings, which leads to a smaller output-token count on the client side.
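That would explain the mismatch if the client counts output tokens by re-tokenizing the decoded text rather than using the server-reported count. A minimal illustration of the two counting strategies (the exact counting logic in benchmark_serving.py may differ; the model name and placeholder values here are assumptions):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen1.5-7B-Chat", trust_remote_code=True
)

generated_text = "..."           # text streamed back by the server (placeholder)
server_completion_tokens = 1024  # e.g. taken from the response "usage" field

# Client-side count from re-tokenizing the decoded text. This can be smaller
# than the number of tokens the server generated when some tokens decode to
# empty strings or merge differently during re-tokenization.
client_count = len(tokenizer(generated_text).input_ids)

print(f"server reported: {server_completion_tokens}, re-tokenized: {client_count}")
```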
Your current environment
How would you like to use vllm
Example command: `python benchmark_serving.py --model /models/Llama-2-7b-chat-hf/ --dataset-name random --trust-remote-code --num-prompts 1 --random-input-len 1024 --random-output-len 1024`

Output:

============ Serving Benchmark Result ============
Successful requests: 1
Benchmark duration (s): 18.68
Total input tokens: 1024
Total generated tokens: 699
Request throughput (req/s): 0.05
Output token throughput (tok/s): 37.42
Total Token throughput (tok/s): 92.23
---------------Time to First Token----------------
Mean TTFT (ms): 89.88
Median TTFT (ms): 89.88
P99 TTFT (ms): 89.88
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 26.63
Median TPOT (ms): 26.63
P99 TPOT (ms): 26.63
---------------Inter-token Latency----------------
Mean ITL (ms): 26.63
Median ITL (ms): 26.57
P99 ITL (ms): 27.49
How does the Total generated tokens value here match with `--random-output-len 1024`? I want to compare its performance with `benchmark_throughput.py`.