vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Continuous streaming of `UsageInfo` #5708

Closed tdoublep closed 2 months ago

tdoublep commented 3 months ago

🚀 The feature, motivation and pitch

TLDR: We would like an option we can enable to continuously stream `UsageInfo` when using the streaming completions API. This would solve a number of "accounting" problems we have encountered while trying to do accurate performance evaluation.

Motivation: We are working on performance evaluation of vLLM's more advanced features (chunked prefill, speculative decoding) and have run into a few problems that we feel would be solved by adding a simple new feature. Our benchmarking framework fmperf computes ITL by measuring the latency between consecutive streaming responses; in addition, it computes throughput by inspecting each streaming response and counting the number of tokens in it. Currently vLLM provides no information about how many tokens are contained within each response. In most situations it is just a single token. However, there are a few scenarios where this is not the case (see the sketch after the list below):

  1. When chunked prefill is enabled and a prompt gets chunked across multiple iterations, the first few responses may contain zero tokens. There is no special indication that this has happened beyond an empty string "" being returned. We have found scenarios where "" is actually a valid token, so simply ignoring such responses may lead to us discarding responses that are actually valid. It would be nice to have some explicit indication that a response is truly empty.
  2. When speculative decoding is enabled, each streaming response may contain more than one token. Right now we just get the text, rather than the actual tokens, so we can't tell exactly how many tokens were generated without either (a) running the text through a tokenizer again or (b) enabling logprobs or similar, which may have performance implications for speculative decoding.
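
To make the problem concrete, here is a minimal sketch of the kind of naive per-chunk accounting described above (not fmperf's actual code). It assumes a vLLM OpenAI-compatible server on `localhost:8000`, the `openai` Python client, and a placeholder model name.

```python
# Naive client-side accounting sketch: one token assumed per streamed chunk.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.completions.create(
    model="my-model",  # placeholder
    prompt="Explain KV caching in one paragraph.",
    max_tokens=128,
    stream=True,
)

prev = time.perf_counter()
gaps = []
naive_token_count = 0

for chunk in stream:
    now = time.perf_counter()
    gaps.append(now - prev)
    prev = now

    text = chunk.choices[0].text
    # Naive assumption: one token per chunk. This breaks when
    #  - chunked prefill emits a chunk with text == "" (zero new tokens), and
    #  - speculative decoding emits a chunk carrying several tokens at once.
    naive_token_count += 1

# gaps[0] is effectively time-to-first-token; the remaining gaps are ITLs.
itls = gaps[1:]
print(f"chunks (assumed tokens): {naive_token_count}")
print(f"mean ITL: {sum(itls) / len(itls):.4f}s")
```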

We have considered an architectural change to fmperf where, instead of counting the tokens in each streaming response, we simply wait until the final response has been received. vLLM provides the OpenAI-compliant include_usage option, which gives us all the stats we need at the very end. This is helpful, but when we are benchmarking we often want to run an experiment for a specific duration, and requests that run over that duration get cancelled on the client side. Currently, we would have no way to account for the tokens that were generated in such a partially-executed request. Similarly, if a request fails for whatever reason partway through its execution, we have no way to get the stats out. There were a bunch of comments from people with similar concerns (e.g. here and here) when OpenAI announced this feature.
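
For reference, this is roughly what the existing end-of-stream behaviour looks like from the client side, under the same assumptions as the sketch above (local server, placeholder model name). The key point is that the totals only arrive in the final chunk, so a request that is cancelled or fails mid-stream never sees them.

```python
# Sketch of the existing OpenAI-compliant include_usage behaviour.
stream = client.completions.create(
    model="my-model",  # placeholder
    prompt="Explain KV caching in one paragraph.",
    max_tokens=128,
    stream=True,
    stream_options={"include_usage": True},
)

usage = None
for chunk in stream:
    # With include_usage, intermediate chunks carry usage=None; only the final
    # chunk carries the totals. If the client cancels or the request fails
    # partway through, that final chunk never arrives and the counts are lost.
    if chunk.usage is not None:
        usage = chunk.usage

print(usage)  # prompt_tokens / completion_tokens / total_tokens, or None
```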

We would like to propose adding a simple new option to vLLM to enable continuous streaming of the usage stats with every single streaming response. It is not a big change, but it enables more accurate accounting of the number of tokens generated. The default behaviour can remain as-is (usage stats at the end of the request).
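
Purely as an illustration of how a client could consume such an option, here is a hedged sketch. The flag name `continuous_usage_stats` and its placement inside `stream_options` are placeholders for whatever the eventual implementation chooses; it is passed via `extra_body` since it would be a vLLM-specific extension to the OpenAI schema.

```python
# Illustrative sketch only: per-chunk cumulative usage under the proposed
# (hypothetical) continuous_usage_stats option.
stream = client.completions.create(
    model="my-model",  # placeholder
    prompt="Explain KV caching in one paragraph.",
    max_tokens=128,
    stream=True,
    extra_body={"stream_options": {"include_usage": True,
                                   "continuous_usage_stats": True}},  # hypothetical flag
)

completion_tokens_seen = 0
try:
    for chunk in stream:
        if chunk.usage is not None:
            # Cumulative count so far; overwriting keeps the latest value.
            completion_tokens_seen = chunk.usage.completion_tokens
except Exception:
    pass  # even on cancellation/failure we keep the last observed count

print(f"completion tokens accounted for: {completion_tokens_seen}")
```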

Alternatives

see above

Additional context

We are preparing a PR with this feature which we will post shortly.

simon-mo commented 3 months ago

I think the change is welcome! I would even suggest turning this on by default because it helps client-side token counting in general!