Open TheodoreGalanos opened 7 months ago
Hi @WoosukKwon @zhuohan123 2x decoding speedup. Is this possible to integrate into vllm?
Tested it on nvidia A10 with llama 7B: autoregressive: Generated Tokens: 256 Generation Speed: 22.751232013756066 tokens/s lookahead decoding: Generated Tokens: 256 Generation Speed: 58.286426484772186 tokens/s
Does lookahead decoding only support greedy decoding?
Does lookahead decoding only support greedy decoding?
Seems like it.
It could probably also work for beam search decoding, but not sampling.
How is this compared to speculative decoding?
Will lookahead decoding generate the same results compared with autoregressive decoding?
How is this compared to speculative decoding?
This doesn't require a smaller approximation model, also known as a "draft model" in the blog. This work aims to address the limitations of using one, namely that 1. the maximum speedup is limited by the token acceptance rate and 2. creating a good draft model is non-trivial.
Will lookahead decoding generate the same results compared with autoregressive decoding?
Yes, it produces the exact same results that would be created by greedy decoding. This work can be easily extended to work with beam search decoding, but not with sampling (at least not in a trivial way).
It could probably also work for beam search decoding, but not sampling.
We're working on a feature to support top-p or top-k sampling. It is doable.
How is this compared to speculative decoding?
thank @shermansiu for your nice answer, my additional points:
The article does mention the limitation on throughput.
Given this property, lookahead decoding should be used in scenarios where latency is vital, e.g., surplus FLOPs exist that can be traded for latency, or it is even worthwhile to pay extra FLOPs for latency.
Does this mean there will be slowdowns when there are more than one concurrent requests in the batch, especially in non A100 machine used for serving? @zhisbug curious to hear if there's any evaluation done in high throughput front and learn more about the FLOPs and memory penalties of this approach. 🙏
The article does mention the limitation on throughput.
Given this property, lookahead decoding should be used in scenarios where latency is vital, e.g., surplus FLOPs exist that can be traded for latency, or it is even worthwhile to pay extra FLOPs for latency.
Does this mean there will be slowdowns when there are more than one concurrent requests in the batch, especially in non A100 machine used for serving? @zhisbug curious to hear if there's any evaluation done in high throughput front and learn more about the FLOPs and memory penalties of this approach. 🙏
Lookahead decoding makes it possible to allocate flops for latency reduction, which is infeasible before because of the sequential dependency. RE your problem: it boils down to a scheduling problem about how you plan to allocate your flops -- for throughput or latency. The problem is not yet solved if you consider them together.
Will lookahead decoding generate the same results compared with autoregressive decoding?
Yes, it produces the exact same results that would be created by greedy decoding. This work can be easily extended to work with beam search decoding, but not with sampling (at least not in a trivial way).
But when lookahead decoding selects some n-gram from the pool, will the tokens in the chosen n-gram be different from what greedy decoding will generate? (because I think the n-gram may not start from the current position when it is generated)
But when lookahead decoding selects some n-gram from the pool, will the tokens in the chosen n-gram be different from what greedy decoding will generate? (because I think the n-gram may not start from the current position when it is generated)
The verification step checks whether they are they same and only adds it to the output if they are.
Also, it seems like for smaller GPUs, you don't get as big of a performance increase and if the parameters aren't ideal, you actually get a performance decrease.
See the attached GitHub comment for details: https://github.com/huggingface/transformers/issues/27649#issuecomment-1824621466
We're working on a feature to support top-p or top-k sampling. It is doable.
Also, for sampling decoding... I suppose you can approximate the conditional next token distribution, but then you'd be sampling from a "close enough"[^1] distribution guesstimated ahead of time, as opposed to from the true conditional distribution.
[^1]: The KL-divergence between the two distributions is below a certain configurable threshold
Also, note that discussions for using lookahead decoding with sampling are here: https://github.com/hao-ai-lab/LookaheadDecoding/pull/6
There is indeed a way to validate that the generation from a "close enough" distribution could have been a valid generation from the true distribution.
In their blog also says
The 7B, 13B, and 33B models require 120x, 80x, and 56x extra FLOPs per step
And I aslo test this decoding method, in auto regression model input_ids' shape is (1, 1) , and LookAheadDecoding input_ids ' shape (1, 103) in llama 13b model.
so It's low lantency but the cost aslo is significant.
lookahead decoding supports flash attention and sampling both now.
They claim lookahead decoding provides a 1.5~2x decoding speedup without a speculative model.
Blog post: https://lmsys.org/blog/2023-11-21-lookahead-decoding/ Twitter thread: https://twitter.com/lmsysorg/status/1727056892671950887 Reference implementation: https://github.com/hao-ai-lab/LookaheadDecoding/tree/main
Not sure if this would be possible to integrate into vllm?