vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai

Lookahead decoding #1742

Open TheodoreGalanos opened 7 months ago

TheodoreGalanos commented 7 months ago

They claim lookahead decoding provides a 1.5~2x decoding speedup without a speculative model.

Blog post: https://lmsys.org/blog/2023-11-21-lookahead-decoding/
Twitter thread: https://twitter.com/lmsysorg/status/1727056892671950887
Reference implementation: https://github.com/hao-ai-lab/LookaheadDecoding/tree/main

Not sure if this would be possible to integrate into vllm?

MichaelJayW commented 7 months ago

Hi @WoosukKwon @zhuohan123, a 2x decoding speedup. Would it be possible to integrate this into vllm?

Tested it on an NVIDIA A10 with Llama 7B:

autoregressive: Generated Tokens: 256, Generation Speed: 22.75 tokens/s
lookahead decoding: Generated Tokens: 256, Generation Speed: 58.29 tokens/s
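
For context, a minimal sketch of how tokens/s numbers like these are usually measured with Hugging Face Transformers (the checkpoint name, prompt, and settings below are assumptions, not the exact benchmark used above):

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint; the numbers above were reported for Llama 7B on an A10.
model_id = "huggyllama/llama-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

inputs = tokenizer("Explain lookahead decoding.", return_tensors="pt").to(model.device)

torch.cuda.synchronize()
start = time.perf_counter()
# Greedy decoding of 256 new tokens, matching the "Generated Tokens: 256" setup above.
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"Generated Tokens: {new_tokens}  Generation Speed: {new_tokens / elapsed:.2f} tokens/s")
```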

luofuli commented 7 months ago

Does lookahead decoding only support greedy decoding?

shermansiu commented 7 months ago

Does lookahead decoding only support greedy decoding?

Seems like it.

shermansiu commented 7 months ago

It could probably also work for beam search decoding, but not sampling.

aarnphm commented 7 months ago

How does this compare to speculative decoding?

puddingfjz commented 7 months ago

Will lookahead decoding generate the same results as autoregressive decoding?

shermansiu commented 7 months ago

How does this compare to speculative decoding?

This doesn't require a smaller approximation model, also known as a "draft model" in the blog. This work aims to address the limitations of using one, namely that (1) the maximum speedup is limited by the token acceptance rate and (2) creating a good draft model is non-trivial.
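
To put a rough number on point 1, here is a toy calculation, following the analysis in the speculative decoding paper (Leviathan et al., 2023), of how the draft model's token acceptance rate caps the expected number of tokens produced per draft-and-verify iteration (the acceptance rates and draft length below are made-up illustrative values):

```python
# Expected number of target-model tokens per draft-and-verify iteration,
# with per-token acceptance rate `alpha` and draft length `gamma`
# (geometric-series result from the speculative decoding analysis).
def expected_tokens_per_iteration(alpha: float, gamma: int) -> float:
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha}: {expected_tokens_per_iteration(alpha, gamma=4):.2f} tokens/iteration")

# A weak draft model (low alpha) caps the achievable speedup, which is the
# limitation lookahead decoding sidesteps by not using a draft model at all.
```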

shermansiu commented 7 months ago

Will lookahead decoding generate the same results as autoregressive decoding?

Yes, it produces the exact same results that would be created by greedy decoding. This work can be easily extended to work with beam search decoding, but not with sampling (at least not in a trivial way).

zhisbug commented 7 months ago

It could probably also work for beam search decoding, but not sampling.

We're working on a feature to support top-p or top-k sampling. It is doable.

zhisbug commented 7 months ago

How does this compare to speculative decoding?

Thanks @shermansiu for the nice answer. My additional points:

  1. Lookahead decoding has no theoretical upper bound on the achievable speedup. As long as you're willing to pay FLOPs, you can get unlimited speedup.
  2. No draft model or data store is needed. Draft models or any auxiliary components can be hard to obtain and difficult to deploy.

simon-mo commented 7 months ago

The article does mention the limitation on throughput.

Given this property, lookahead decoding should be used in scenarios where latency is vital, e.g., surplus FLOPs exist that can be traded for latency, or it is even worthwhile to pay extra FLOPs for latency.

Does this mean there will be slowdowns when there is more than one concurrent request in the batch, especially on non-A100 machines used for serving? @zhisbug curious to hear whether any evaluation has been done on the high-throughput front, and to learn more about the FLOPs and memory penalties of this approach. 🙏

zhisbug commented 7 months ago

The article does mention the limitation on throughput.

Given this property, lookahead decoding should be used in scenarios where latency is vital, e.g., surplus FLOPs exist that can be traded for latency, or it is even worthwhile to pay extra FLOPs for latency.

Does this mean there will be slowdowns when there is more than one concurrent request in the batch, especially on non-A100 machines used for serving? @zhisbug curious to hear whether any evaluation has been done on the high-throughput front, and to learn more about the FLOPs and memory penalties of this approach. 🙏

Lookahead decoding makes it possible to allocate FLOPs for latency reduction, which was previously infeasible because of the sequential dependency. Re: your question, it boils down to a scheduling problem: how do you plan to allocate your FLOPs, for throughput or for latency? The problem is not yet solved if you consider both together.

puddingfjz commented 7 months ago

Will lookahead decoding generate the same results as autoregressive decoding?

Yes, it produces the exact same results that would be created by greedy decoding. This work can be easily extended to work with beam search decoding, but not with sampling (at least not in a trivial way).

But when lookahead decoding selects some n-gram from the pool, could the tokens in the chosen n-gram differ from what greedy decoding would generate? (Because I think the n-gram may not have been generated starting from the current position.)

shermansiu commented 7 months ago

But when lookahead decoding selects some n-gram from the pool, could the tokens in the chosen n-gram differ from what greedy decoding would generate? (Because I think the n-gram may not have been generated starting from the current position.)

The verification step checks whether they are the same and only adds the n-gram to the output if they are.
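
For illustration, a minimal sketch of that verify-and-accept step (the `greedy_next_token` helper is hypothetical, standing in for an argmax over the model's logits; this is not the reference implementation's API). A candidate n-gram is only kept for as long as it matches what greedy decoding would have produced at each position, which is why the final output is identical to plain greedy decoding:

```python
from typing import Callable, List

def verify_candidate(
    prefix: List[int],
    candidate: List[int],
    greedy_next_token: Callable[[List[int]], int],  # hypothetical: argmax of model logits given a prefix
) -> List[int]:
    """Return the longest prefix of `candidate` that greedy decoding would also produce."""
    accepted: List[int] = []
    context = list(prefix)
    for token in candidate:
        # The real algorithm checks all candidate n-grams in a single batched forward
        # pass; this sequential loop is written only for clarity.
        if greedy_next_token(context) != token:
            break
        accepted.append(token)
        context.append(token)
    return accepted
```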

shermansiu commented 7 months ago

Also, it seems that on smaller GPUs you don't get as large a performance increase, and if the parameters aren't well chosen, you can actually see a performance decrease.

See the attached GitHub comment for details: https://github.com/huggingface/transformers/issues/27649#issuecomment-1824621466

shermansiu commented 7 months ago

We're working on a feature to support top-p or top-k sampling. It is doable.

Also, for sampling decoding... I suppose you can approximate the conditional next token distribution, but then you'd be sampling from a "close enough"[^1] distribution guesstimated ahead of time, as opposed to from the true conditional distribution.

[^1]: The KL-divergence between the two distributions is below a certain configurable threshold
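
As a toy illustration of that footnote (purely hypothetical, not part of the lookahead decoding implementation), sampling from a guessed next-token distribution only when its KL divergence from the true conditional distribution is below a configurable threshold might look like this:

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p || q) between two next-token distributions."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def close_enough(true_dist: np.ndarray, guessed_dist: np.ndarray, threshold: float = 0.05) -> bool:
    # Only sample from the guessed (ahead-of-time) distribution if it is within
    # `threshold` nats of the true conditional distribution.
    return kl_divergence(true_dist, guessed_dist) <= threshold
```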

shermansiu commented 6 months ago

Also, note that discussions for using lookahead decoding with sampling are here: https://github.com/hao-ai-lab/LookaheadDecoding/pull/6

There is indeed a way to validate that the generation from a "close enough" distribution could have been a valid generation from the true distribution.

MeJerry215 commented 5 months ago

Their blog also says:

The 7B, 13B, and 33B models require 120x, 80x, and 56x extra FLOPs per step

I also tested this decoding method: with autoregressive decoding, the per-step input_ids shape is (1, 1), whereas with lookahead decoding it is (1, 103) for the Llama 13B model.

So it's low latency, but the extra cost is also significant.
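
A rough back-of-the-envelope for why the per-step input is so much wider: each step processes the lookahead branch plus the verification branch, not just one token. The counting below assumes the structure described in the blog post (window size W, n-gram size N, G candidate n-grams) and is not the reference implementation's exact bookkeeping:

```python
def approx_tokens_per_step(window: int, ngram: int, guesses: int) -> int:
    lookahead_branch = window * (ngram - 1)      # tokens kept for parallel n-gram generation
    verification_branch = guesses * (ngram - 1)  # tokens from candidate n-grams being verified
    return 1 + lookahead_branch + verification_branch  # +1 for the current token

# Autoregressive decoding feeds 1 token per step; a moderate lookahead configuration
# feeds on the order of a hundred, consistent with the (1, 103) shape observed above.
print(approx_tokens_per_step(window=10, ngram=6, guesses=10))  # -> 101
```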

creatorrr commented 4 months ago

Lookahead decoding now supports both FlashAttention and sampling.

https://github.com/hao-ai-lab/LookaheadDecoding