
Lookahead decoding | forking + "appending" to child sequences. #1970

Closed. priyamtejaswin closed this issue 1 month ago.

priyamtejaswin commented 10 months ago

Hi @WoosukKwon and @zhuohan123,

Fantastic project!

I was taking a stab at implementing a version of greedy lookahead decoding. Given some candidate completions, I was trying to do the following (see the sketch after this list):

  1. Fork children from the parent sequence
  2. Append new tokens (from the candidates) to the child sequences
  3. Call step() in the engine to parallelize next-token prediction across the candidates
  4. Verify and select the longest matching prefix
  5. Append this prefix to the parent sequence
  6. Discard all child sequences created in step #1
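
To make the intent concrete, here is a minimal sketch of one reading of steps 1-6, written against a hypothetical next_token(tokens) callable rather than real vLLM APIs (in vLLM the per-child calls would be batched into a single step()):

```python
from typing import Callable, List


def lookahead_extend(
    parent: List[int],
    candidate: List[int],
    next_token: Callable[[List[int]], int],
) -> List[int]:
    """One greedy lookahead iteration for a single candidate continuation."""
    # Steps 1-2: fork one child per candidate prefix and append that prefix
    # (here the "fork" is just a new token list).
    children = [parent + candidate[:i] for i in range(len(candidate))]

    # Step 3: the model's own next-token prediction for every child;
    # conceptually one batched engine step.
    predictions = [next_token(child) for child in children]

    # Step 4: keep the longest prefix of the candidate that the model agrees with.
    verified: List[int] = []
    for cand_tok, pred_tok in zip(candidate, predictions):
        if cand_tok != pred_tok:
            break
        verified.append(cand_tok)

    # Step 5: append the verified prefix to the parent.
    # Step 6 is implicit: the forked children are local lists that get dropped.
    return parent + verified
```

With a greedy next_token this returns exactly the tokens greedy decoding would have produced, just with fewer sequential engine steps whenever the candidate is a good guess.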

I had a question about the behavior of Sequence.append_token_id and its implications for future engine steps.

https://github.com/vllm-project/vllm/blob/24f60a54f42076e0bfa49fde113756bf4e95f9ef/vllm/sequence.py#L159-L167

From the looks of it, if I append a token here, it should add the token to the appropriate blocks.
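
For intuition, this toy sketch (my own illustration, not vLLM's classes) shows what I understand "add the token to the appropriate blocks" to mean: the new id goes into the last logical block, and a fresh block is opened when the current one is full.

```python
BLOCK_SIZE = 16  # hypothetical block size


class ToyBlockedSequence:
    """Toy stand-in for a sequence whose tokens live in fixed-size logical blocks."""

    def __init__(self, prompt_token_ids):
        self.blocks = []  # each entry holds at most BLOCK_SIZE token ids
        for token_id in prompt_token_ids:
            self.append_token_id(token_id)

    def append_token_id(self, token_id):
        # Open a new logical block when the last one is full, then append.
        if not self.blocks or len(self.blocks[-1]) == BLOCK_SIZE:
            self.blocks.append([])
        self.blocks[-1].append(token_id)
```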

But when I try this in practice, I get a different output. Suppose the LLM was generating

912 -> 442 -> 42

I intervene after it has generated 912, append 442 using .append_token_id, and then call step(). But I see

912 -> 442 -> 10

Seeding is not the problem; I have accounted for that.
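
For concreteness, the intervention looks roughly like the sketch below. The model and prompt are placeholders, the token ids are just the ones from the example above, and the internals it touches (engine.scheduler.running, SequenceGroup.get_seqs, the append_token_id(token_id, logprobs) signature) are my reading of the current code, not a stable API:

```python
# Sketch of the intervention; internal attributes/signatures are assumptions.
from vllm import EngineArgs, LLMEngine, SamplingParams
from vllm.sequence import SequenceStatus

engine = LLMEngine.from_engine_args(EngineArgs(model="facebook/opt-125m"))
engine.add_request("lookahead-test", "The quick brown fox",
                   SamplingParams(temperature=0.0, max_tokens=8))

engine.step()  # prefill + first generated token (912 in the example)

# Intervention: append the candidate token (442 in the example) directly to
# the running sequence before the next step.
seq_group = engine.scheduler.running[0]
seq = seq_group.get_seqs(status=SequenceStatus.RUNNING)[0]
candidate = 442
seq.append_token_id(candidate, {candidate: 0.0})  # dummy logprob for the forced token

# The next step should now predict the token *after* 442 (42 in the example),
# but instead I see 10.
outputs = engine.step()
print(outputs[0].outputs[0].token_ids)
```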

Tagging some folks who had previously participated in lookahead/speculative decoding discussions. @simon-mo @LiuXiaoxuanPKU @skrider @beginlner

priyamtejaswin commented 10 months ago

I found the problem. This is fixed.

I'll leave the issue open in case someone has thoughts on this approach. Now that the outputs match (with and without the interventions), I'll try to finish a draft and see if it works.

learning-chip commented 9 months ago

> I'll try to finish a draft and see if it works.

Hi @priyamtejaswin, do you have a draft already? I am also interested in taking this further (especially Lookahead + PagedAttention).

cadedaniel commented 8 months ago

I believe once https://github.com/vllm-project/vllm/pull/2188 is merged, you can add Lookahead as the proposer, since the verification of tokens is the same.

creatorrr commented 7 months ago

Lookahead decoding now supports both FlashAttention and sampling.

https://github.com/hao-ai-lab/LookaheadDecoding

hmellor commented 1 month ago

Closing this as a duplicate of #1742.

The work @cadedaniel mentioned has been completed, and the discussion of this feature is more active in the issue linked above.