Closed: priyamtejaswin closed this issue 1 month ago
I found the problem. This is fixed.
I'll leave the issue open in case someone has thoughts on this approach. Now that the outputs are matching (with and without the interventions), I'll try to finish a draft and see if it works.
Hi @priyamtejaswin, do you have a draft already? I am also interested in taking this further (especially Lookahead + PagedAttention).
I believe once https://github.com/vllm-project/vllm/pull/2188 is merged you can add Lookahead as the proposer, since verification of tokens is the same.
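The "verification of tokens is the same" point can be sketched roughly as follows. This is a simplified, framework-free illustration, not vLLM's implementation: the `verify_greedy` function and the `next_token` callback interface are invented here, and a real engine would score all candidate positions in a single batched forward pass rather than one step at a time. Under greedy verification, the proposer's candidate tokens are accepted up to the longest prefix that matches the verifier model's own greedy choices, plus one bonus token from the verifier:

```python
from typing import Callable, List

def verify_greedy(
    prefix: List[int],
    candidates: List[int],
    next_token: Callable[[List[int]], int],
) -> List[int]:
    """Accept the longest prefix of `candidates` matching the verifier's
    greedy predictions; on mismatch (or full acceptance) emit one token
    from the verifier itself.

    `next_token(context) -> token_id` stands in for one greedy decoding
    step of the verifier model (a hypothetical interface).
    """
    accepted: List[int] = []
    context = list(prefix)
    for proposed in candidates:
        expected = next_token(context)
        if expected != proposed:
            # Mismatch: discard the rest and emit the verifier's token.
            accepted.append(expected)
            return accepted
        accepted.append(proposed)
        context.append(proposed)
    # All candidates matched; emit one extra verified token for free.
    accepted.append(next_token(context))
    return accepted

# Toy "model": always predicts previous token + 1.
model = lambda ctx: ctx[-1] + 1

print(verify_greedy([10], [11, 12, 99], model))  # [11, 12, 13]
print(verify_greedy([10], [11, 12, 13], model))  # [11, 12, 13, 14]
```

The output is identical to what plain greedy decoding would have produced; the proposer (Lookahead, a draft model, etc.) only changes how many verifier steps are amortized per pass.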
Lookahead decoding now supports both FlashAttention and sampling.
Closing this as a duplicate of #1742.
The work @cadedaniel mentioned has been completed and the discussion for this feature is more active in the issue I linked above.
Hi, @WoosukKwon and @zhuohan123 ,
Fantastic project!
I was taking a stab at implementing a version of greedy lookahead-decoding. Given some candidate completions, I was trying to `step` in the engine to parallelize the next-token prediction across candidates. I had a question about the behavior of `Sequence.append_token_id`, and its implications for future engine steps.

https://github.com/vllm-project/vllm/blob/24f60a54f42076e0bfa49fde113756bf4e95f9ef/vllm/sequence.py#L159-L167
From the looks of it, if I append a token here, it should add the token to the appropriate blocks.
But when I try this in practice, I get a different output. Suppose the LLM is generating some sequence of tokens. I intervene after it has generated `912`, append `442` using `.append_token_id`, and then call `step()`. But the subsequent output does not match what the model generates without the intervention. Seeding is not the problem -- I have accounted for that.
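For context on the mechanics being discussed, here is a stripped-down sketch of block-based token storage in the spirit of `Sequence.append_token_id`. This is not vLLM's implementation: the class names, the block size, and the helper method are invented for illustration. The point is that appending a token writes it into the last logical block, allocating a new block when the current one is full:

```python
from typing import Dict, List

BLOCK_SIZE = 4  # invented for illustration; vLLM's block size differs

class ToyLogicalBlock:
    def __init__(self) -> None:
        self.token_ids: List[int] = []

    def is_full(self) -> bool:
        return len(self.token_ids) == BLOCK_SIZE

class ToySequence:
    """Simplified stand-in for a block-backed sequence."""

    def __init__(self, prompt_token_ids: List[int]) -> None:
        self.blocks: List[ToyLogicalBlock] = []
        for t in prompt_token_ids:
            self._append_to_blocks(t)
        self.output_token_ids: List[int] = []

    def _append_to_blocks(self, token_id: int) -> None:
        if not self.blocks or self.blocks[-1].is_full():
            self.blocks.append(ToyLogicalBlock())
        self.blocks[-1].token_ids.append(token_id)

    def append_token_id(self, token_id: int, logprobs: Dict[int, float]) -> None:
        # Mirrors the shape of the real method: record the token in the
        # logical blocks and in the output token list.
        assert token_id in logprobs
        self._append_to_blocks(token_id)
        self.output_token_ids.append(token_id)

seq = ToySequence(prompt_token_ids=[1, 2, 3, 4, 5])
seq.append_token_id(912, {912: -0.1})   # generated by the engine
seq.append_token_id(442, {442: -2.3})   # the manual intervention
print([b.token_ids for b in seq.blocks])  # [[1, 2, 3, 4], [5, 912, 442]]
```

In this toy picture a manually appended token lands in the logical blocks exactly like a sampled one, which is why the mismatched outputs were surprising; the real engine also maintains KV-cache and scheduler state that this sketch deliberately omits.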
Tagging some folks who had previously participated in lookahead/speculative decoding discussions: @simon-mo @LiuXiaoxuanPKU @skrider @beginlner