hao-ai-lab / LookaheadDecoding

[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
https://arxiv.org/abs/2402.02057
Apache License 2.0

Qs on Understanding Lookahead and Jacobi #37

Closed RonanKMcGovern closed 9 months ago

RonanKMcGovern commented 9 months ago

Thanks for putting this blog together.

  1. Regarding simple Jacobi decoding:

    • The process starts with some random guesses for future tokens, correct?
    • As the process continues, the guesses improve a little, but on average they remain quite poor, because each guess is conditioned on preceding tokens that are themselves wrong. So only occasionally is a guess correct, which then lets the next token be accepted as well. (I've sketched my understanding of this loop at the end of this comment.)
  2. Regarding look ahead

    • Basically, a bank of n-grams is built up, and those n-grams come from decoding the guessed input tokens. Correct?
    • I suppose adding the prompt itself (and any confirmed generated tokens) as a source of n-grams would probably also improve performance?
  3. Regarding Jacobi: Jacobi is mentioned a lot in the blog, but I don't really see it as central... Basically we're just randomly guessing tokens up to W positions away, using previous forward passes to improve the quality of the guesses within the window, and then using those guesses as an n-gram database?

To further improve the quality of guesses, would it be an idea to mask out the input effect completely rather than guessing the tokens? My sense is that, because attention to nearby tokens is so strong, guessing the tokens is worse than passing through blank information at those guess positions. That would let the decoded output be based purely on information from tokens we know with 100% certainty.
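
For concreteness, here's roughly how I picture the plain Jacobi loop from point 1, as a sketch against a Hugging Face causal LM (the function name and the stopping rule are just illustrative, not this repo's API):

```python
# Minimal sketch of plain (greedy) Jacobi decoding over a fixed window.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


@torch.no_grad()
def jacobi_decode(model, tokenizer, prompt, window=8, max_passes=32):
    prefix = tokenizer(prompt, return_tensors="pt").input_ids        # (1, P)
    # Start from random guesses for the next `window` tokens.
    guesses = torch.randint(0, model.config.vocab_size, (1, window))
    for _ in range(max_passes):
        input_ids = torch.cat([prefix, guesses], dim=-1)
        logits = model(input_ids).logits                             # (1, P+W, V)
        # Greedy prediction for every guess position, each conditioned on the
        # (possibly still wrong) tokens before it.
        new_guesses = logits[:, prefix.shape[1] - 1 : -1, :].argmax(dim=-1)
        if torch.equal(new_guesses, guesses):
            break  # fixed point: the window now agrees with greedy decoding
        guesses = new_guesses
    return tokenizer.decode(guesses[0])


# Usage (any small causal LM works for illustration):
# tok = AutoTokenizer.from_pretrained("gpt2")
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# print(jacobi_decode(model, tok, "The capital of France is"))
```

As I understand it, in practice you would also accept the longest already-converged prefix of the window after each pass rather than waiting for the whole window to reach a fixed point.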

shermansiu commented 9 months ago

I'm not one of the authors, but I can answer.

  1. Yes, random guesses are used initially. See #8. But by the time we've gone through enough passes (roughly the length of the window), the guesses are no longer completely random (though still noisy).
  2. Yes, a bank of n-grams is built up while decoding the guessed input tokens (see the sketch after this list).
  3. Jacobi is central to the idea because we are using the context from the known prefix to generate the subsequent token guesses. The idea is that, because the contribution from the known prefix is larger than that of the guessed tokens after the first few passes, tokens closer to the known prefix will be guessed more accurately. There are no guarantees, though.

As for your idea, it's certainly viable to use either [MASK] tokens or zero-embeddings. Even if it improves performance, I don't think the improvement will be huge, but you're welcome to try it out.
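
For reference, the initialization options under discussion look roughly like this with a Hugging Face model (illustrative only; gpt2 is just a stand-in and defines no pad token by default, hence the reassignment):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

prefix_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
window = 8

# (a) Random vocabulary tokens: what lookahead decoding starts from.
init_random = torch.randint(0, model.config.vocab_size, (1, window))

# (b) [PAD]/[MASK]-token initialization.
init_pad = torch.full((1, window), tokenizer.pad_token_id)

# (c) "Blank information": skip the embedding lookup for the guess positions
#     and feed zero vectors via `inputs_embeds`.
prefix_embeds = model.get_input_embeddings()(prefix_ids)             # (1, P, d)
zero_embeds = torch.zeros(1, window, prefix_embeds.shape[-1])        # (1, W, d)
logits = model(inputs_embeds=torch.cat([prefix_embeds, zero_embeds], dim=1)).logits
```

Variant (c) is the closest to "passing through blank information"; whether it actually beats random-token initialization would have to be measured.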

RonanKMcGovern commented 9 months ago

Thanks very much. As an aside, apparently TGI tried lookahead decoding but found little speedup for the added compute.

shermansiu commented 9 months ago

You mean Huggingface's transformers, right? Not TGI. But yeah, both Joao Gante and Louis-y-nlp in #19 noticed that you don't get much of a speedup if you don't have the FLOPS to spare.

RonanKMcGovern commented 9 months ago

Yeah makes sense. The comment I was referencing is this one: https://github.com/huggingface/text-generation-inference/issues/1169#issuecomment-1866069892

Thanks

shermansiu commented 9 months ago

I'm assuming that when Olivier Dehaene mentioned it was tested internally, he was referring to Joao Gante's test (Gante works at Hugging Face). See https://github.com/huggingface/transformers/issues/27649#issuecomment-1824621466 for details.

RonanKMcGovern commented 9 months ago

Wow, yeah, that's a great post from Joao, thanks for sharing. I didn't appreciate that FA2 compatibility was a consideration too.

shermansiu commented 8 months ago

Incidentally, it seems like the original Jacobi decoding paper uses [PAD] tokens instead of random vocabulary tokens.