hao-ai-lab / LookaheadDecoding


question about attention patterns #28

Closed SUDA-HLT-ywfang closed 7 months ago

SUDA-HLT-ywfang commented 7 months ago

Hi! In Figure 5 of the blog, it seems like tokens of the current iteration attend to tokens from previous iterations. For example, the token at position 6 in red attends to the token at position 5 in green. But in Jacobi decoding, isn't it supposed to attend to tokens from the current iteration? That is, the token at position 6 in red should attend to the token at position 5 in red.

Viol2000 commented 7 months ago

Hi, thanks for your interest! We are a bit different from Jacobi decoding, and the numbers in Figure 5 show relative positions (assuming the current input token is position 0).

SUDA-HLT-ywfang commented 7 months ago

Thank you for your reply! I'm still a little bit confused.

  1. Without the basic form of Jacobi decoding, how can lookahead decoding guarantee exactly the same results as autoregressive decoding?
  2. From my understanding, if the sequence is [a, b, c, d, e] and "c" is position 0, then the input is [a, b, c] and "e" is position 2. Is that right?
Viol2000 commented 7 months ago

Hi,

  1. We use a verification branch to guarantee the output is the same as autoregressive decoding. For example, in Figure 5, we verify two speculations: deep blue 0 + upper blue 1, 2, 3, and deep blue 0 + lower blue 1, 2, 3. This verification is similar to speculative decoding. For example, we compare the softmax output of deep blue 0 against upper blue 1. If it matches, we accept upper blue 1 as the next token and go on to compare upper blue 1's output against upper blue 2, and so on (see the sketch after this list).
  2. Yes. If the sequence is [a, b, c, d, e] and 'c' is the current input at position 0, then d is position 1 and e is position 2. Note that a and b are not part of the input (they are stored in the KV cache).
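A minimal sketch of the greedy verification in point 1, assuming a hypothetical helper `greedy_next(prefix)` that returns the model's argmax next token for a given prefix. In lookahead decoding the verification actually happens inside one forward pass via the attention mask in Figure 5; separate calls are used here only for clarity:

```python
def verify_ngram(prefix, guesses, greedy_next):
    """Return the tokens accepted from one speculated n-gram, plus one corrected/extra token.

    `prefix` is the already-verified context (a list of token ids ending with deep blue 0),
    `guesses` is a speculated n-gram such as [upper blue 1, 2, 3].
    """
    accepted = []
    for guess in guesses:
        next_token = greedy_next(prefix + accepted)
        if next_token == guess:
            # The guess matches the model's own greedy choice: accept it and
            # move on to verify the following guessed token.
            accepted.append(guess)
        else:
            # Mismatch: keep the model's token and discard the rest of the n-gram,
            # so the output is identical to plain autoregressive decoding.
            accepted.append(next_token)
            break
    else:
        # Every guess matched; one more freely generated token comes along.
        accepted.append(greedy_next(prefix + accepted))
    return accepted
```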
SUDA-HLT-ywfang commented 7 months ago

In Figure 5, position 6 in red actually attends to position 5 in green (the red arrow), instead of position 5 in red (the green arrow). Why is that, considering that position 5 in red is the latest iteration's result? Is it so that you can get a more accurate trajectory with attention like this?

(image: Figure 5 annotated with the red and green arrows described above)
Viol2000 commented 7 months ago

Hi @FrankCast1e , my idea is that the red 6 is generated from the sequence: some 3, orange 4, green 5. This gives a strong local relation if these 3, 4, 5 tokens can form an n-gram. In the next turn, we can use orange 4, green 5, and red 6 to generate the next token and form another meaningful n-gram. If you use red 5 as the previous token of red 6, I think it does not make much sense, as red 6 has no relationship with red 5, and it may not generate a meaningful n-gram. Also, if you make red 5 the preceding token of red 6, which token should then precede red 5? I think that would need to be carefully investigated and would form an alternative solution.
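A minimal sketch of the n-gram collection described above, assuming each lookahead level (orange, green, red in Figure 5) is stored as a dict mapping relative position to token; the names and layout are illustrative, not the repository's actual code:

```python
def collect_ngrams(levels, n):
    """Collect candidate n-grams such as (orange 4, green 5, red 6): one token per
    level, with the relative position advancing by one at each level."""
    assert len(levels) == n  # oldest level first, newest (just generated) level last
    ngrams = []
    for start_pos in sorted(levels[0]):
        try:
            ngram = tuple(levels[i][start_pos + i] for i in range(n))
        except KeyError:
            continue  # near the window edge some levels lack the needed position
        ngrams.append(ngram)
    return ngrams


# Example matching the figure: 3-grams from three lookahead levels.
orange = {3: "tok_o3", 4: "tok_o4", 5: "tok_o5"}
green = {4: "tok_g4", 5: "tok_g5", 6: "tok_g6"}
red = {5: "tok_r5", 6: "tok_r6", 7: "tok_r7"}
print(collect_ngrams([orange, green, red], 3))
# -> [('tok_o3', 'tok_g4', 'tok_r5'), ('tok_o4', 'tok_g5', 'tok_r6'), ('tok_o5', 'tok_g6', 'tok_r7')]
```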

SUDA-HLT-ywfang commented 7 months ago

Thank you very much for your explanation! I totally get the idea now.