hao-ai-lab / LookaheadDecoding


Questions on the attention mask, and whether to accept the last element of guess_results when all guess_tokens are accepted #32

Open YingHH1 opened 7 months ago

YingHH1 commented 7 months ago

It was mentioned in https://github.com/hao-ai-lab/LookaheadDecoding/issues/14 that yellow 7 can see orange 1-4, green 5, and red 6. However, my understanding was that orange 4, green 5, red 6, and yellow 7 form a 4-gram, and that orange 1-3 are irrelevant here and should be masked. Or am I misunderstanding something?

On a different question: if all guess_tokens match guess_results[0:-1], should the Lookahead step also accept the last element guess_results[-1] (since this then forms a complete sentence)?

Many thanks for the help

hsm1997 commented 7 months ago

it was orange 4, green 5, red 6 and yellow 7 that form a 4-gram,

Yes, but orange 1-3 form another 3-gram right before this 4-gram, so they should also be visible to yellow 7, so that yellow 7 attends to a complete sentence.

should the Lookahead step also accept the last element guess_results[-1]?

I think this possibly works as well. But conceptually, the guess_results are just used to verify the guess_tokens. The tokens to be accepted should be chosen from the "verified guess tokens", not from the "tokens used for verification".
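To make that concrete, here is a rough sketch of the verification step as I understand it. This is not the repo's actual code, and the layout of guess_results below is an assumption based on the question:

```python
def verify_one_ngram(guess_tokens, guess_results, accept_bonus=False):
    """Return the tokens accepted from one candidate n-gram.

    Assumed layout (may differ from the repo's actual code):
    len(guess_results) == len(guess_tokens) + 1, guess_results[i] is the
    model's greedy output for the position just before guess_tokens[i],
    and guess_results[-1] is the output after the last guess token.
    """
    accepted = []
    for guess, expected in zip(guess_tokens, guess_results):
        if guess != expected:
            break
        accepted.append(guess)  # accepted tokens are the verified guesses
    if accept_bonus and len(accepted) == len(guess_tokens):
        # All guesses were verified, so guess_results[-1] is conditioned on a
        # fully verified prefix and appending it would still match exact
        # greedy decoding; conceptually, though, I would keep guess_results
        # for verification only, as said above.
        accepted.append(guess_results[-1])
    return accepted


# Toy example with integer token ids: all three guesses match, so a fourth
# token could also be emitted if the bonus token were taken.
print(verify_one_ngram([11, 12, 13], [11, 12, 13, 14]))                     # [11, 12, 13]
print(verify_one_ngram([11, 12, 13], [11, 12, 13, 14], accept_bonus=True))  # [11, 12, 13, 14]
```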

YingHH1 commented 7 months ago

Yes, but orange 1-3 form another 3-gram right before this 4-gram, so they should also be visible to yellow 7, so that yellow 7 attends to a complete sentence.

But I do not see why orange 1-4 should have any connections at all. They are part of different 4-grams, and orange 1-3 is not a 3-gram (N-grams are those across the different colours), if I am not mistaken. When I inspect an example, I see that the tokens in orange 1-4 do not form a coherent phrase.

hsm1997 commented 7 months ago

N-grams are those across the different colours

I guess that, since the author assigns each token a specific number in blog figure 5, that number stands for the token's "expected position index" within the context.

I see tokens in orange 1-4 do not form a coherent phrase

I see that, as the decoding process goes on, orange 1-4 gradually do form a coherent phrase, for example at steps=15: [screenshot: 2023-12-08 15:13:28]

YingHH1 commented 7 months ago

Thank you very much for the response.

I still have trouble understanding why orange 1-4 should have connections. I guess this is because we use a causal mask in the first context-decoding step (where a conventional triangular mask is used, so that the orange tokens can see their preceding tokens). Is this the reason?

If so, why not do the same for green 1-5 and red 1-5, so that they can also see their preceding tokens (i.e., six lower triangles in the mask, as opposed to the current three under the orange tokens)?

hsm1997 commented 7 months ago

Orange 1-4 are "guessed" tokens, but the "collected 4-grams" are indeed generated in an autoregressive pattern. And if you

do the same for green 1-5 and red 1-5

there would be no autoregressive pattern within the "guess decoding" process, and the probability that an n-gram guess is right might decrease (p.s. just a personal guess here :-)). Besides, there should not be six lower triangles in the mask: for example, green 5 cannot attend to orange 1-4 and green 1-4 at the same time.
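If it helps, here is how I would write down that visibility rule as code. This is only a sketch of my reading of blog figure 5 and of the discussion above, not the repo's actual mask-construction code, and the level/position indexing is an assumption:

```python
# Level 0 = orange, 1 = green, 2 = red, 3 = yellow; position indices follow
# the numbers in the figure. Every lookahead token also attends to the
# prompt / committed (blue) tokens, which are omitted here.

def visible_positions(level, pos):
    """Return the (level, pos) pairs that a lookahead token at (level, pos)
    may attend to within the lookahead branch."""
    visible = {(level, pos)}                                 # itself
    # the oldest (orange) level supplies the prefix 1 .. pos - level
    visible |= {(0, p) for p in range(1, pos - level + 1)}
    # plus exactly one token from each intermediate level, on the diagonal
    visible |= {(l, pos - level + l) for l in range(1, level)}
    return visible

# yellow 7 -> orange 1-4, green 5, red 6 (plus itself), as stated in #14
print(sorted(visible_positions(3, 7)))
# green 5 -> orange 1-4 only, not green 1-4
print(sorted(visible_positions(1, 5)))
# orange 4 -> orange 1-4, i.e. the lower triangle under the orange tokens
print(sorted(visible_positions(0, 4)))
```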

YingHH1 commented 7 months ago

I think I am slowly grasping what is happening here. We need blue 0 and orange 1-4 to build connections so that their corresponding 4-grams (five collected 4-grams in this case) are all relevant to the prompt context. Otherwise, some of the 4-grams would be useless, since they would have almost no connection to the prompt context (even if they are coherent 4-grams on their own). Thus, connecting blue 0 and orange 1-4 in an autoregressive manner can lead to a better acceptance rate, since they are the first tokens of the collected 4-grams.

I guess this is the reason why we want blue 0 and orange 1-4 to form a sentence :)