mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License

question about initial tokens #54

Open chaojiewang94 opened 9 months ago

chaojiewang94 commented 9 months ago

Thanks for your awesome work. I have some questions about the concept of initial tokens and the implementation of learnable initial tokens.

  1. In Fig. 2, you reach the conclusion that existing LLMs tend to pay more attention to the first (four) tokens of each sentence, and this conclusion still holds even if these tokens are replaced with meaningless tokens like '\n'.

So, can I understand this phenomenon as the position embeddings of the first (four) tokens playing an important role in attracting attention weights during the generation of the following tokens?

  2. In my understanding of your experimental implementation, you add a learnable initial token at the beginning of each sentence to store the attention bias. I am curious whether this learnable initial token produces the same effect as the special token (e.g., `<s>`) that the tokenizer usually adds automatically to indicate the beginning of the sentence.

Thanks

Guangxuan-Xiao commented 9 months ago

I appreciate your interest in our work.

  1. The model’s attention to initial tokens isn’t primarily due to position embeddings. In fact, the MPT model we utilize doesn’t employ position embeddings for encoding positions, yet it still exhibits a strong attentional bias towards initial tokens. This suggests the model has an alternative method of identifying initial tokens, possibly learning to recognize them based on their small context during attention calculations.
  2. The BOS token is generally added to the start of each paragraph before text segmentation, meaning it rarely appears as the initial token in pre-training samples. In our experiments, we intentionally set the first token as a placeholder, which behaves differently from the BOS token (see the sketch below).
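
For intuition, here is a minimal, hypothetical sketch of prepending a single learnable placeholder ("sink") embedding to every training sequence; `SinkTokenPrepender` and `hidden_size` are illustrative names, not code from this repo:

```python
import torch
import torch.nn as nn

class SinkTokenPrepender(nn.Module):
    """Prepends one learnable 'sink' embedding to every sequence (illustrative sketch)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # A single trainable embedding that every sequence starts with.
        self.sink = nn.Parameter(torch.zeros(1, 1, hidden_size))
        nn.init.normal_(self.sink, std=0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, hidden_size)
        batch = token_embeds.size(0)
        sink = self.sink.expand(batch, -1, -1)
        # The model then attends over [sink, token_0, ..., token_{n-1}].
        return torch.cat([sink, token_embeds], dim=1)
```

The key point is that the same trainable vector sits at position 0 of every sample, so the model can consistently use it as a place to deposit excess attention, unlike the BOS token, which (as noted above) rarely appears at the first position.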

Guangxuan

chaojiewang94 commented 9 months ago

Thanks for your reply; please allow me to ask some further questions.

  1. According to your explanation, the accumulation of multi-layer attention learned from autoregressive language modeling leads the attention weights at higher layers (not the first layer) to focus on the initial tokens (no matter whether they are '\n', `<s>`, or some other special tokens). Is that correct?

I am just a little curious about why this phenomenon also happens in the attention at the first layer (see the probing sketch at the end of this comment). The input to the first layer is the composition of the position embedding and the word (semantic) embedding. If changing the initial words (word embeddings) and removing the position embeddings (as you said) do not affect the conclusion, I do not understand why the "attention sink" would occur at the first layer; it seems the first-layer attention should look more like a uniform distribution.

  2. Do you mean that the format of your text corpus after preprocessing looks like `<s> context`?
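
For reference, here is a small probing sketch (the model name is just a placeholder) that measures how much attention each layer puts on the first four tokens, averaged over heads and the later query positions, which makes it easy to compare the first layer against higher layers:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any Hugging Face causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Streaming language models keep a few initial tokens as attention sinks in the KV cache."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions is a tuple with one tensor per layer, shaped (batch, heads, query, key).
for layer_idx, attn in enumerate(out.attentions):
    # Attention mass that queries at position >= 4 put on the first 4 key positions,
    # averaged over heads and query positions.
    mass = attn[0, :, 4:, :4].sum(dim=-1).mean().item()
    print(f"layer {layer_idx:2d}: mean attention on first 4 tokens = {mass:.3f}")
```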