mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License

Question on intuition of "attention sink" and "alibi PE" #42

Closed bowencohere closed 9 months ago

bowencohere commented 9 months ago

Hi,

Thanks for the amazing work on streaming-llm. While reading the paper, I had a question about why the "attention sink" also works for models with ALiBi position embeddings. One observation for ALiBi-based models is that the added relative bias yields very small attention scores for the tokens at the beginning of a long sequence, since the attention score follows a roughly exponential decay with sequence length. Could you provide some explanation for why the attention sink still works in this scenario? Maybe I've missed something from the paper; thanks very much.
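To make the observation concrete, here is a tiny sketch (my own toy example with made-up values, not code from this repo) of how, after the softmax, the ALiBi bias suppresses attention to the earliest tokens for the last query position:

```python
import torch

seq_len, m = 64, 0.5                       # m: per-head ALiBi slope (illustrative value)
scores = torch.zeros(seq_len)              # pretend the raw q·k scores are uniform
dist = torch.arange(seq_len - 1, -1, -1)   # distance from the last query to each key
attn = torch.softmax(scores - m * dist.float(), dim=-1)
print(attn[:4])    # earliest tokens: vanishingly small weights
print(attn[-4:])   # most recent tokens: almost all of the mass
```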

BitCalSaul commented 9 months ago

I would also like to seek some intuition for the "attention sink", and I greatly appreciate your insights. I'd like to share my humble opinion on the matter. I believe two primary factors are essential for maintaining a low perplexity (PPL) in large language models (LLMs):

1. Ensuring that the inference window size 'T' is not significantly larger than the trained window size 'L'.

2. Preserving the high-scoring tokens, denoted as 'T_high' (the initial tokens in StreamingLLM, or what I consider the main-feature tokens).

Allow me to provide some reasons for these considerations:

1. The need to be cautious arises due to a potential shift between the training dataset and the inference dataset. Such a shift may inadvertently increase the PPL (that’s for figure 1 (a)).

2. In the context of autoregressive LLMs, particularly within the deeper attention blocks, attention scores accumulate on 'T_high' for reasons that are not yet fully understood. This is further compounded by the autoregressive nature of LLMs and the high scores achieved by 'T_high'. Subsequently generated tokens, 'T_follow', can be seen as expansion terms derived from the KV values of 'T_high'. When 'T_high's KV values are omitted, reconstructing the next token 'T_next' becomes challenging, since the main-feature tokens are lost (that’s for figure 1 (b)).

There are two potential remedies:

a) Retain the main-feature tokens, 'T_high', for predicting 'T_next', which aligns with the approach employed by StreamingLLM (that’s for figure 1 (d)); see the small cache sketch after item (b).

b) Substitute 'T_high' with alternative tokens as main-feature tokens, such as 'T_(next-L)', positioned L tokens before 'T_next' (that’s for figure 1 (c)). This necessitates sequential attention to reconstruct 'T_(next-L+1)', 'T_(next-L+2)', and so forth, eventually leading to 'T_next'. Consequently, the time complexity becomes O(L(L+1)/2) = O(L^2), and for T tokens this translates to O(TL^2). Since all L tokens are based on 'T_(next-L)', and 'T_(next-L)' remains in the queue, the PPL can be effectively maintained at a low level.
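Here is a minimal sketch of remedy (a) as I understand it; the function name, tensor shapes, and default sizes below are my own illustration, not the repository's actual implementation:

```python
import torch

def evict_kv(keys, values, n_sink=4, window=1020):
    """Keep the first `n_sink` (attention-sink) tokens plus the `window` most
    recent tokens, and drop everything in between.
    keys/values: [batch, heads, seq_len, head_dim]"""
    seq_len = keys.size(2)
    if seq_len <= n_sink + window:
        return keys, values
    keep = torch.cat([
        torch.arange(n_sink),                      # initial / main-feature tokens ('T_high')
        torch.arange(seq_len - window, seq_len),   # recent tokens
    ])
    return keys[:, :, keep], values[:, :, keep]
```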

Guangxuan-Xiao commented 9 months ago

Hi,

Thank you for your kind words and thoughtful question on StreamingLLM.

To address your question: even though models trained with ALiBi position embeddings exhibit an exponential decay in attention scores for the initial tokens of long sequences, they still manifest the "attention sink" phenomenon. That is, the model still dumps attention scores onto these initial tokens despite the ALiBi bias diminishing their importance.

This observation underscores a key insight: the attention sink phenomenon isn't caused by any specific positional encoding method. Instead, it's rooted in the intrinsic nature of autoregressive language modeling and the use of SoftMax in the attention mechanism.
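You can observe this directly in essentially any autoregressive Transformer. The following is a rough, illustrative sketch (not the measurement code from our paper) that checks how much attention mass later queries in GPT-2 place on the very first token:

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "The quick brown fox jumps over the lazy dog. " * 20
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one [batch, heads, seq, seq] tensor per layer.
# Average attention that queries beyond position 16 assign to token 0
# in the last layer; this tends to be a disproportionately large share.
last = out.attentions[-1][0]
print(last[:, 16:, 0].mean().item())
```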

Interestingly, our group published a paper three years ago titled "SpAtten" (link). In that paper, we observed a similar attention sink phenomenon in the GPT-2 model, which uses learned absolute positional embeddings, but we didn't explore it further at the time :).

[Figure: SpAtten observations]

I hope this clarifies the mechanism behind the attention sink in models with ALiBi embeddings.

Guangxuan

BitCalSaul commented 9 months ago

@Guangxuan-Xiao this phenomenon of scores accumulating in the first few tokens is quite intriguing.

I attempted to draw a connection to the paper "Vision Transformers Need Registers" (ViT).

In the case of ViT, certain unimportant tokens are automatically repurposed as a repository for global information, which leads to their high norms. Removing these tokens would result in the loss of global information. However, if you were to remove the initial patch tokens at the beginning of the image, I believe it wouldn't have a significant impact. This makes for an interesting comparison with LLMs.

In the case of the LLMs in StreamingLLM, the initial tokens are automatically assigned as the primary features for future tokens. This phenomenon becomes more evident in the deeper attention blocks. My hypothesis is that deeper blocks capture more global information, making it harder to capture the local information needed for subsequent token predictions. Consequently, the LLM automatically designates certain tokens as a collection of main features, and the reason the first few tokens are chosen may lie in autoregression.

It appears that both of these papers share a common theme, where specific tokens are automatically assigned roles with distinct functions: for ViT, a repository for global information, and for LLMs, a collection of main features.