microsoft / MInference

To speed up long-context LLM inference, MInference computes attention with approximate, dynamic sparse methods, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
https://aka.ms/MInference
MIT License

[Question]: vertical slash pattern #21

Open SimJeg opened 2 weeks ago

SimJeg commented 2 weeks ago

Describe the issue

Hello,

Vertical lines in the attention matrix correspond to "heavy hitters", i.e. tokens that are attended to by every query.

I don't really get the intuition behind the off-diagonal (slash) lines in the attention matrix. Are they just an "artifact" of the periodicity of RoPE? What is your understanding of this phenomenon?
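
For concreteness, here is how I am thinking about the two pattern types — a minimal plain-PyTorch sketch (my own, not code from this repo) that scores vertical and slash lines from a dense attention matrix:

```python
import torch

def vertical_and_slash_scores(attn: torch.Tensor):
    """attn: (q_len, k_len) causal attention weights for a single head."""
    q_len, k_len = attn.shape
    # Vertical score: total attention each key column receives across all queries
    # ("heavy hitter" tokens show up as large entries here).
    vertical = attn.sum(dim=0)                                  # (k_len,)
    # Slash score: mean attention along each diagonal offset d = q_idx - k_idx
    # (an off-diagonal line shows up as a large entry at its offset).
    slash = torch.stack(
        [torch.diagonal(attn, offset=-d).mean() for d in range(q_len)]
    )                                                           # (q_len,)
    return vertical, slash

# Toy demo on random causal weights, just to show the bookkeeping.
attn = torch.rand(64, 64).tril()
attn = attn / attn.sum(dim=-1, keepdim=True)
vertical, slash = vertical_and_slash_scores(attn)
print(vertical.topk(4).indices, slash.topk(4).indices)
```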

iofu728 commented 2 weeks ago

Hi @SimJeg, thanks for your great question.

First, while RoPE is certainly related to the slash pattern, it is not the sole factor. The main reasons are as follows:

1) Even in BERT, a bidirectional encoder with absolute position encoding, prior work observes the emergence of such slash-shaped sparse attention patterns [1] (a minimal recipe for inspecting these maps yourself is sketched after this list).

Figure 1. The sparse pattern in BERT.

2) Our analysis of T5, an encoder-decoder model with learned relative position embeddings, also shows similar vertical and slash patterns.

Figure 2. The sparse pattern in T5 Encoder using FLAN-UL2.

3) Recent research [2] has found similar sparse patterns in MLLMs, including LLaVA and InternVLM.

Figure 3. The sparse pattern in MLLM Encoder.
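
As a rough sketch of how such maps can be inspected (not the exact script we used), with `bert-base-uncased` and Hugging Face Transformers:

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

text = "The quick brown fox jumps over the lazy dog. " * 20
inputs = tok(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions is one (batch, num_heads, seq_len, seq_len) tensor per layer.
layer, head = 6, 0                       # pick any layer/head to inspect
attn = out.attentions[layer][0, head]    # (seq_len, seq_len)
print(attn.shape)
# Visualizing `attn` (e.g. with plt.imshow) makes the vertical and slash
# lines visible in many heads, as in Figure 1.
```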

Regarding the intuition behind these patterns, we currently believe they act as a kind of information transmission channel learned from world knowledge. However, this hypothesis still requires further theoretical analysis for validation.
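
To make this a bit more concrete, here is a toy illustration (a plain-PyTorch sketch, not MInference's optimized kernels) of what the pattern name suggests: keep only a few vertical key columns plus a few slash diagonals and measure how much of the attention mass that sparse index set already covers.

```python
import torch

def vertical_slash_mask(attn: torch.Tensor, n_vertical: int = 8, n_slash: int = 8):
    """Boolean mask keeping the top vertical columns and top slash diagonals."""
    q_len, k_len = attn.shape
    mask = torch.zeros_like(attn, dtype=torch.bool)
    # Keep the key columns that receive the most total attention.
    top_cols = attn.sum(dim=0).topk(n_vertical).indices
    mask[:, top_cols] = True
    # Keep the diagonals (query-key offsets) with the highest mean attention.
    slash = torch.stack([torch.diagonal(attn, offset=-d).mean() for d in range(q_len)])
    for d in slash.topk(n_slash).indices.tolist():
        rows = torch.arange(d, q_len)
        mask[rows, rows - d] = True
    return mask

# On a real attention map, a handful of columns + diagonals usually covers most
# of the mass; random causal weights here only demonstrate the mechanics.
attn = torch.rand(256, 256).tril()
attn = attn / attn.sum(dim=-1, keepdim=True)
mask = vertical_slash_mask(attn)
covered = ((attn * mask).sum() / attn.sum()).item()
print(f"attention mass covered by the sparse index set: {covered:.2%}")
```

(In the actual implementation these indices are of course estimated online, without materializing the full attention matrix.)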

[1] SparseBERT: Rethinking the Importance Analysis in Self-Attention. ICML 2021.
[2] LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference. 2024.