Bhuvanesh09 opened this issue 8 months ago
The subsequent paragraph in the paper clarifies this:
For encoding like RoPE, we cache the Keys of tokens prior to introducing the rotary transformation. Then, we apply position transformation to the keys in the rolling cache at each decoding phase.
Code reference: https://github.com/mit-han-lab/streaming-llm/blob/main/streaming_llm/pos_shift/modify_llama.py#L103
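For later readers, here is a minimal single-head sketch of that mechanism (hypothetical helper names and shapes, not the actual code in modify_llama.py): keys are cached before the rotary rotation, the first few sink tokens are never evicted, and the rotation is re-applied at every decoding step using each key's position inside the cache rather than its absolute position in the text.

```python
import torch

def rotate_half(x):
    # Standard RoPE helper: swap and negate the two halves of the head dimension.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary(x, positions, head_dim, base=10000.0):
    # Build cos/sin for the given positions and rotate x (shape: [seq, head_dim]).
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = positions.float()[:, None] * inv_freq[None, :]
    emb = torch.cat((angles, angles), dim=-1)
    return x * emb.cos() + rotate_half(x) * emb.sin()

head_dim, n_sink, window = 64, 4, 8
key_cache = []  # pre-rotary keys: the first n_sink tokens plus a rolling window

for step in range(32):
    q_raw = torch.randn(head_dim)   # current query, before rotary
    k_raw = torch.randn(head_dim)   # current key, before rotary

    key_cache.append(k_raw)
    if len(key_cache) > n_sink + window:
        # Evict the oldest non-sink key; the sink tokens stay at the front.
        del key_cache[n_sink]

    # Positions are assigned by index *inside the cache* (0 .. cache_len-1),
    # so they stay dense after evictions; the rotary rotation is applied here,
    # at decoding time, to the raw cached keys.
    cache_len = len(key_cache)
    positions = torch.arange(cache_len)
    keys_rot = apply_rotary(torch.stack(key_cache), positions, head_dim)
    q_rot = apply_rotary(q_raw[None, :], positions[-1:], head_dim)[0]

    attn = torch.softmax(q_rot @ keys_rot.T / head_dim ** 0.5, dim=-1)
    # (values would be gathered with the same cache layout; omitted for brevity)
```

Because positions are indices into the cache, evicting tokens in the middle never leaves gaps in the rotary angles, which is what keeps the rolling cache consistent from step to step.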
@Guangxuan-Xiao, thanks a lot for the reply. Can't the same be done for Sliding Window with re-computation decoding, without attention sinks? At each step, we simply put the new key-value pair into the rolling buffer and reassign the positional encodings. In that case, the token at the (t-L)-th position would naturally act as the attention sink without needing a special token. This would allow Sliding Window with re-computation to run in O(TL) instead of O(TL^2).
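To make that proposal concrete, here is a rough sketch of the rolling-buffer bookkeeping being described (hypothetical names, plain Python, no model code): only the newest token is encoded each step, its pre-rotary key/value are appended, the oldest entry is evicted, and positions 0..L-1 are reassigned over whatever survives. Each step then touches O(L) cached entries, for O(TL) total, whereas re-encoding the whole window every step costs O(L^2) per step, i.e. O(TL^2) overall.

```python
from collections import deque

# Rolling buffer of *pre-rotary* keys/values; deque(maxlen=...) evicts the
# oldest entry automatically when a new one is appended.
window = 4
raw_keys = deque(maxlen=window)
raw_values = deque(maxlen=window)

def decode_step(new_raw_key, new_raw_value):
    # Only the current token is run through the model; its raw key/value
    # are appended and the oldest pair falls out of the buffer.
    raw_keys.append(new_raw_key)
    raw_values.append(new_raw_value)

    # Positions are reassigned relative to the buffer, so the oldest
    # surviving token always sits at position 0 (the de-facto "sink"
    # in this proposal) and the newest at position len(buffer) - 1.
    positions = list(range(len(raw_keys)))

    # The rotary rotation would then be applied to the buffered keys with
    # these positions before computing attention (as in the sketch above).
    return positions

for t in range(10):
    pos = decode_step(f"k{t}", f"v{t}")   # dummy stand-ins for real tensors
```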
Can you elaborate on the process you describe or show me a piece of code?
@Guangxuan-Xiao Hi Guangxuan, how should we handle the position embeddings for the sink token + sliding window setup during training?
From Section 3.2 in the paper:
How is it possible to change the positional embeddings of the cached tokens when the forward pass is run for the next iteration?