Bhuvanesh09 opened this issue 8 months ago
The subsequent paragraph in the paper clarifies this:
For encoding like RoPE, we cache the Keys of tokens prior to introducing the rotary transformation. Then, we apply position transformation to the keys in the rolling cache at each decoding phase.
Code reference: https://github.com/mit-han-lab/streaming-llm/blob/main/streaming_llm/pos_shift/modify_llama.py#L103
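For later readers, here is a minimal single-head sketch of that mechanism (hypothetical helper names and shapes, not the actual code in modify_llama.py): keys are cached before the rotary rotation, the first few sink tokens are never evicted, and the rotation is re-applied at every decoding step using each key's position inside the cache rather than its absolute position in the text.

```python
import torch

def rotate_half(x):
    # Standard RoPE helper: swap and negate the two halves of the head dimension.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary(x, positions, head_dim, base=10000.0):
    # Build cos/sin for the given positions and rotate x (shape: [seq, head_dim]).
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = positions.float()[:, None] * inv_freq[None, :]
    emb = torch.cat((angles, angles), dim=-1)
    return x * emb.cos() + rotate_half(x) * emb.sin()

head_dim, n_sink, window = 64, 4, 8
key_cache = []  # pre-rotary keys: the first n_sink tokens plus a rolling window

for step in range(32):
    q_raw = torch.randn(head_dim)   # current query, before rotary
    k_raw = torch.randn(head_dim)   # current key, before rotary

    key_cache.append(k_raw)
    if len(key_cache) > n_sink + window:
        # Evict the oldest non-sink key; the sink tokens stay at the front.
        del key_cache[n_sink]

    # Positions are assigned by index *inside the cache* (0 .. cache_len-1),
    # so they stay dense after evictions; the rotary rotation is applied here,
    # at decoding time, to the raw cached keys.
    cache_len = len(key_cache)
    positions = torch.arange(cache_len)
    keys_rot = apply_rotary(torch.stack(key_cache), positions, head_dim)
    q_rot = apply_rotary(q_raw[None, :], positions[-1:], head_dim)[0]

    attn = torch.softmax(q_rot @ keys_rot.T / head_dim ** 0.5, dim=-1)
    # (values would be gathered with the same cache layout; omitted for brevity)
```

Because positions are indices into the cache, evicting tokens in the middle never leaves gaps in the rotary angles, which is what keeps the rolling cache consistent from step to step.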
@Guangxuan-Xiao, thanks a lot for the reply. Can't the same be done for Sliding Window with re-computation decoding, without attention sinks? At each step, we simply put the new key-value pair into the rolling buffer and reassign the positional encodings. In that case, the token at the (t-L)-th position would naturally act as the attention sink without needing a special token. This would allow Sliding Window with re-computation to run in O(TL) instead of O(TL^2).
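To make that proposal concrete, here is a rough sketch of the rolling-buffer bookkeeping being described (hypothetical names, plain Python, no model code): only the newest token is encoded each step, its pre-rotary key/value are appended, the oldest entry is evicted, and positions 0..L-1 are reassigned over whatever survives. Each step then touches O(L) cached entries, for O(TL) total, whereas re-encoding the whole window every step costs O(L^2) per step, i.e. O(TL^2) overall.

```python
from collections import deque

# Rolling buffer of *pre-rotary* keys/values; deque(maxlen=...) evicts the
# oldest entry automatically when a new one is appended.
window = 4
raw_keys = deque(maxlen=window)
raw_values = deque(maxlen=window)

def decode_step(new_raw_key, new_raw_value):
    # Only the current token is run through the model; its raw key/value
    # are appended and the oldest pair falls out of the buffer.
    raw_keys.append(new_raw_key)
    raw_values.append(new_raw_value)

    # Positions are reassigned relative to the buffer, so the oldest
    # surviving token always sits at position 0 (the de-facto "sink"
    # in this proposal) and the newest at position len(buffer) - 1.
    positions = list(range(len(raw_keys)))

    # The rotary rotation would then be applied to the buffered keys with
    # these positions before computing attention (as in the sketch above).
    return positions

for t in range(10):
    pos = decode_step(f"k{t}", f"v{t}")   # dummy stand-ins for real tensors
```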
Can you elaborate on the process you describe or show me a piece of code?
@Guangxuan-Xiao Hi Guangxuan, how should we handle the position embeddings for the sink token + sliding window setup during training?
From Section 3.2 in the paper:
How is it possible to change the positional embeddings of the cached tokens when the forward pass is run for the next iteration?