mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License

Why does recompute differ from window attention? #88

Open habaohaba opened 1 week ago

habaohaba commented 1 week ago

I think recomputation just gives the same KV states that would have been saved when using window attention. So what is the difference between recomputation and the cached version of the sliding window? Or is it that, no matter which position embedding we use, the LLM just learns to assign a large attention value to the first position?

darth-c0d3r commented 22 hours ago

Let's say the total input is 1024 tokens and the KV cache size is 512. When generating the next token, recomputation drops the initial 512 tokens entirely: the representations are rebuilt from scratch over the most recent 512 tokens, as if they were the whole input. With a cached sliding window, by contrast, the surviving entries still depend on the evicted tokens. Concretely: the first 512 cached representations are trivial. When the 513th token arrives, the 1st token's KV entry is evicted, but the remaining 511 cached representations were originally computed with the 1st token in context, so its influence persists. The same applies to every subsequent token. Hope it's clear.
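
To make this concrete, here is a minimal sketch (toy code, not the repo's implementation) using a single-head causal attention layer in PyTorch. All names and dimensions (`encode`, `attend`, `T`, `W`, `d`) are made up for illustration, and position embeddings are omitted. The point it shows: with caching, the KV entries that survive eviction were produced from hidden states that attended over the now-evicted tokens; with recomputation, those tokens never existed.

```python
# Minimal sketch, assuming a toy single-head attention; names and dims are
# illustrative, not from the streaming-llm codebase.
import torch

torch.manual_seed(0)
T, W, d = 8, 4, 16                       # total tokens, window size, head dim
x = torch.randn(T, d)                    # toy token embeddings
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def attend(q, K, V):
    # Standard scaled dot-product attention for a single query vector.
    scores = (q @ K.T) / d ** 0.5
    return torch.softmax(scores, dim=-1) @ V

def encode(states):
    # One causal self-attention layer: token t attends to tokens 0..t.
    K, V = states @ Wk, states @ Wv
    return torch.stack([
        attend(states[t] @ Wq, K[: t + 1], V[: t + 1])
        for t in range(len(states))
    ])

# Cached sliding window: hidden states were built with the FULL prefix in
# context; eviction only removes the oldest KV entries afterwards.
h_full = encode(x)
K_c, V_c = h_full @ Wk, h_full @ Wv      # KV cache for a (toy) next layer
out_cached = attend(h_full[-1] @ Wq, K_c[T - W:], V_c[T - W:])

# Recompute: the evicted tokens never existed; everything is rebuilt from
# scratch over the last W tokens only.
h_win = encode(x[T - W:])
K_r, V_r = h_win @ Wk, h_win @ Wv
out_recomputed = attend(h_win[-1] @ Wq, K_r, V_r)

# The surviving cached states still carry information from x[:T-W], so the
# two strategies disagree.
print(torch.allclose(out_cached, out_recomputed))  # False
```

In a real multi-layer LLM the same effect compounds across layers, which is why the cached sliding window and recomputation can behave quite differently even over the same window of tokens.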