mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License

Why does recompute differ from window attention? #88

Open habaohaba opened 1 week ago

habaohaba commented 1 week ago

I think recomputation just gives the same KV states that would have been saved when using window attention. So what is the difference between recomputation and the cached version of the sliding window? Or is it that, no matter which position embedding we use, the LLM just learns to assign a large attention value to the first position?

darth-c0d3r commented 22 hours ago

Let's say the total input is 1024 tokens and the KV cache size is 512. When generating the next token, recomputation drops the initial 512 tokens entirely: the representations are rebuilt from scratch over the most recent 512 tokens, as if they were the whole input. With a cached sliding window, by contrast, the surviving entries still depend on the evicted tokens. Concretely: the first 512 cached representations are trivial. When the 513th token arrives, the 1st token's KV entry is evicted, but the remaining 511 cached representations were originally computed with the 1st token in context, so its influence persists. The same applies to every subsequent token. Hope it's clear.
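
To make this concrete, here is a minimal sketch (toy code, not the repo's implementation) using a single-head causal attention layer in PyTorch. All names and dimensions (`encode`, `attend`, `T`, `W`, `d`) are made up for illustration, and position embeddings are omitted. The point it shows: with caching, the KV entries that survive eviction were produced from hidden states that attended over the now-evicted tokens; with recomputation, those tokens never existed.

```python
# Minimal sketch, assuming a toy single-head attention; names and dims are
# illustrative, not from the streaming-llm codebase.
import torch

torch.manual_seed(0)
T, W, d = 8, 4, 16                       # total tokens, window size, head dim
x = torch.randn(T, d)                    # toy token embeddings
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def attend(q, K, V):
    # Standard scaled dot-product attention for a single query vector.
    scores = (q @ K.T) / d ** 0.5
    return torch.softmax(scores, dim=-1) @ V

def encode(states):
    # One causal self-attention layer: token t attends to tokens 0..t.
    K, V = states @ Wk, states @ Wv
    return torch.stack([
        attend(states[t] @ Wq, K[: t + 1], V[: t + 1])
        for t in range(len(states))
    ])

# Cached sliding window: hidden states were built with the FULL prefix in
# context; eviction only removes the oldest KV entries afterwards.
h_full = encode(x)
K_c, V_c = h_full @ Wk, h_full @ Wv      # KV cache for a (toy) next layer
out_cached = attend(h_full[-1] @ Wq, K_c[T - W:], V_c[T - W:])

# Recompute: the evicted tokens never existed; everything is rebuilt from
# scratch over the last W tokens only.
h_win = encode(x[T - W:])
K_r, V_r = h_win @ Wk, h_win @ Wv
out_recomputed = attend(h_win[-1] @ Wq, K_r, V_r)

# The surviving cached states still carry information from x[:T-W], so the
# two strategies disagree.
print(torch.allclose(out_cached, out_recomputed))  # False
```

In a real multi-layer LLM the same effect compounds across layers, which is why the cached sliding window and recomputation can behave quite differently even over the same window of tokens.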