mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License

Question about positional encoding when applying ROLLING KV CACHE WITH ATTENTION SINKS #73

Closed bugm closed 6 months ago

bugm commented 6 months ago

Hi, thanks for your nice work! As described in the paper, StreamingLLM changes the positions assigned to the tokens (except the attention-sink ones) so that the model keeps operating efficiently even beyond its pre-training attention window size. Normally, once the positional encoding changes, we could not reuse the previous KV cache (except the KV cache in the first layer), because the KV states from the 2nd layer to the last layer were computed under the previous positional encoding. Recomputing them at every step would make the complexity O(TL^2). What do you think of this mismatch? Will it influence the performance? If so, maybe we could apply some mechanism to improve it.
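
For concreteness, here is a small, hypothetical sketch of my understanding of how positions get re-assigned inside the rolling cache (the `num_sinks` / `window` values are just examples, not the repo's defaults):

```python
# Illustrative sketch only: with a rolling KV cache of `num_sinks`
# attention-sink tokens plus `window` recent tokens, positions are assigned
# by each token's slot inside the cache, not by its index in the original
# text, so the largest position never exceeds the cache size.

def cache_position_ids(num_seen: int, num_sinks: int = 4, window: int = 1020):
    """Position ids for the tokens currently held in the cache."""
    cache_len = min(num_seen, num_sinks + window)
    # Positions are simply 0 .. cache_len-1: sinks keep 0..num_sinks-1,
    # and the recent tokens are renumbered contiguously after them.
    return list(range(cache_len))

# After 5000 tokens have been streamed, the cache still holds only 1024
# entries and the maximum position id stays at 1023, inside the
# pre-training window.
print(max(cache_position_ids(5000)))  # 1023
```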

Guangxuan-Xiao commented 6 months ago

Hi,

Thanks for your query. In our implementation, we cache the KV states before the positional encoding is applied and re-apply the positional encoding during generation, based on each token's position within the cache. This approach prevents the complexity from increasing to O(TL^2). Please see our method here: Modify LLaMA.
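
For intuition, here is a minimal PyTorch sketch of this idea (not the actual modify_llama code; the `apply_rope` helper and the tensor shapes are simplified stand-ins):

```python
# Sketch: keys are cached *before* rotary embedding, and RoPE is applied at
# attention time using each key's current slot in the cache. Eviction then
# only changes the position ids used at attention time; the cached
# (pre-RoPE) keys stay valid and nothing is recomputed in earlier layers.
import torch

def apply_rope(x, position_ids, base=10000.0):
    # Standard (interleaved) rotary embedding on x of shape (batch, seq, head_dim).
    head_dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = position_ids[:, :, None].float() * inv_freq        # (batch, seq, head_dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def attend_with_pre_rope_cache(q_new, k_cache_pre_rope, v_cache):
    # k_cache_pre_rope holds un-rotated keys; after any eviction, rotate them
    # with their *current* in-cache positions.
    batch, cache_len, _ = k_cache_pre_rope.shape
    pos = torch.arange(cache_len).expand(batch, cache_len)
    k = apply_rope(k_cache_pre_rope, pos)
    # The newest query sits at the last cache slot.
    q = apply_rope(q_new, pos[:, -1:].clone())
    attn = torch.softmax(q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v_cache
```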

Best, Guangxuan