mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License

Question about positional encoding when applying ROLLING KV CACHE WITH ATTENTION SINKS #73

Closed bugm closed 6 months ago

bugm commented 6 months ago

Hi, thanks for your nice work! As described in the paper, StreamingLLM changes the positions assigned to the tokens (except the attention-sink ones) so that the model keeps operating efficiently even beyond its pre-training attention window size. Normally, once the positional encoding changes, we could not reuse the previous KV cache (except the KV cache in the first layer), because the KV states from the 2nd layer to the last layer were computed under the previous positional encoding. Recomputing them at every step would make the complexity O(TL^2). What do you think of this mismatch? Will it influence the performance? If so, maybe we could apply some mechanism to improve it.
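
For concreteness, here is a small, hypothetical sketch of my understanding of how positions get re-assigned inside the rolling cache (the `num_sinks` / `window` values are just examples, not the repo's defaults):

```python
# Illustrative sketch only: with a rolling KV cache of `num_sinks`
# attention-sink tokens plus `window` recent tokens, positions are assigned
# by each token's slot inside the cache, not by its index in the original
# text, so the largest position never exceeds the cache size.

def cache_position_ids(num_seen: int, num_sinks: int = 4, window: int = 1020):
    """Position ids for the tokens currently held in the cache."""
    cache_len = min(num_seen, num_sinks + window)
    # Positions are simply 0 .. cache_len-1: sinks keep 0..num_sinks-1,
    # and the recent tokens are renumbered contiguously after them.
    return list(range(cache_len))

# After 5000 tokens have been streamed, the cache still holds only 1024
# entries and the maximum position id stays at 1023, inside the
# pre-training window.
print(max(cache_position_ids(5000)))  # 1023
```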

Guangxuan-Xiao commented 6 months ago

Hi,

Thanks for your query. In our implementation, we cache the KV states before the positional encoding is applied and re-apply the positional encoding during generation, based on each token's position within the cache. This approach prevents the complexity from increasing to O(TL^2). Please see our method here: Modify LLaMA.
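
For intuition, here is a minimal PyTorch sketch of this idea (not the actual modify_llama code; the `apply_rope` helper and the tensor shapes are simplified stand-ins):

```python
# Sketch: keys are cached *before* rotary embedding, and RoPE is applied at
# attention time using each key's current slot in the cache. Eviction then
# only changes the position ids used at attention time; the cached
# (pre-RoPE) keys stay valid and nothing is recomputed in earlier layers.
import torch

def apply_rope(x, position_ids, base=10000.0):
    # Standard (interleaved) rotary embedding on x of shape (batch, seq, head_dim).
    head_dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = position_ids[:, :, None].float() * inv_freq        # (batch, seq, head_dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def attend_with_pre_rope_cache(q_new, k_cache_pre_rope, v_cache):
    # k_cache_pre_rope holds un-rotated keys; after any eviction, rotate them
    # with their *current* in-cache positions.
    batch, cache_len, _ = k_cache_pre_rope.shape
    pos = torch.arange(cache_len).expand(batch, cache_len)
    k = apply_rope(k_cache_pre_rope, pos)
    # The newest query sits at the last cache slot.
    q = apply_rope(q_new, pos[:, -1:].clone())
    attn = torch.softmax(q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v_cache
```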

Best, Guangxuan