mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License

Questions on "streaming-llm" Paper #12

Closed llsj14 closed 1 year ago

llsj14 commented 1 year ago

Firstly, I'd like to express my appreciation for your insightful paper and the open-source 'streaming-llm'. Your approach and experiments are truly commendable. I hope you don't mind me asking; I would really appreciate any hints you could give on the questions below.

  1. As mentioned in your paper, the attention scores on the initial tokens seem crucial. I was under the impression that Dense Attention might be more effective than Window Attention, since Dense Attention always retains the KV values of the initial tokens. Is it that Window Attention performs better because Dense Attention struggles with significantly longer input lengths, which hurts length extrapolation?

  2. Regarding Figure 1-(b), the PPL of Window Attention appears to be the lowest, yet it is marked with an 'X'. Should the PPL of Window Attention actually be larger, or should it not have the 'X' label?

I appreciate any clarifications you can provide.

tomaarsen commented 1 year ago

Hello!

I'm not affiliated with this paper, but I have done my fair share of experiments around this work.

  1. As mentioned in your paper, the attention scores on the initial tokens seem crucial. I was under the impression that Dense Attention might be more effective than Window Attention, since Dense Attention always retains the KV values of the initial tokens. Is it that Window Attention performs better because Dense Attention struggles with significantly longer input lengths, which hurts length extrapolation?

My belief is yes. There are two primary issues with Dense Attention:

  1. It's not scalable: the KV cache grows linearly with the sequence length (and the attention computation grows quadratically), making it unfit for handling very long streams of tokens.
  2. (related to your question) the position IDs of tokens in StreamingLLM are "shifted". To give a toy example with 4 attention sink tokens, a window size of 6, and a text that is just the space-separated alphabet, the model sees:
    A
    A B
    A B C
    A B C D
    A B C D E
    A B C D E F
    A B C D E F G
    A B C D E F G H 
    A B C D E F G H I
    A B C D E F G H I J
    A B C D F G H I J K
    A B C D G H I J K L
    A B C D H I J K L M
    ...

    With these position IDs:

    0
    0 1
    0 1 2
    0 1 2 3
    0 1 2 3 4
    0 1 2 3 4 5
    0 1 2 3 4 5 6
    0 1 2 3 4 5 6 7
    0 1 2 3 4 5 6 7 8
    0 1 2 3 4 5 6 7 8 9
    0 1 2 3 4 5 6 7 8 9
    0 1 2 3 4 5 6 7 8 9
    0 1 2 3 4 5 6 7 8 9
    ...

    i.e. the position IDs are assigned by position within the cache (they stay 0-9) rather than shifting along with the position in the original text as the window moves; the short sketch at the end of this answer reproduces this behavior.

Or from the paper itself (Section 3.2, page 5):

When determining the relative distance and adding positional information to tokens, StreamingLLM focuses on positions within the cache rather than those in the original text. This distinction is crucial for StreamingLLM’s performance. For instance, if the current cache has tokens [0, 1, 2, 3, 6, 7, 8] and is in the process of decoding the 9th token, the positions assigned are [0, 1, 2, 3, 4, 5, 6, 7], rather than the positions in the original text, which would be [0, 1, 2, 3, 6, 7, 8, 9].

As a direct result of this, the crucial initial tokens are always "closer" to the "current" tokens, allowing these initial tokens to have a larger effect on the current tokens. I think this is why the Dense Attention models fail at larger sequence lengths (but this is purely my hypothesis; an author may correct me).
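To make this concrete, here is a minimal, self-contained sketch (not the repository's actual implementation; the function name `streaming_cache_demo` is hypothetical, and n_sinks=4 and window=6 are simply the values from the toy example above) that reproduces the token windows and position IDs shown earlier:

```python
# Minimal sketch (not the repo's code): simulate which tokens survive in the
# KV cache and which position IDs they receive, matching the toy example above.

def streaming_cache_demo(tokens, n_sinks=4, window=6):
    cache = []
    for tok in tokens:
        cache.append(tok)
        if len(cache) > n_sinks + window:
            # Evict the oldest non-sink token; the first n_sinks tokens are never evicted.
            del cache[n_sinks]
        # Position IDs are assigned by position *within the cache*,
        # not by position in the original text (Section 3.2 of the paper).
        position_ids = list(range(len(cache)))
        print(" ".join(cache), "->", position_ids)

streaming_cache_demo(list("ABCDEFGHIJKLM"))
# Last line printed: "A B C D H I J K L M -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]"
```

Note that the cache never holds more than n_sinks + window entries (which is also why the memory issue from point 1 goes away), and the position IDs always run from 0 to len(cache) - 1, so the sink tokens stay at a small relative distance from the newest token.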

  2. Regarding Figure 1-(b), the PPL of Window Attention appears to be the lowest, yet it is marked with an 'X'. Should the PPL of Window Attention actually be larger, or should it not have the 'X' label?

Perhaps I'm looking at the wrong figure, but Figure 1 (b) shows a really high PPL of 5158 for me. This is also corroborated in Figure 3, where the PPL for Window Attention is always larger than the PPL for StreamingLLM. Could you elaborate here?

llsj14 commented 1 year ago

@tomaarsen Thank you for your comments! Your examples and experimental results have provided me with a better understanding.

  1. Yes, I now understand that Dense Attention has the limitations you pointed out, such as memory usage and how position IDs are handled for the positional encodings.

  2. I realized that I had misread the value as 5.158, but it is actually 5158, which is indeed large.