mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License

Question about long input and difference between streaming-llm and dense attention. #41

Closed hxs91 closed 9 months ago

hxs91 commented 9 months ago

Thank you for your nice work. I have read issue #33, and I appreciate your patient explanation of the difference between StreamingLLM and dense attention. Based on your answer, I have further questions:

  1. As you mentioned in FAQ 3, I guess streaming-llm processes long input by truncation. But if I don't mind the expensive computation and handle the long input in a streaming manner (re-computation from the very beginning of the text to the end, window by window, with attention sinks), will it theoretically perform better than truncation? Did you run any experiments on this? (I sketch what I mean at the end of this comment.)

  2. For a model with a large context size (say 16k), we can still run streaming-llm with a shorter cache size (say 4k). Have you compared dense attention at 16k with streaming-llm at 4k? Theoretically, streaming-4k should perform worse than dense-16k, but how large is the gap? This is important if one wants to use streaming-llm to approximate the performance of a larger window size.
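
To make sure we are talking about the same procedure in question 1, here is a rough sketch of the window-by-window re-computation I have in mind. It assumes a Hugging Face-style causal LM; the function and argument names are just illustrative, not code from this repo.

```python
import torch

@torch.no_grad()
def sliding_window_recompute_ppl(model, input_ids, window=4096):
    # For each position t, re-encode only the previous `window` tokens from
    # scratch (no KV cache reuse) and score token t from the final logits.
    # Attention cost is O(T * window^2), hence the expensive runtime.
    nlls = []
    for t in range(1, input_ids.size(1)):
        ctx = input_ids[:, max(0, t - window):t]      # trailing window only
        logits = model(input_ids=ctx).logits          # [1, len(ctx), vocab]
        logprob = torch.log_softmax(logits[0, -1], dim=-1)[input_ids[0, t]]
        nlls.append(-logprob)
    return torch.stack(nlls).mean().exp()             # perplexity over the stream
```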

Guangxuan-Xiao commented 9 months ago

Hello,

Thank you for your thoughtful questions. Let's delve into the details:

  1. Regarding streaming-llm processing with truncation vs. re-computation:

    • In our paper, we have results that touch on this topic. The baseline you're referring to is the "sliding window with re-computation." If you refer to Figure 3 of our paper, you'll see that StreamingLLM's perplexity is in line with this baseline. So, in essence, StreamingLLM performs comparably when handling long inputs in the manner you described.
  2. Comparison between dense attention at 16k and StreamingLLM at 4k:

    • Firstly, dense attention is designed to function within its pre-training range. So, for the example you provided, it operates within the 0-16K range.
    • Within the [0, 4k] range, dense attention and StreamingLLM have equivalent perplexity scores.
    • For the range [4K, 16K], dense attention is likely to outperform StreamingLLM because it retains information from previous tokens and therefore has a broader context. The gap should be most apparent when the relevant information has already been evicted from StreamingLLM's window (see the sketch after this list for the cache policy that causes the eviction).
    • Beyond the 16K mark (i.e., [16K, ∞]), the dense attention model won't be operational, whereas StreamingLLM will continue to function.
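
To make the eviction concrete, here is a minimal sketch of the sink-plus-recent-window cache policy, assuming Hugging Face-style `past_key_values` (a list of per-layer (key, value) tensors). The names and default sizes are illustrative, not the exact code in this repository.

```python
import torch

def evict_kv_cache(past_key_values, num_sinks=4, window=4092):
    # Keep the first `num_sinks` tokens (attention sinks) plus the most recent
    # `window` tokens in every layer's KV cache and drop the middle. With a
    # ~4k total budget, any token older than ~4k positions is gone, which is
    # exactly the information a dense-16k model can still attend to in [4K, 16K].
    new_past = []
    for k, v in past_key_values:          # k, v: [batch, heads, seq_len, head_dim]
        seq_len = k.size(2)
        if seq_len <= num_sinks + window:
            new_past.append((k, v))       # under budget, nothing to evict yet
            continue
        keep = torch.cat([torch.arange(num_sinks, device=k.device),
                          torch.arange(seq_len - window, seq_len, device=k.device)])
        new_past.append((k[:, :, keep], v[:, :, keep]))
    return new_past
```

Once a token is evicted it cannot influence later predictions at all, which is why the [4K, 16K] comparison above favors dense attention whenever the needed evidence falls outside the retained window.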

I hope this helps answer your questions.

Guangxuan

hxs91 commented 9 months ago

@Guangxuan-Xiao Got it, thank you for the answer. BTW, it would be great to see some quantitative results on the gap in question 2. :)