mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License

Question about evaluation results and demo #39

Closed hsm1997 closed 9 months ago

hsm1997 commented 9 months ago
  1. I found the concept of "window attention" confusing. In Figure 1 there are two types of window attention: (b) naive window attention and (c) recompute window attention. Figure 3 shows that (c) recompute window attention behaves close to StreamingLLM on perplexity, but Table 1 says that "window" attention has poor perplexity, so I guess Table 1 uses (b) naive window attention? Table 5 also says that "window" attention fails on the ARC benchmark, so I guess that is (b) naive window attention as well? Then Figure 10 says the speedup is benchmarked against (c) recompute window attention. Could you benchmark ALL results with BOTH "window attention" methods to make the comparison fair? Or did I miss anything?
  2. Looking at your demo video and https://github.com/mit-han-lab/streaming-llm/blob/main/examples/run_streaming_llama.py , I don't quite understand why the model generates erroneous tokens (when "model performance breaks") if streaming is not enabled. Since the prompts are actually processed by the model one by one (#L63), I would expect the model to either run out of memory (OOM) or keep generating good tokens. Where do the erroneous tokens come from?
  3. What is the exact pipeline of the ARC evaluation (Table 5)? Does the model "process q1 -> generate a1 with the evicted past cache of q1 -> process q2 with the evicted past cache of q1 and a1 -> generate a2 with the evicted past cache of q1, a1, and q2 -> ..." (which is what run_streaming_llama.py does), or "process [q1, q2, q3, ..., qn] -> generate [a1, a2, a3, ..., an]"?

Thanks in advance!

Guangxuan-Xiao commented 9 months ago

Thank you for taking a closer look at our paper and for raising these questions. Let me address them in detail:

  1. Window Attention Clarification: There are not two types of window attention ("b: naive window attention" and "c: recompute window attention"). The concept you refer to as "c: recompute window attention" is actually the sliding window with re-computation, which is not window attention. For an in-depth explanation of how the sliding window with re-computation works, please see this issue. Essentially, the sliding window with re-computation is a form of dense attention applied to truncated text. Both Table 1 and Table 5 employ window attention, while Figure 10 specifically uses the sliding window with re-computation. These distinctions are stated explicitly in the respective captions (a sketch contrasting the two baselines appears after this list).

  2. Streaming and Token Generation: While prompts are indeed processed individually, when streaming is disabled, all prior inputs are retained in the KV cache. Consequently, once the length of the cached text surpasses the chunk size the model was pre-trained with, the model's performance degrades, leading to the generation of erroneous tokens (the second sketch after this list illustrates the eviction that enabling streaming applies instead).

  3. Evaluation Pipeline: The exact evaluation sequence follows the pattern [q1, a1, q2, a2, ...], i.e., each question is processed and each answer is generated on top of the evicted cache accumulated from the previous rounds (see the third sketch below).
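
To make the distinction in point 1 concrete, here is a minimal sketch (not the repository's code) of how the two baselines differ, assuming a Hugging Face-style decoder that accepts and returns the legacy tuple-format `past_key_values` with layout `[batch, heads, seq_len, head_dim]`; `window_size` is an illustrative parameter:

```python
import torch

@torch.no_grad()
def window_attention_step(model, new_token_ids, past_key_values, window_size):
    # Window attention: reuse the KV cache across steps, but keep only the
    # most recent `window_size` entries. Evicted positions are never re-encoded.
    out = model(input_ids=new_token_ids,
                past_key_values=past_key_values,
                use_cache=True)
    past = tuple(
        (k[:, :, -window_size:, :], v[:, :, -window_size:, :])
        for k, v in out.past_key_values
    )
    return out.logits, past

@torch.no_grad()
def sliding_window_recompute_step(model, all_token_ids, window_size):
    # Sliding window with re-computation: dense attention over truncated text.
    # The last `window_size` tokens are re-encoded from scratch at every step,
    # which preserves quality but costs O(window_size^2) per generated token.
    context = all_token_ids[:, -window_size:]
    out = model(input_ids=context, use_cache=False)
    return out.logits
```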
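
For point 2, a rough sketch of the kind of eviction that enabling streaming applies: a few attention-sink positions are kept alongside a recent window, so the cache never grows past the pre-training length. This is not the repository's exact implementation (which also re-assigns positions within the cache); `n_sink` and `n_recent` are illustrative values:

```python
import torch

def evict_kv_cache(past_key_values, n_sink=4, n_recent=2000):
    # Keep the first `n_sink` (attention sink) positions plus the most recent
    # `n_recent` positions in every layer's cache; drop everything in between.
    # Assumes the legacy tuple layout [batch, heads, seq_len, head_dim].
    if past_key_values is None:
        return None
    seq_len = past_key_values[0][0].shape[2]
    if seq_len <= n_sink + n_recent:
        return past_key_values  # still within budget, nothing to evict
    return tuple(
        (
            torch.cat([k[:, :, :n_sink, :], k[:, :, -n_recent:, :]], dim=2),
            torch.cat([v[:, :, :n_sink, :], v[:, :, -n_recent:, :]], dim=2),
        )
        for k, v in past_key_values
    )
```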
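
And for point 3, a hedged sketch of the multi-round [q1, a1, q2, a2, ...] pattern, reusing `evict_kv_cache` from the sketch above; the greedy decoding loop and the `max_new_tokens` cap are purely illustrative, not the repository's evaluation code:

```python
import torch

@torch.no_grad()
def streaming_eval(model, tokenizer, questions, n_sink=4, n_recent=2000,
                   max_new_tokens=256):
    past_key_values = None
    answers = []
    for q in questions:
        # Process q_t on top of the evicted cache of [q1, a1, ..., a_{t-1}].
        q_ids = tokenizer(q, return_tensors="pt").input_ids
        out = model(input_ids=q_ids, past_key_values=past_key_values,
                    use_cache=True)
        past_key_values = evict_kv_cache(out.past_key_values, n_sink, n_recent)
        next_id = out.logits[:, -1:].argmax(dim=-1)

        # Generate a_t token by token, evicting as the cache grows.
        generated = []
        for _ in range(max_new_tokens):
            generated.append(next_id.item())
            out = model(input_ids=next_id, past_key_values=past_key_values,
                        use_cache=True)
            past_key_values = evict_kv_cache(out.past_key_values, n_sink, n_recent)
            next_id = out.logits[:, -1:].argmax(dim=-1)
            if next_id.item() == tokenizer.eos_token_id:
                break
        answers.append(tokenizer.decode(generated))
    return answers
```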

I hope these clarifications help answer your questions. Feel free to let me know if there's anything else I can help with.

Guangxuan

hsm1997 commented 9 months ago

Thanks again for your reply! It really helped me understand better.