Thank you for taking a closer look at our paper and for raising these questions. Let me address them in detail:
Window Attention Clarification: There are not two types of window attention ("b-naive-window-attention" and "c-recompute-window-attention"). The concept you referred to as "c-recompute-window-attention" is actually the "sliding window with re-computation", which is not window attention. For an in-depth explanation of how the sliding window with re-computation works, please look at this issue. Essentially, the sliding window with re-computation is a form of dense attention applied to truncated text. Both Table 1 and Table 5 employ window attention, while Figure 10 specifically uses the sliding window with re-computation. These distinctions are mentioned explicitly in the respective captions.
Streaming and Token Generation: While prompts are indeed processed individually, when streaming is disabled, all prior inputs are retained. Consequently, if the length of cached text surpasses the model's pre-trained chunk size, there's a degradation in model performance, leading to the generation of erroneous tokens.
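For intuition, here is a hedged sketch of that failure mode using the standard Hugging Face interface; the checkpoint name and prompts are placeholders, and the real demo would also append the generated answers to the cache:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-chat-hf"                    # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)

past, cache_len = None, 0
for prompt in ["q1 ...", "q2 ...", "q3 ..."]:             # multi-round inputs
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model(ids, past_key_values=past, use_cache=True)
    past = out.past_key_values                            # streaming disabled: nothing is evicted
    cache_len += ids.shape[1]
    if cache_len > model.config.max_position_embeddings:  # past the pre-trained length
        print("KV cache exceeds the pre-training window; expect degraded, erroneous tokens")
```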
Evaluation Pipeline: The exact evaluation sequence follows the pattern: [q1, a1, q2, a2, ...].
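In other words, questions and answers are interleaved into one continuous stream, roughly like this (names are illustrative only):

```python
def build_eval_stream(samples):
    """samples: list of (question, answer) pairs; returns the interleaved
    sequence that is fed to the model sample after sample."""
    stream = []
    for q, a in samples:
        stream += [q, a]
    return stream

print(build_eval_stream([("q1", "a1"), ("q2", "a2")]))  # ['q1', 'a1', 'q2', 'a2']
```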
I hope these clarifications help answer your questions. Feel free to let me know if there's anything else I can help with.
Guangxuan
Thanks again for your reply! It really helped me understand better.
c-recompute-window-attention behaves close to streaming-llm on ppl, but Table 1 says that "window" attention has poor performance on ppl, so I guess Table 1 uses b-naive-window-attention? And Table 5 says that "window" attention fails on the ARC benchmark, so I guess this is also b-naive-window-attention? Then Figure 10 says that the speedup is benchmarked with c-recompute-window-attention. Could you benchmark ALL results with BOTH "window-attention" methods to make the comparison fair? Or did I miss anything? Thanks in advance!