mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License

Some questions about the paper #47

Closed · Vincentyua closed this issue 9 months ago

Vincentyua commented 9 months ago

Hello, I have a few questions about your paper and hope you can help: @Guangxuan-Xiao

  1. If, each time, the context is truncated to the K-1 tokens preceding the current token (where K is the maximum context window) and attention is recomputed over that sequence, wouldn't the PPL suddenly increase a lot?

  2. In theory, when the cache is kept at the maximum sequence length, if the first 4 tokens serve as anchor points, then adding a new token only pushes out the very first token. Three anchor tokens still remain, so the PPL should barely be affected. Why does it increase so much?

  3. I would like to confirm whether StreamingLLM simply discards the KV of the evicted middle tokens from the ordinary inference KV cache. If so, it is essentially an improvement over the normal KV cache that accelerates inference: with continuous input, there is no need to clear the cache and recompute.

Guangxuan-Xiao commented 9 months ago

Hello,

Thank you for reaching out with your questions. I'd be happy to clarify.

  1. The effect on perplexity depends on the window size K. Consider a large window size like 2048. The difference in perplexity when predicting the next token from the previous 2048 tokens, as opposed to the preceding 2047 tokens, is marginal. So the PPL does not change much.

  2. Not all four initial tokens carry equal significance in the attention mechanism. The very first token typically carries the most weight, and removing it causes a significant drop in model performance. Please see Figure 2 in our paper for a detailed visualization of this effect.

  3. If I understand correctly, you're comparing StreamingLLM with the dense attention baseline. StreamingLLM does more than just speed up inference. If you rely solely on dense attention, which retains all historical KVs, the model's perplexity increases drastically once the text exceeds the pre-training window length, because the model struggles to generalize to longer sequences. Therefore, StreamingLLM improves both efficiency and language-modeling performance (see the sketch below).
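
A minimal sketch of the cache policy being discussed, keeping the first few "attention sink" tokens plus a recent window and evicting everything in between. This is an illustration only, not the repository's implementation; the function name `evict_middle`, the default sizes, and the HF-style `[batch, heads, seq_len, head_dim]` KV layout are assumptions.

```python
import torch

def evict_middle(past_key_values, start_size=4, recent_size=1020):
    """Hypothetical helper: keep the first `start_size` tokens (attention sinks)
    and the latest `recent_size` tokens, dropping the middle of the KV cache.

    `past_key_values` is a list of (key, value) tensors, each shaped
    [batch, num_heads, seq_len, head_dim]. The returned cache length never
    exceeds start_size + recent_size."""
    seq_len = past_key_values[0][0].size(2)
    if seq_len <= start_size + recent_size:
        return past_key_values  # nothing to evict yet
    return [
        (
            # concatenate the sink tokens with the most recent tokens
            torch.cat([k[:, :, :start_size], k[:, :, -recent_size:]], dim=2),
            torch.cat([v[:, :, :start_size], v[:, :, -recent_size:]], dim=2),
        )
        for k, v in past_key_values
    ]
```

With continuous input you would call such a helper after every decoding step, so the cache never has to be cleared and no past tokens are recomputed.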

I hope this clarifies your questions.

Guangxuan