mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License

I'm (A Bit) Suspicious of Table 3. #44

Closed: FrederickGeek8 closed this issue 11 months ago

FrederickGeek8 commented 11 months ago

Hi there, thanks for writing such an interesting paper. When I heard of your paper, I immediately had the thought that it might be related to Evan Miller's "Attention Is Off By One" blog post. And I was right! I was excited to see your experiments, but I got stuck on Table 3, which reports the results of pre-training, from scratch, otherwise-identical language models corresponding to vanilla Attention, Attention with a Zero Sink, and Attention with a Learnable Sink, evaluated with 0 to 4 sink tokens prepended.
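To spell out why the Zero Sink appeals to me: as far as I can tell, prepending a token whose key and value are all zeros reproduces Evan Miller's SoftMax1 exactly. Here is a toy check of my own (not code from this repo), with made-up shapes:

```python
import torch
import torch.nn.functional as F

def softmax1(scores):
    """Evan Miller's 'off-by-one' softmax: exp(x_i) / (1 + sum_j exp(x_j))."""
    e = torch.exp(scores)
    return e / (1.0 + e.sum(dim=-1, keepdim=True))

torch.manual_seed(0)
d = 4                      # toy head dimension
q = torch.randn(1, d)      # one query
k = torch.randn(3, d)      # keys of the real tokens
v = torch.randn(3, d)      # values of the real tokens
scores = (q @ k.T) / d ** 0.5

# (a) Zero Sink: prepend a token whose key and value are all zeros.
#     Its score is exactly 0, so it adds exp(0) = 1 to the softmax denominator
#     and contributes nothing to the output.
k0 = torch.cat([torch.zeros(1, d), k])
v0 = torch.cat([torch.zeros(1, d), v])
out_zero_sink = F.softmax((q @ k0.T) / d ** 0.5, dim=-1) @ v0

# (b) SoftMax1 over the real tokens only.
out_softmax1 = softmax1(scores) @ v

print(torch.allclose(out_zero_sink, out_softmax1, atol=1e-6))  # True
```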

Maybe it's because of some strange sense of "moral truth" I attach to the Zero Sink, but I was a little surprised that it didn't do better experimentally. Then I looked closer, noticed your $0 + 1024$ experiments, and was a little confused by the results presented.

In the table description you say

> Cache config x+y denotes adding x initial tokens with y recent tokens.

Based on that definition, shouldn't the $0 + 1024$ case with 0 sink tokens make the three formulations equivalent? If so, where do the wildly different perplexity results for that experiment come from? Perhaps I'm misunderstanding the table's description.

Thank you for fielding my questions!

Guangxuan-Xiao commented 11 months ago

Hi,

Thank you for diving deep into our paper and raising this insightful question! In the 0+1024 configuration, no initial tokens are kept, so the zero sink itself can be evicted once the input length exceeds the cache size. Essentially, the model is trained with the SoftMax1 function (the sink is always present during training) but runs inference with the standard SoftMax (the sink is gone). This train/inference discrepancy is what leads to the unexpected surge in perplexity. I hope this clarifies your question.
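To make the eviction concrete, here is a rough, illustrative sketch of the x+y cache rule (toy Python only; the function name and numbers are made up and this is not our actual implementation):

```python
def kv_cache_keep(tokens, n_sink, n_recent):
    """Toy version of the x+y cache config: keep the first n_sink tokens
    plus the n_recent most recent ones (illustrative only, not the repo's API)."""
    if len(tokens) <= n_sink + n_recent:
        return tokens
    return tokens[:n_sink] + tokens[-n_recent:]

stream = ["<sink>"] + [f"tok{i}" for i in range(2000)]

# x = 0 (scaled down to 0 + 4): the sink is evicted like any other token,
# so attention falls back to a plain SoftMax denominator at inference time.
print(kv_cache_keep(stream, n_sink=0, n_recent=4))
# ['tok1996', 'tok1997', 'tok1998', 'tok1999']

# x = 1 (1 + 4): the sink stays pinned in the cache, matching training.
print(kv_cache_keep(stream, n_sink=1, n_recent=4))
# ['<sink>', 'tok1996', 'tok1997', 'tok1998', 'tok1999']
```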

Guangxuan