Closed FrederickGeek8 closed 11 months ago
Hi,
Thank you for diving deep into our paper and raising this insightful question! Regarding the 0+1024 configuration: it means the zero sink can be evicted once the input length surpasses the cache size. Essentially, this is equivalent to training a model with the SoftMax1 function but running inference with the standard SoftMax. That train/inference discrepancy is what causes the unexpected surge in perplexity. I hope this clarifies your question.
Guangxuan
Hi there, thanks for writing such an interesting paper. When I heard of your paper, I immediately thought it might be related to Evan Miller's Attention Is Off By One blog post, and I was right! I was excited to see your experiments, but I became confused when I reached Table 3, which describes the results of pre-training, from scratch, identical language models corresponding to vanilla Attention, Attention with a Zero Sink, and Attention with a Learnable Sink, with 0-4 sink tokens prepended.
Maybe it's because of some strange sense of "moral truth" I have about the Zero Sink, but I was a little surprised that it didn't do better experimentally. Then I looked closer, noticed your $0 + 1024$ experiments, and was confused by the results presented.
In the table description you say:
Based on that definition, shouldn't the $0 + 1024$ case with 0 sink tokens make the three formulations equivalent? Where do the wildly different perplexity results come from in that experiment? Perhaps I'm misunderstanding the table's description.
Thank you for fielding my questions!