Closed FrederickGeek8 closed 11 months ago
Hi,
Thank you for diving deep into our paper and raising this insightful question! Regarding the 0+1024 configuration: it means the zero sink can be evicted once the input length surpasses the cache size. Essentially, this is equivalent to training a model with the SoftMax1 function but running inference with the standard SoftMax. That train/inference discrepancy is what causes the unexpected surge in perplexity. I hope this clarifies your question.
Guangxuan
Hi there, thanks for writing such an interesting paper. When I heard of your paper, I immediately thought it might be related to Evan Miller's Attention Is Off By One blog post, and I was right! I was excited to see your experiments, but I became confused when I reached Table 3, which describes the results of pre-training, from scratch, identical language models corresponding to vanilla Attention, Attention with a Zero Sink, and Attention with a Learnable Sink, with 0-4 sink tokens prepended.
Maybe it's because of some strange sense of "moral truth" I have about the Zero Sink, but I was a little surprised that it didn't do better experimentally. Then I looked closer, noticed your $0 + 1024$ experiments, and was confused by the results presented.
In the table description you say:
Based on that definition, shouldn't the $0 + 1024$ case with 0 sink tokens make the three formulations equivalent? Where do the wildly different perplexity results come from in that experiment? Perhaps I'm misunderstanding the table's description.
Thank you for fielding my questions!