mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License

Questions Related to the Application and Results of Attention Sinks After the Paper #66

Open dsdanielpark opened 8 months ago

dsdanielpark commented 8 months ago

Hello, I was deeply impressed by your paper.

  1. I expected many models to adopt attention sinks, since the issue of the initial tokens receiving a disproportionate amount of attention was addressed. However, even after some time has passed, they do not seem to be applied as widely as I expected. May I ask what the authors think the reason for this might be?

  2. I am curious whether it is better to apply attention sinks during model training or only at inference time (I sketch my understanding of the inference-time cache policy right after this list), and whether any performance degradation has been verified since the paper. Intuitively, I do not expect a large overall speedup, but I wonder whether quality should not be slightly higher. Alternatively, I also think that giving more weight to the early parts of a sequence might be a way to improve the model's overall understanding of it.

  3. Ultimately, the main point seems to be that the paper addresses the issue of the initial tokens receiving a disproportionately large share of the attention weight, so I am curious why it is not used more universally. I also wonder whether sink attention, which spreads attention beyond the initial tokens across the whole sequence, can maintain performance while improving speed, and how it can best be utilized.
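
To make sure I am reading the method correctly, here is a minimal sketch of how I picture the inference-time cache policy: keep the first few tokens as attention sinks plus the most recent window, and evict everything in between. The function name, tensor shapes, and default sizes below are my own illustration, not the repo's actual API.

```python
import torch

def evict_kv_cache(k: torch.Tensor, v: torch.Tensor,
                   num_sinks: int = 4, window: int = 1020):
    """Keep the first `num_sinks` tokens (the attention sinks) plus the most
    recent `window` tokens along the sequence dimension; evict the middle.
    Assumed shapes: [batch, heads, seq_len, head_dim]."""
    seq_len = k.size(2)
    if seq_len <= num_sinks + window:
        return k, v  # cache still fits; nothing to evict yet
    keep = torch.cat([
        torch.arange(num_sinks),                  # sink tokens at the start
        torch.arange(seq_len - window, seq_len),  # the recent window
    ])
    return k[:, :, keep, :], v[:, :, keep, :]
```

If this is roughly what the released implementation does, then my second question is really whether training with this attention pattern, rather than only applying it at inference, would close any remaining quality gap.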

In short, I am curious how the authors' thinking has evolved since the paper was published.

Thank you! :)