tomaarsen / attention_sinks

Extend existing LLMs way beyond the original training length with constant memory usage, without retraining
https://huggingface.co/blog/tomaarsen/attention-sinks
Apache License 2.0

Questions Related to the Application and Results of Attention Sinks After the Paper #28

Closed. dsdanielpark closed this issue 8 months ago

dsdanielpark commented 8 months ago

Questions Related to the Application and Results of Attention Sinks After the Paper

  1. Hello, I was deeply impressed by your paper. Since attention sinks resolve the issue of the initial tokens receiving a disproportionate amount of attention weight, I expected many models to adopt them. However, even after some time has passed, they do not seem to be applied as widely as I expected. May I ask what the authors think the reason for this might be?

  2. I am also curious whether it is better to apply attention sinks during model training or during model inference, and whether any performance degradation has been verified since the paper. Intuitively, I do not expect a significant overall speed improvement, but I wonder whether quality should not in fact be slightly higher. Alternatively, giving more weight to the early parts of a sentence might intuitively be a way to improve the overall understanding of the sentence.

Therefore, I am curious about how the authors' thoughts have changed after the paper.

tomaarsen commented 8 months ago

Hello!

First of all, I want to point out that I'm not one of the paper authors! The official GitHub repository for the paper is https://github.com/mit-han-lab/streaming-llm. Feel free to copy your issue there!

That said, I can try to answer the questions myself, as I'm quite familiar with this area as well.

  1. I agree with you here - I expected this to be adopted by practitioners very quickly. My theory is that it's not commonly used because not many people care about longer "fluency" - people mostly care about longer context lengths, which is not something attention sinks provide. However, transformers is working on an implementation: https://github.com/huggingface/transformers/pull/26681. There is also a usage sketch for this repository after this list.

  2. I can't say for certain, as I've only applied it during model inference. My experiments show that inference speed is higher than with full/dense attention once more tokens have been generated than fit in the window. However, there is a very slight increase in perplexity, which intuitively corresponds to a slight loss in understanding. The cache eviction rule behind both effects is sketched below, after the usage example.
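
For reference, a minimal usage sketch of the drop-in pattern this repository describes in its README. The keyword names `attention_sink_size` and `attention_sink_window_size` are assumptions based on my reading of the README and may differ in the version you install, so treat this as a sketch rather than the definitive API:

```python
# Sketch only: attention_sinks mirrors the transformers Auto* API, so the model
# class is swapped in as a drop-in replacement. The two attention_sink_* keyword
# names are assumptions based on the README and may differ in your version.
from attention_sinks import AutoModelForCausalLM
from transformers import AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # any supported architecture

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    attention_sink_size=4,            # initial "sink" tokens that are never evicted
    attention_sink_window_size=1020,  # rolling window of most recent tokens
)

inputs = tokenizer("Vaswani et al. (2017) introduced the Transformer", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```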
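
The speed and perplexity behaviour in point 2 follows from how the KV cache is pruned: only the first few "sink" positions plus a recent window are ever kept, so per-token attention cost stays constant while everything in between is discarded. Below is a toy sketch of that eviction rule in plain Python, not the actual implementation:

```python
def evict_kv_cache(cache_len, sink_size=4, window_size=1020):
    """Toy illustration of the sink + sliding-window eviction rule.

    Returns the indices of the key/value positions that are kept once the
    cache grows past sink_size + window_size. This only shows the rule the
    method follows, not the real cache code.
    """
    budget = sink_size + window_size
    if cache_len <= budget:
        return list(range(cache_len))              # nothing to evict yet
    sinks = list(range(sink_size))                 # first tokens are kept forever
    recent = list(range(cache_len - window_size, cache_len))  # rolling window
    return sinks + recent

# Example: with a 4 + 1020 budget, the positions between the sinks and the
# recent window are dropped, so attention cost per generated token is constant.
kept = evict_kv_cache(cache_len=2000)
assert len(kept) == 1024
```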

dsdanielpark commented 8 months ago

tomaarsen

Thank you for your invaluable insights. It's been a great help.

I also plan to apply attention sinks in my own inference process. The key point seems to be the potential difference in information loss between documents like abstracts, where the tokens at the beginning carry crucial information, and boilerplate-style formats like insurance policies, which simply need to meet the required criteria.

It seems clear that there is some degree of trade-off between speed and quality.

Once again, thank you for the wonderful project and your opinions. I'll come back with more questions if I have any.