mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License
6.38k stars 355 forks

Questions Regarding "Sink Tokens" #65

Open clarenceluo78 opened 8 months ago

clarenceluo78 commented 8 months ago

Hi! Thank you for your interesting paper and its implementation! I have a few questions I hope you can clarify:

  1. When using the pre-trained model with a "sink token," is this token also prepended to the input during inference? If so, why does Figure 7 show visualizations with identical token lengths for the two models? If not, is the added learnable "sink token" identical or functionally equivalent to each model's bos token (e.g. <s>), ensuring consistency between inference and the training corpus?
  2. The ablation study on the number of initial tokens suggests that keeping just one initial token already yields reasonable results for most models, except perhaps Llama-2. Given that four initial tokens appear optimal, have you experimented with training models with four additional "sink tokens" to match this setting? (A rough sketch of the cache policy I have in mind follows below.)
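
For concreteness, here is a minimal toy sketch of the start+recent cache policy as I understand it from the paper: keep the first `start_size` (sink) tokens plus the most recent `recent_size` tokens and evict the middle. The function name, defaults, and tensor layout below are my own assumptions, not necessarily how the repo implements it.

```python
# Toy sketch of the start+recent KV cache policy (my own code, not the repo's
# actual implementation). Assumes past_key_values is a list of (key, value)
# pairs shaped [batch, num_heads, seq_len, head_dim], as in Hugging Face models.

import torch

def evict_middle(past_key_values, start_size=4, recent_size=2000):
    """Keep the first `start_size` (sink) tokens and the last `recent_size`
    tokens along the sequence dimension; drop everything in between."""
    seq_len = past_key_values[0][0].size(2)
    if seq_len <= start_size + recent_size:
        return past_key_values  # cache still fits, nothing to evict
    kept = []
    for k, v in past_key_values:
        k_keep = torch.cat([k[:, :, :start_size], k[:, :, -recent_size:]], dim=2)
        v_keep = torch.cat([v[:, :, :start_size], v[:, :, -recent_size:]], dim=2)
        kept.append((k_keep, v_keep))
    return kept
```

Under this framing, my question 2 is essentially whether a model pre-trained with a single dedicated sink token only needs `start_size=1`, or whether training with four sink tokens (to match `start_size=4` from the ablation) would help further.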

Btw, my own research also touches on the role of initial tokens in LLMs, and I find your findings quite complementary to my experimental results. I would be delighted to discuss this further if you are interested, and good luck with your ICLR result :)