mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License
6.38k stars 355 forks

Questions Regarding "Sink Tokens" #65

Open clarenceluo78 opened 8 months ago

clarenceluo78 commented 8 months ago

Hi! Thank you for your interesting paper and its implementation! I have a few questions I hope you can clarify:

  1. When using the pre-trained model with a "sink token," is this token also prepended to the input during inference? If so, why does Figure 7 show visualizations with identical token lengths for the two models? If not, is the added learnable "sink token" identical or functionally equivalent to each model's bos token (e.g. <s>), ensuring consistency between inference and the training corpus?
  2. The ablation study on the number of initial tokens suggests that keeping just one initial token already yields reasonable results for most models, except perhaps Llama-2. Given that four initial tokens appear optimal, have you experimented with training models with four additional "sink tokens" to match this setting? (A rough sketch of the cache policy I have in mind follows below.)
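
For concreteness, here is a minimal toy sketch of the start+recent cache policy as I understand it from the paper: keep the first `start_size` (sink) tokens plus the most recent `recent_size` tokens and evict the middle. The function name, defaults, and tensor layout below are my own assumptions, not necessarily how the repo implements it.

```python
# Toy sketch of the start+recent KV cache policy (my own code, not the repo's
# actual implementation). Assumes past_key_values is a list of (key, value)
# pairs shaped [batch, num_heads, seq_len, head_dim], as in Hugging Face models.

import torch

def evict_middle(past_key_values, start_size=4, recent_size=2000):
    """Keep the first `start_size` (sink) tokens and the last `recent_size`
    tokens along the sequence dimension; drop everything in between."""
    seq_len = past_key_values[0][0].size(2)
    if seq_len <= start_size + recent_size:
        return past_key_values  # cache still fits, nothing to evict
    kept = []
    for k, v in past_key_values:
        k_keep = torch.cat([k[:, :, :start_size], k[:, :, -recent_size:]], dim=2)
        v_keep = torch.cat([v[:, :, :start_size], v[:, :, -recent_size:]], dim=2)
        kept.append((k_keep, v_keep))
    return kept
```

Under this framing, my question 2 is essentially whether a model pre-trained with a single dedicated sink token only needs `start_size=1`, or whether training with four sink tokens (to match `start_size=4` from the ablation) would help further.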

Btw, my own research also touches on the role of initial tokens in LLMs, and I find your findings quite complementary to my experimental results. I would be delighted to discuss this further if you are interested, and good luck with your ICLR result :)