mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License

Why isn't the starting sink token a special token like '\n'? #62

Closed dhcode-cpp closed 8 months ago

dhcode-cpp commented 8 months ago

Hello,

StreamingLLM is a very efficient model.

I spent a lot of time reading the paper and debugging the code, and my understanding is that the starting sink token is a uniform learnable token. The paper says:

> This lack of a uniform starting token leads the model to use several initial tokens as attention sinks. We hypothesize that by incorporating a stable learnable token at the start of all training samples ...

If the sink token is a new token, should we change the embedding weights or modify the tokenizer?

Guangxuan-Xiao commented 8 months ago

Yes, if you want to add a dedicated sink token during pre-training, you should add a new token to the tokenizer vocabulary and extend the embedding layer accordingly.
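
For reference, here is a minimal sketch of what that could look like with HuggingFace transformers. The token name `<sink>` and the base checkpoint are placeholders for illustration, not part of the released code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any causal LM works the same way.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# 1. Register the dedicated sink token; the name "<sink>" is arbitrary.
tokenizer.add_special_tokens({"additional_special_tokens": ["<sink>"]})

# 2. Grow the embedding matrix (and any tied LM head) to cover the new id.
model.resize_token_embeddings(len(tokenizer))

# 3. Prepend the sink token to every pre-training sample.
sample = "<sink>" + "The quick brown fox jumps over the lazy dog."
input_ids = tokenizer(sample, return_tensors="pt").input_ids
```

Note that the new embedding row is randomly initialized after resizing, so the sink token only learns to act as an attention sink through pre-training.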

Guangxuan

dhcode-cpp commented 8 months ago

> Yes, if you want to add a dedicated sink token during pre-training, you should add a new token to the tokenizer vocabulary and extend the embedding layer accordingly.
>
> Guangxuan

@Guangxuan-Xiao Thanks for your reply! I recently wrote a technical analysis of StreamingLLM. Could we connect on WeChat?