Yes, if you want to add a dedicated sink token during pre-training, you should add a new token to the tokenizer vocabulary and extend the embedding layer accordingly.
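For reference, a minimal sketch of those two steps with Hugging Face `transformers` (the `<sink>` token string and the Llama checkpoint name are placeholders for illustration, not what the paper used):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# 1. Register the dedicated sink token in the tokenizer vocabulary.
tokenizer.add_special_tokens({"additional_special_tokens": ["<sink>"]})

# 2. Grow the embedding matrix (and any tied LM head) to cover the new id;
#    the new row is freshly initialized and is learned during pre-training.
model.resize_token_embeddings(len(tokenizer))

# 3. Prepend the sink token to every training sample.
sink_id = tokenizer.convert_tokens_to_ids("<sink>")
sample = tokenizer("some training text", add_special_tokens=False)["input_ids"]
input_ids = [sink_id] + sample
```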
Guangxuan
@Guangxuan-Xiao Thanks for your reply! I recently wrote a technical analysis of StreamingLLM. Could we connect on WeChat?
Hello,

StreamingLLM is a very efficient model. I spent a lot of time reading the paper and debugging the code, and my understanding is that the starting sink token is meant to be a uniform learnable token. The paper says:

"This lack of a uniform starting token leads the model to use several initial tokens as attention sinks. We hypothesize that by incorporating a stable learnable token at the start of all training samples..."
If the sink token is a new token, should we change the embedding weights or modify the tokenizer?