Closed Qubitium closed 7 months ago
Yes, there are! If you iterate through the vocab, you should find some <unusedXX>
tokens. They weren't used for training but can be used for any other purpose. I believe there are around 90 of these tokens; let us know if this helps.
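A quick way to enumerate them is to scan the vocab for the `<unusedN>` naming pattern. A minimal sketch, using a hypothetical in-memory vocab dict standing in for what `tokenizer.get_vocab()` would return from a Hugging Face tokenizer:

```python
import re

# Hypothetical vocab snippet; in practice you'd use
# vocab = tokenizer.get_vocab() on the real Gemma tokenizer.
vocab = {
    "<bos>": 2,
    "hello": 17534,
    "<unused0>": 7,
    "<unused1>": 8,
    "<unused98>": 105,
}

# Collect every token whose name matches the <unusedN> pattern,
# sorted by token ID so the reserved slots are easy to inspect.
unused = sorted(
    ((tok, tid) for tok, tid in vocab.items()
     if re.fullmatch(r"<unused\d+>", tok)),
    key=lambda pair: pair[1],
)
print(unused)
```

Running the same filter against the real tokenizer's vocab should surface all of the reserved slots.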
Thank you! Exactly what we are looking for.
Are there any pointers or guidelines on how we can make use of these <unusedXX>
tokens?
How can one make use of them while finetuning?
Because a BPE vocabulary cannot expand dynamically after training, some BPE-tokenizer-based models such as Qwen reserve ~2k extra unused tokens at the end of the vocab for developers to use as they see fit during finetuning.
Does Gemma have a list of internally unused tokens?
Sometimes model makers resize a vocab to a nice GPU-friendly multiple, which creates unused tokens as a side effect, or they intentionally leave some unused tokens, as Qwen does.
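One common way to use these reserved slots during finetuning is to map your own control tokens onto existing `<unusedN>` IDs, so the embedding matrix never has to be resized. A minimal sketch with a hypothetical vocab and a made-up `<tool_call>` control token (both are illustrative, not part of any real tokenizer):

```python
# Hypothetical vocab snippet; a real one would come from tokenizer.get_vocab().
vocab = {"<unused0>": 7, "hello": 17534, "world": 17535}

# Our custom control-token name -> the reserved slot it reuses.
CONTROL_TOKENS = {"<tool_call>": "<unused0>"}

def encode_with_controls(tokens, vocab, controls=CONTROL_TOKENS):
    """Encode tokens to IDs, routing custom control tokens
    through the reserved unused-token slots."""
    return [vocab[controls.get(t, t)] for t in tokens]

ids = encode_with_controls(["<tool_call>", "hello", "world"], vocab)
print(ids)  # -> [7, 17534, 17535]
```

Because the ID already exists in the vocab, the model's embedding for that slot simply gets trained into its new role during finetuning.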