google / gemma_pytorch

The official PyTorch implementation of Google's Gemma models
https://ai.google.dev/gemma
Apache License 2.0

Are there reserved/unused tokens for developers? #12

Closed Qubitium closed 7 months ago

Qubitium commented 7 months ago

Because a BPE vocabulary cannot be expanded dynamically after training, some BPE-tokenizer-based models such as Qwen reserve 2k extra unused tokens at the end of the vocabulary for developers to use as they see fit during finetuning.

Does Gemma have a list of internally unused tokens?

Sometimes model makers resize the vocab to a nice GPU-friendly multiple, which creates unused tokens, or intentionally leave some tokens unused, as Qwen does.

suryabhupa commented 7 months ago

Yes, there are! If you iterate through the vocab, you should find some <unusedXX> tokens. They weren't used during training, but can be used for any other purpose. I think there are around 90 or so of these tokens; let us know if this helps.
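
A quick way to list them is to iterate the SentencePiece vocab directly, e.g. (a minimal sketch; the path assumes the tokenizer.model bundled with this repo, but any Gemma SentencePiece model should work):

```python
import sentencepiece as spm

# Minimal sketch: enumerate the vocab and keep the <unusedXX> placeholder pieces.
# The path below assumes the tokenizer.model shipped with this repo.
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer/tokenizer.model")

unused = [
    (piece_id, sp.IdToPiece(piece_id))
    for piece_id in range(sp.GetPieceSize())
    if sp.IdToPiece(piece_id).startswith("<unused")
]

print(f"{len(unused)} unused tokens found")  # around 90 or so
print(unused[:5])                            # first few (id, piece) pairs
```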

Qubitium commented 7 months ago

Thank you! Exactly what we are looking for.

HrushikeshPawar commented 5 months ago

Are there any pointers or guidelines on how we can make use of these <unusedXX> tokens? How can one make use of them while finetuning?
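
For context, the kind of usage I have in mind is sketched below (assuming the Hugging Face `google/gemma-2b` tokenizer; the checkpoint name and the choice of `<unused0>` are just illustrative). Since the token id already exists in the vocab, no embedding resize should be needed:

```python
from transformers import AutoTokenizer

# Sketch: repurpose <unused0> as a custom control token for finetuning data.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
vocab_size_before = len(tokenizer)

# Mark the existing placeholder as a special token so it is never split.
# Because "<unused0>" is already in the vocab, this reuses its existing id
# instead of appending a new row, so model.resize_token_embeddings() is not needed.
tokenizer.add_special_tokens({"additional_special_tokens": ["<unused0>"]})
assert len(tokenizer) == vocab_size_before

# Use it like any other control token when building finetuning examples.
example = "<unused0> Summarize: The quick brown fox jumps over the lazy dog."
ids = tokenizer(example, return_tensors="pt").input_ids
print(ids[0, :3])  # <bos>, then <unused0>'s id, then the ordinary text tokens
```

Is this the intended pattern, or is there a recommended way to assign these tokens during finetuning?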