google-deepmind / gemma

Open weights LLM from Google DeepMind.
http://ai.google.dev/gemma
Apache License 2.0

Unused tokens in gemma tokenizer #29

Open hboyar opened 5 months ago

hboyar commented 5 months ago

I am using the "google/gemma-2b-it" model from HuggingFace. I noticed there are 99 unused tokens (<unused0>, <unused1>, ...) within the first 106 token ids. Does anyone know their purpose? Just wondering.

tilakrayal commented 5 months ago

@hboyar, if you iterate through the vocab, you will find these tokens. They weren't used during training, but they can be repurposed for anything else.

99 unused tokens are reserved in the pretrained tokenizer model to assist with more efficient training/fine-tuning. The unused tokens have the string format <unused[0-98]> and occupy the token id range [7-105]. They let model developers train/fine-tune Gemma more efficiently by reusing the reserved tokens rather than attempting to resize the BPE-based tokenizer. Let us know if this helps. Thank you!
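For reference, here is a minimal sketch of what "iterate through the vocab" could look like, assuming the Hugging Face transformers package and access to the "google/gemma-2b-it" checkpoint:

```python
# Minimal sketch (assumes the Hugging Face "transformers" package and access
# to the "google/gemma-2b-it" checkpoint): list the reserved unused tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")

# Ids 7-105 are the 99 reserved pieces <unused0> ... <unused98>.
for token_id in range(7, 106):
    print(token_id, tokenizer.convert_ids_to_tokens(token_id))
```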

hboyar commented 5 months ago

That's a really cool feature. It would definitely help when creating custom fine-tuning tasks. I'll keep this in mind. Thank you!

kishoreKunisetty commented 1 month ago

Hey guys,

I was wondering if there is a way to reassign those unused tokens to a new token I want to introduce to the model, something like remapping one of the <unused[0-98]> tokens to a custom token of my own.

Thanks.
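One possible approach, as a hedged sketch rather than an officially documented workflow: since the <unused[0-98]> strings already exist in the vocab, registering one of them as a special token should reuse its existing id instead of growing the embedding matrix, and you can then use that string wherever your custom marker would appear in fine-tuning data. The marker name <my_new_token> below is made up for illustration.

```python
# Hedged sketch (assumes Hugging Face "transformers"; <my_new_token> is a
# made-up name for illustration, not a real Gemma token): reuse a reserved
# id instead of resizing the tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")

# "<unused0>" is already in the vocab, so registering it as a special token
# should add 0 new entries -- its existing id is reused and no embedding
# resize is needed.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<unused0>"]}
)
print("newly added tokens:", num_added)  # expected: 0

# Use the reserved string wherever the custom marker (<my_new_token>) would
# go in fine-tuning data; the model then learns that id's embedding for the
# new role during fine-tuning.
text = "Answer inside markers: <unused0> 42 <unused0>"
print(tokenizer.encode(text, add_special_tokens=False))
```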