Open hboyar opened 5 months ago
@hboyar,
If you iterate through the vocab, you should find them: 99 unused tokens are reserved in the pretrained tokenizer model to assist with more efficient training/fine-tuning. The unused tokens have the string format `<unused[0-98]>`, with a token id range of [7-105]. They allow model developers to train/fine-tune Gemma more efficiently by re-using the reserved tokens rather than attempting to resize the BPE-based tokenizer. Let us know if this helps. Thank you!
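A quick sketch of what that vocab scan looks like. The miniature `vocab` dict below is a hypothetical stand-in for the real Gemma vocabulary (the real one maps `<unused0>`..`<unused98>` to ids 7..105); with the actual tokenizer you would get the mapping from `tokenizer.get_vocab()`:

```python
import re

# Hypothetical miniature vocab standing in for the real Gemma vocabulary.
vocab = {"<pad>": 0, "<unused0>": 7, "<unused1>": 8, "hello": 200}

# Collect the reserved <unusedN> placeholder tokens.
unused = {tok: idx for tok, idx in vocab.items() if re.fullmatch(r"<unused\d+>", tok)}
print(sorted(unused.items(), key=lambda kv: kv[1]))
# → [('<unused0>', 7), ('<unused1>', 8)]
```

On the real tokenizer the same comprehension over `tokenizer.get_vocab()` should surface all 99 reserved entries.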
That's a really cool feature. It would definitely help when creating custom fine-tuning tasks. I'll keep this in mind. Thank you!
Hey guys,
I was wondering if there is a way to reassign those unused tokens to a new token I want to introduce to the model.
Thanks.
I am using the "google/gemma-2b-it" model from HuggingFace. I realized there are 99 unused tokens (`<unused0>`, `<unused1>`, ...) in the first 106 token ids. Does anyone know their purpose? Just wondering.