Document the existence of 99 unused tokens in the tokenizer

google / gemma_pytorch

The official PyTorch implementation of Google's Gemma models

https://ai.google.dev/gemma

Apache License 2.0

5.19k stars 492 forks

Document the existence of 99 unused tokens in the tokenizer #44

Closed: Qubitium closed this 4 months ago

Qubitium commented 4 months ago

Add notes to help model developers train and fine-tune Gemma more efficiently by reusing the reserved (unused) tokens instead of attempting to resize the BPE-based tokenizer.
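As a rough illustration of the idea, the sketch below maps new special tokens onto reserved slots so that the vocabulary size never changes. It assumes the reserved pieces are named `<unused0>`, `<unused1>`, and so on; the helper names, the toy vocabulary, and the example special tokens are hypothetical and not part of this repository.

```python
import re
from typing import Dict, List

# Assumption: reserved pieces are named "<unused0>", "<unused1>", ...
UNUSED_PATTERN = re.compile(r"^<unused(\d+)>$")


def find_unused_tokens(vocab: List[str]) -> Dict[str, int]:
    """Map each reserved piece to its token id (its index in `vocab`)."""
    return {
        piece: idx
        for idx, piece in enumerate(vocab)
        if UNUSED_PATTERN.match(piece)
    }


def assign_special_tokens(vocab: List[str], new_tokens: List[str]) -> Dict[str, int]:
    """Give each new special token the id of a reserved slot, lowest slot first.

    No ids are added, so the embedding matrix keeps its shape.
    """
    slots = sorted(
        find_unused_tokens(vocab).items(),
        key=lambda item: int(UNUSED_PATTERN.match(item[0]).group(1)),
    )
    if len(new_tokens) > len(slots):
        raise ValueError(
            f"need {len(new_tokens)} reserved slots, only {len(slots)} available"
        )
    return {token: idx for token, (_, idx) in zip(new_tokens, slots)}


# Toy vocabulary standing in for the real tokenizer's piece list (hypothetical).
toy_vocab = ["<pad>", "<eos>", "<bos>", "<unk>", "<unused0>", "<unused1>", "hello"]

# Hypothetical special tokens a fine-tuning run might want.
mapping = assign_special_tokens(toy_vocab, ["<|tool_call|>", "<|tool_result|>"])
print(mapping)
```

With a real SentencePiece model, the piece list could presumably be built with `[sp.id_to_piece(i) for i in range(sp.get_piece_size())]`. The training code then has to substitute the reserved id wherever the new special token appears in the data; that step is not shown here.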

ref: https://github.com/google/gemma_pytorch/issues/12

@pengchongjin @suryabhupa

pengchongjin commented 4 months ago

Cool, thanks!