huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Tokens Removed from Trained Custom BPE Tokenizer #1516

Closed rteehas closed 5 months ago

rteehas commented 5 months ago

Hi,

I've trained a custom BPE tokenizer on Unicode strings, starting from an initial vocab. However, I noticed that some tokens from the starting vocab don't appear in the vocab of the trained tokenizer. Is this expected? For example, are tokens removed if they never appear in the training data? Is there a way to force the tokenizer to preserve the starting vocab?
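
To make the symptom concrete, here is a minimal sketch of one way to list which starting tokens are missing after training. The helper name `missing_tokens` and the sample data are hypothetical, made up for illustration. In practice the trained vocab would presumably be the token-to-id dict returned by `tokenizer.get_vocab()`.

```python
def missing_tokens(initial_vocab, trained_vocab):
    """Return the starting tokens that are absent from the trained vocab.

    initial_vocab: iterable of token strings used as the starting vocab.
    trained_vocab: mapping of token -> id, e.g. the dict that
                   tokenizer.get_vocab() returns (assumed here, not called).
    """
    # Preserve the order of the starting vocab so the report is easy to scan.
    return [tok for tok in initial_vocab if tok not in trained_vocab]


# Hypothetical data for illustration only; not taken from a real tokenizer.
initial_vocab = ["a", "b", "é", "λ"]
trained_vocab = {"a": 0, "b": 1, "ab": 2, "é": 3}

print(missing_tokens(initial_vocab, trained_vocab))
```

Running the same check against the real starting vocab and the real trained vocab would show exactly which tokens were dropped, which should help narrow down whether they are the ones that never occur in the training data.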