Tokens Removed from Trained Custom BPE Tokenizer

huggingface / tokenizers

Tokens Removed from Trained Custom BPE Tokenizer #1516

Closed: rteehas closed this issue 5 months ago

rteehas commented 5 months ago

Hi,

I've trained a custom BPE tokenizer on Unicode strings, starting from an initial vocab. However, I noticed that some tokens from the starting vocab don't appear in the vocab of the trained tokenizer. Is this expected? For example, are tokens removed if they never appear in the training data? Is there a way to force the tokenizer to preserve the starting vocab?
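To make the report concrete, here is a minimal sketch of how the dropped tokens could be listed. The helper name `missing_tokens` and the toy vocabularies are hypothetical stand-ins, not taken from my actual setup; the function only compares the keys of two token-to-id mappings.

```python
def missing_tokens(starting_vocab, trained_vocab):
    """Return starting-vocab tokens that are absent from the trained vocab, sorted."""
    return sorted(set(starting_vocab) - set(trained_vocab))


# Hypothetical toy data standing in for the real vocabularies.
starting_vocab = {"a": 0, "b": 1, "ab": 2, "zz": 3}
trained_vocab = {"a": 0, "b": 1, "ab": 2, "abb": 3}

dropped = missing_tokens(starting_vocab, trained_vocab)
print(dropped)  # ['zz']
```

With the `tokenizers` library, `trained_vocab` could come from `tokenizer.get_vocab()`. Calling `tokenizer.add_tokens(dropped)` afterwards might be one way to put the missing entries back, though they would then be handled as added tokens rather than learned through merges, so I'm not sure that counts as preserving the starting vocab.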