openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License

A character is split into two tokens #294

Closed kerlion closed 1 month ago

kerlion commented 1 month ago

I found that tiktoken splits a Chinese character into two tokens. Is this normal?

hauntsaninja commented 1 month ago

Yes, this is expected. There are like 150K assigned Unicode characters, so if your vocab size is less than that, some Unicode characters have to be split into multiple tokens.
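The pigeonhole argument above can be sketched with the standard library alone (no tiktoken needed). This is an illustrative sketch, not tiktoken's actual code: it counts assigned Unicode codepoints, then shows that a CJK character occupies several UTF-8 bytes, which is the level tiktoken's byte-level BPE operates on. If no merge in the vocabulary covers all of a character's bytes, the character surfaces as more than one token.

```python
import sys
import unicodedata

# Count codepoints with an assigned general category, excluding
# unassigned (Cn), private use (Co), and surrogates (Cs).
assigned = sum(
    1
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) not in ("Cn", "Co", "Cs")
)
# Roughly 150K in recent Unicode versions -- more than e.g. a ~100K
# token vocabulary, so some characters cannot each get their own token.
print(f"assigned Unicode codepoints: {assigned}")

# tiktoken's BPE runs over UTF-8 bytes; a typical CJK character is 3 bytes.
ch = "碰"
print(ch, "->", list(ch.encode("utf-8")))
```

So a character encoding as two tokens simply means its UTF-8 bytes were covered by two learned merges rather than one; decoding the two tokens together still reproduces the original character exactly.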