openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License

A character is split into two tokens #294

Closed kerlion closed 6 months ago

kerlion commented 6 months ago

I found that tiktoken splits a Chinese character into two tokens. Is this normal?

hauntsaninja commented 6 months ago

Yes, this is expected. There are roughly 150K Unicode characters, so if your vocab size is smaller than that, some Unicode characters have to be split into multiple tokens. tiktoken's BPE operates on UTF-8 bytes, so a multi-byte character whose bytes never got merged into a single token will come back as several token ids.