Closed by kerlion 6 months ago
I found that tiktoken splits a single Chinese character into two tokens. Is this normal?
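Yes, this is expected. There are roughly 150K Unicode characters, so if your vocab size is less than that, some Unicode characters have to be split into multiple tokens. tiktoken's BPE works on UTF-8 bytes, and most Chinese characters take three bytes in UTF-8, so a character whose full byte sequence is not in the vocab comes out as two or three byte-level tokens.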
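To make that concrete, here is a rough sketch in plain Python. It is not tiktoken's actual algorithm: real BPE applies learned merges in rank order, while this toy uses a greedy longest-match over a made-up vocabulary (`TOY_VOCAB` and `greedy_byte_tokenize` are hypothetical names for illustration). It only shows why a character whose full UTF-8 byte sequence is missing from the vocab ends up as more than one token.

```python
# Toy illustration only: real tiktoken uses learned BPE merges, not this greedy matcher.

def utf8_len(ch: str) -> int:
    """Number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def greedy_byte_tokenize(text: str, vocab: set[bytes]) -> list[bytes]:
    """Split the UTF-8 bytes of `text` into the longest pieces found in `vocab`.

    Any single byte is always accepted as a fallback token, which mirrors the
    idea that a byte-level tokenizer can represent every possible input.
    """
    data = text.encode("utf-8")
    tokens = []
    i = 0
    while i < len(data):
        # Try the longest candidate first, shrinking down to a single byte.
        for j in range(len(data), i, -1):
            piece = data[i:j]
            if j - i == 1 or piece in vocab:
                tokens.append(piece)
                i = j
                break
    return tokens


# Hypothetical vocab: contains all three bytes of "中",
# but only a two-byte prefix of the much rarer "龘".
TOY_VOCAB = {"中".encode("utf-8"), "龘".encode("utf-8")[:2]}

common = greedy_byte_tokenize("中", TOY_VOCAB)
rare = greedy_byte_tokenize("龘", TOY_VOCAB)

print(utf8_len("中"), len(common))  # bytes in UTF-8, tokens produced
print(utf8_len("龘"), len(rare))
```

To see what the real tokenizer does with a given character, you can call `tiktoken.get_encoding("cl100k_base").encode(...)` on it and count the returned token ids. Which characters get split depends on the encoding you load.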