knuddelsgmbh / jtokkit

JTokkit is a Java tokenizer library designed for use with OpenAI models.
https://jtokkit.knuddels.de/
MIT License
518 stars, 38 forks

Chinese characters #19

Closed by jiangyklala 1 year ago

jiangyklala commented 1 year ago

Hello, the token count for Chinese characters seems to differ slightly between this project and the tokenizer on the official website. I am using the gpt-3.5-turbo model; two comparison screenshots were attached.

tox-p commented 1 year ago

Do you mean this tokenizer: https://platform.openai.com/tokenizer ?

The tokenizer linked above uses the r50k_base encoding, while gpt-3.5-turbo uses cl100k_base.

Try this one, mentioned in the tiktoken FAQ, for your comparison: https://tiktokenizer.vercel.app/ (make sure to use the text box input rather than the message input when comparing the encoding of a raw string, as in your screenshots).
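
The encoding mismatch described above can also be checked from JTokkit itself. The following is a minimal sketch, assuming the JTokkit dependency is on the classpath; the class name `EncodingComparison`, its helper methods, and the sample string are illustrative and not part of the library or the original report.

```java
import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.EncodingType;
import com.knuddels.jtokkit.api.ModelType;

// Hypothetical helper class for illustration only.
public class EncodingComparison {

    private static final EncodingRegistry REGISTRY = Encodings.newDefaultEncodingRegistry();

    // Counts tokens of a raw string with an explicitly chosen encoding.
    static int countWith(EncodingType type, String text) {
        return REGISTRY.getEncoding(type).countTokens(text);
    }

    // Counts tokens of a raw string with the encoding JTokkit maps to the given model.
    static int countForModel(ModelType model, String text) {
        return REGISTRY.getEncodingForModel(model).countTokens(text);
    }

    public static void main(String[] args) {
        // Sample text chosen for this sketch, not taken from the issue.
        String text = "你好，世界";

        // Encoding used by the tokenizer page linked in the question.
        System.out.println("r50k_base:     " + countWith(EncodingType.R50K_BASE, text));
        // Encoding used by gpt-3.5-turbo.
        System.out.println("cl100k_base:   " + countWith(EncodingType.CL100K_BASE, text));
        // Should match the cl100k_base count, since the model maps to that encoding.
        System.out.println("gpt-3.5-turbo: " + countForModel(ModelType.GPT_3_5_TURBO, text));
    }
}
```

If the two encodings print different counts for the same string, the difference comes from the encoding choice rather than from the library. These counts cover the raw string only, which corresponds to the text box input mentioned above rather than the message input.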

jiangyklala commented 1 year ago

Solved! Thanks for answering and for contributing such a good library!