knuddelsgmbh / jtokkit

JTokkit is a Java tokenizer library designed for use with OpenAI models.
https://jtokkit.knuddels.de/
MIT License

Count is different #30

Closed caopengan closed 1 year ago

caopengan commented 1 year ago

Hi, when I count tokens for Chinese text, the count is different from the token count displayed on OpenAI's official website.

OpenAI's count (screenshot)

JTokkit's count (screenshot)

tox-p commented 1 year ago

You are using the wrong encoding; please take a look at my answer in this issue: https://github.com/knuddelsgmbh/jtokkit/issues/19#issue-1681077940
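
For reference, a minimal sketch (assuming the current JTokkit registry API) of how to pick the cl100k_base encoding used by gpt-3.5-turbo / gpt-4 and count a Chinese string; the sample text is only illustrative:

```java
import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.ModelType;

public class CountExample {
    public static void main(String[] args) {
        EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
        // Resolve the encoding for the model; gpt-3.5-turbo / gpt-4 use cl100k_base.
        // Picking an older encoding (e.g. p50k_base) here would yield different counts.
        Encoding encoding = registry.getEncodingForModel(ModelType.GPT_3_5_TURBO);
        String text = "你好，世界"; // illustrative Chinese input
        System.out.println(encoding.countTokens(text));
    }
}
```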

ReberMusk commented 1 year ago

You are using the wrong encoding; please take a look at my answer in this issue: #19 (comment)

I noticed that the role name and the message content of both the input and the output need to be added to the token calculation, but counted that way the number comes out 1-2 tokens higher each time. When I remove them, the result is the opposite, and that count is the correct one (consistent with the usage returned by the API). Has this logic changed, or am I doing something wrong? I am using gpt-3.5-turbo.

tox-p commented 1 year ago

Have you tried this: https://jtokkit.knuddels.de/docs/getting-started/recipes/chatml ?

It is based on the official OpenAI cookbook on counting tokens in chatml format: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

Edit: It may be possible that something has changed with OpenAI's recent function calling release; see this documentation: https://platform.openai.com/docs/guides/gpt/managing-tokens

I have not yet looked into this, but at a cursory glance it seems that this line differs from the previous calculation:

num_tokens += 2  # every reply is primed with <im_start>assistant

instead of the previous

num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
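
For anyone who wants to reproduce the cookbook calculation in Java, here is a rough sketch that mirrors the formula above for gpt-3.5-turbo-0613 (3 overhead tokens per message, 1 extra token when a name is set, and 3 priming tokens for the assistant reply). The Message record is a hypothetical holder invented for this example; the JTokkit chatml recipe linked above is the authoritative version.

```java
import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.ModelType;

import java.util.List;

public class ChatTokenCount {
    // Hypothetical message holder, just for this sketch.
    record Message(String role, String name, String content) {}

    static int countMessageTokens(Encoding enc, List<Message> messages) {
        int tokensPerMessage = 3; // per-message overhead for gpt-3.5-turbo-0613 / gpt-4
        int tokensPerName = 1;    // extra token when the optional "name" field is present
        int numTokens = 0;
        for (Message m : messages) {
            numTokens += tokensPerMessage;
            numTokens += enc.countTokens(m.role());
            numTokens += enc.countTokens(m.content());
            if (m.name() != null) {
                numTokens += tokensPerName;
                numTokens += enc.countTokens(m.name());
            }
        }
        numTokens += 3; // every reply is primed with <|start|>assistant<|message|>
        return numTokens;
    }

    public static void main(String[] args) {
        Encoding enc = Encodings.newDefaultEncodingRegistry()
                .getEncodingForModel(ModelType.GPT_3_5_TURBO);
        List<Message> messages = List.of(
                new Message("system", null, "You are a helpful assistant."),
                new Message("user", null, "Hello!"));
        System.out.println(countMessageTokens(enc, messages));
    }
}
```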
ReberMusk commented 1 year ago

Have you tried this: https://jtokkit.knuddels.de/docs/getting-started/recipes/chatml ?

Sure, I tried it, but it seems wrong. When I count only the prompt, the result is consistent with the usage.prompt_tokens returned by the API, and the same is true for the completion. However, with this method the calculated token count comes out higher than what the API officially returns (my code is written exactly this way).

Krobys commented 1 year ago

I also encountered the problem of a different number of tokens. I found out empirically that spaces, indentation and hyphens are also taken into account in the OpenAI tokenization window, but this tokenizer does not seem to count them.
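
One quick way to check that would be to count two variants of the same text, with and without the extra whitespace and hyphens, against the cl100k_base encoding; a small hedged sketch with made-up sample strings:

```java
import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingType;

public class WhitespaceCheck {
    public static void main(String[] args) {
        Encoding enc = Encodings.newDefaultEncodingRegistry()
                .getEncoding(EncodingType.CL100K_BASE);
        // Compare a string with indentation and hyphens against a flattened version
        // to see how much the whitespace contributes to the count for this encoding.
        System.out.println(enc.countTokens("a list:\n  - item one\n  - item two"));
        System.out.println(enc.countTokens("a list: item one item two"));
    }
}
```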

caopengan commented 1 year ago

Have you tried this: https://jtokkit.knuddels.de/docs/getting-started/recipes/chatml ?

Yeah, solved! Thank you, good job, haha.