Closed caopengan closed 1 year ago
You are using the wrong encoding, please take a look at my answer in this issue: https://github.com/knuddelsgmbh/jtokkit/issues/19#issue-1681077940
I noticed that the `name`, `role`, and `message` fields of both the input and the output need to be included in the token calculation, but the number I compute this way is consistently 1-2 tokens higher than the actual count each time. When I remove those fields, the opposite happens, and the count is correct (consistent with the `usage` returned by the API).
Has this logic changed, or am I doing something wrong? I am using gpt-3.5-turbo.
Have you tried this: https://jtokkit.knuddels.de/docs/getting-started/recipes/chatml ?
It is based on the official OpenAI cookbook on counting tokens in chatml format: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
Edit: it is possible that something changed with OpenAI's recent function calling release; see this documentation: https://platform.openai.com/docs/guides/gpt/managing-tokens
I have not yet looked into this, but at a cursory glance it seems that this line differs from the previous calculation:
num_tokens += 2 # every reply is primed with <im_start>assistant
instead of the previous
num_tokens += 3 # every reply is primed with <|start|>assistant<|message|>
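For context, the cookbook's counting logic can be sketched roughly as below. The `toy_encode` tokenizer is a stand-in for illustration only; in practice you would use the real `cl100k_base` BPE encoder (tiktoken in Python, or JTokkit's equivalent in Java), and the per-message constants shown are the ones under discussion here:

```python
def num_tokens_from_messages(messages, encode):
    """Approximate the prompt token count for a chat completion request.

    `encode` is any callable str -> list of tokens; in practice this would
    be the cl100k_base encoding (tiktoken or JTokkit's equivalent).
    Constants follow the OpenAI cookbook for gpt-3.5-turbo-0613 and later;
    older -0301 models used different per-message overheads.
    """
    tokens_per_message = 3  # <|start|>{role}<|message|> framing per message
    tokens_per_name = 1     # extra token when a "name" field is present
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens


# Stand-in tokenizer for illustration only (NOT a real BPE encoder):
toy_encode = lambda s: s.split()

messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello"},
]
print(num_tokens_from_messages(messages, toy_encode))
```

With a real encoder plugged in, a mismatch against `usage.prompt_tokens` narrows down to either the encoder itself or these overhead constants, which is why the `+= 2` vs `+= 3` difference above matters.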
Sure, I tried, but it seems wrong.
When I count only the prompt, the result matches the `usage.prompt_tokens` returned by the API, and the same is true for the completion. However, with this method the calculation comes out higher than what the API officially returns (that is how my code is written).
I also ran into a mismatch in token counts. I found empirically that spaces, indentation, and hyphens are counted in OpenAI's tokenization playground, but this tokenizer apparently does not count them.
Yeah, solved! Thank you, good job!
Hi, when I count Chinese text, the result differs from the token count shown on OpenAI's official website.
OpenAI's count: ![image](https://github.com/knuddelsgmbh/jtokkit/assets/12568699/3a445953-3751-487d-a936-3d80cc88a9ac)
JTokkit's count: ![image](https://github.com/knuddelsgmbh/jtokkit/assets/12568699/281059e9-7344-4c75-8188-67c621815105)