Open 13006319571 opened 1 week ago
Hello @13006319571! I'm here to assist you with any bugs, questions, or contributions you have for the repository. Let's work together to solve any issues you're encountering.
I found a similar unsolved discussion that might be relevant to your issue: Max Token Limits Incorrect?, which was last updated on 2024-08-14T04:25:50Z [1].
Regarding token calculation in Dify, the platform uses an approximation based on the GPT-2 tokenizer. The `get_num_tokens` method sums the token counts for each text in the input list by calling `_get_num_tokens_by_gpt2`. Here is the relevant code snippet:
```python
def get_num_tokens(self, model: str, credentials: dict, texts: list[str]) -> int:
    """
    Approximate number of tokens for given messages using GPT2 tokenizer

    :param model: model name
    :param credentials: model credentials
    :param texts: texts to embed
    :return:
    """
    return sum(self._get_num_tokens_by_gpt2(text) for text in texts)
```
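The aggregation above can be sketched in isolation. Note that `_get_num_tokens_by_gpt2` is stubbed here with a simple whitespace word count as a hypothetical stand-in; the real Dify method encodes the text with a GPT-2 tokenizer:

```python
def _get_num_tokens_by_gpt2(text: str) -> int:
    # Hypothetical stand-in: the real implementation encodes `text`
    # with the GPT-2 tokenizer and returns the number of token IDs.
    return len(text.split())

def get_num_tokens(texts: list[str]) -> int:
    # Sum the per-text counts, exactly as the Dify method does.
    return sum(_get_num_tokens_by_gpt2(text) for text in texts)

print(get_num_tokens(["hello world", "one two three"]))  # → 5
```

The key point is that the total is simply the sum over the input texts, so an undercount for one text (e.g. non-English content the stub above cannot see) undercounts the whole batch.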
Additionally, Dify provides a token-calculation method in the `BaichuanTokenizer` class, which can be used for texts containing both Chinese and English characters. It estimates the number of tokens by counting Chinese characters and English words and applying a simple formula:
```python
import re

class BaichuanTokenizer:
    @classmethod
    def count_chinese_characters(cls, text: str) -> int:
        return len(re.findall(r"[\u4e00-\u9fa5]", text))

    @classmethod
    def count_english_vocabularies(cls, text: str) -> int:
        # strip everything except ASCII alphanumerics and whitespace,
        # so only English/numeric words remain
        text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
        # count words, not characters
        return len(text.split())

    @classmethod
    def _get_num_tokens(cls, text: str) -> int:
        # tokens ≈ Chinese characters + English words * 1.3
        # (estimation only, subject to the actual value returned by the API)
        return int(cls.count_chinese_characters(text) + cls.count_english_vocabularies(text) * 1.3)
```
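To see the formula in action, here is a minimal self-contained function mirroring `BaichuanTokenizer._get_num_tokens` (the sample sentence is an illustration, not from the issue):

```python
import re

def estimate_tokens(text: str) -> int:
    # Mirrors BaichuanTokenizer._get_num_tokens: each Chinese character
    # counts as 1 token, each English word as ~1.3 tokens (estimate only).
    chinese = len(re.findall(r"[\u4e00-\u9fa5]", text))
    english = len(re.sub(r"[^a-zA-Z0-9\s]", "", text).split())
    return int(chinese + english * 1.3)

# 6 Chinese characters + 4 English words -> int(6 + 4 * 1.3) = 11
print(estimate_tokens("Dify 支持多种模型 with token counting"))  # → 11
```

Because the Chinese range `[\u4e00-\u9fa5]` excludes punctuation and other CJK blocks, and the English regex strips them as well, text made up largely of such characters can be badly undercounted by this heuristic.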
These methods produce approximations rather than exact counts; the estimate is intended to cover all elements of the input, including any tool or function calls that may be present [2][3][4].
It seems there might be an issue with the token calculation in your specific case. You might want to check if the text is being processed correctly and if the tokenizer is applied as expected.
To continue talking to Dosu, mention @dosu.
It is indeed incorrect.
There may be an error here. https://github.com/langgenius/dify/blob/0cef25ef8c3892e50fb47395a16b8096e7685511/api/core/model_runtime/model_providers/spark/llm/llm.py#L50
Self Checks
Dify version
0.7.2
Cloud or Self Hosted
Self Hosted (Source)
Steps to reproduce
(Screenshot attachment failed to upload.)
✔️ Expected Behavior
The actual number of tokens should be several hundred.
❌ Actual Behavior
The number of tokens calculated by the platform is only 18.