langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Spark's large language model token calculation error #7911

Open 13006319571 opened 1 week ago

13006319571 commented 1 week ago

Dify version

0.7.2

Cloud or Self Hosted

Self Hosted (Source)

Steps to reproduce

  1. Create a chat application.
  2. Ask it to write an 800-word essay about my mother.
  3. Check the number of tokens reported.
  4. (screenshot attachment failed to upload)

✔️ Expected Behavior

The token count should be several hundred, reflecting the actual length of the generated essay.

❌ Actual Behavior

The number of tokens calculated by the platform is only 18

dosubot[bot] commented 1 week ago

Hello @13006319571! I'm here to assist you with any bugs, questions, or contributions you have for the repository. Let's work together to solve any issues you're encountering.

I found a similar unresolved discussion that might be relevant to your issue: Max Token Limits Incorrect?, which was last updated on 2024-08-14T04:25:50Z [1].

Regarding the token calculation in Dify, the platform uses an approximation based on the GPT-2 tokenizer. The method get_num_tokens sums up the token counts for each text in the input list by calling _get_num_tokens_by_gpt2. Here is the relevant code snippet:

def get_num_tokens(self, model: str, credentials: dict, texts: list[str]) -> int:
    """
    Approximate number of tokens for given messages using GPT2 tokenizer
    :param model: model name
    :param credentials: model credentials
    :param texts: texts to embed
    :return:
    """
    return sum(self._get_num_tokens_by_gpt2(text) for text in texts)
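
For a rough local check, here is a minimal sketch of the same GPT-2-based counting, using the public Hugging Face "gpt2" tokenizer as a stand-in for the tokenizer files Dify bundles (the sample prompt is illustrative):

from transformers import GPT2Tokenizer

# Stand-in for Dify's bundled GPT-2 tokenizer (assumption).
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# A prompt similar to the one in this issue.
text = "写一篇关于我妈妈的800字作文"
print(len(tokenizer.encode(text)))  # byte-level BPE emits several tokens per Chinese character

Because GPT-2's byte-level BPE typically yields two or more tokens per Chinese character, an 800-character essay should count well into the hundreds, so a report of 18 points at the provider-side calculation rather than the tokenizer itself.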

Additionally, Dify provides a method for token calculation in the BaichuanTokenizer class, which can be used for texts containing both Chinese and English characters. This method estimates the number of tokens by counting Chinese characters and English words, applying a specific formula for the estimation:

import re

class BaichuanTokenizer:
    @classmethod
    def count_chinese_characters(cls, text: str) -> int:
        return len(re.findall(r'[\u4e00-\u9fa5]', text))

    @classmethod
    def count_english_vocabularies(cls, text: str) -> int:
        # keep only alphanumeric characters and whitespace; punctuation is stripped too
        text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
        # count the number of words not characters
        return len(text.split())

    @classmethod
    def _get_num_tokens(cls, text: str) -> int:
        # tokens = number of Chinese characters + number of English words * 1.3 (for estimation only, subject to actual return)
        return int(cls.count_chinese_characters(text) + cls.count_english_vocabularies(text) * 1.3)
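
As a quick sanity check of this formula (assuming the class above, with its re import, is in scope; the sample string is illustrative):

# "我的妈妈 is kind": 4 Chinese characters + 2 English words
# -> int(4 + 2 * 1.3) = int(6.6) = 6
print(BaichuanTokenizer._get_num_tokens("我的妈妈 is kind"))

By the same arithmetic, an essay of roughly 800 Chinese characters comes out to roughly 800 tokens, nowhere near the 18 reported here.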

Note that both methods are approximations rather than exact counts; the authoritative usage comes back from the provider. A correct estimate should still cover every element of the input, including any tool or function calls that may be present [2][3][4].

It seems there might be an issue with the token calculation in your specific case. You might want to check if the text is being processed correctly and if the tokenizer is applied as expected.

To continue talking to Dosu, mention @dosu.

crazywoola commented 1 week ago

This image is broken.

AAEE86 commented 1 week ago

> This image is broken.

(screenshot attachment: QQ截图20240907153253, i.e. "QQ screenshot 2024-09-07")

It is indeed incorrect.

AAEE86 commented 1 week ago

> This image is broken.

There may be an error here. https://github.com/langgenius/dify/blob/0cef25ef8c3892e50fb47395a16b8096e7685511/api/core/model_runtime/model_providers/spark/llm/llm.py#L50
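
For comparison, a hedged sketch of a character-ratio estimate in the style of Spark's own token guidance; the helper name is hypothetical, and the ratios (1 token ≈ 1.5 Chinese characters, 1 token ≈ 0.8 English words) are an assumption taken from commonly cited Spark documentation:

import re

def estimate_spark_tokens(text: str) -> int:
    # Hypothetical helper: apply Spark-style ratios to mixed-language text.
    chinese = len(re.findall(r'[\u4e00-\u9fa5]', text))          # Chinese characters
    english = len(re.sub(r'[^a-zA-Z0-9\s]', ' ', text).split())  # remaining words, incl. digits
    return int(chinese / 1.5 + english / 0.8)

print(estimate_spark_tokens("写一篇关于我妈妈的800字作文"))  # 12 chars + 1 word -> 9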

crazywoola commented 6 days ago

Link https://github.com/langgenius/dify/issues/6887