langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Spark's large language model token calculation error #7911

Open 13006319571 opened 1 week ago

13006319571 commented 1 week ago

Dify version

0.7.2

Cloud or Self Hosted

Self Hosted (Source)

Steps to reproduce

  1. Create a chat application.
  2. Ask it to write an 800-word essay about my mother.
  3. Check the number of tokens reported.
  4. (screenshot attachment failed to upload)

✔️ Expected Behavior

The token count should be several hundred, reflecting the actual length of the generated essay.

❌ Actual Behavior

The number of tokens calculated by the platform is only 18

dosubot[bot] commented 1 week ago

Hello @13006319571! I'm here to assist you with any bugs, questions, or contributions you have for the repository. Let's work together to solve any issues you're encountering.

I found a similar unresolved discussion that might be relevant to your issue: Max Token Limits Incorrect?, which was last updated on 2024-08-14T04:25:50Z [1].

Regarding the token calculation in Dify, the platform uses an approximation based on the GPT-2 tokenizer. The method get_num_tokens sums up the token counts for each text in the input list by calling _get_num_tokens_by_gpt2. Here is the relevant code snippet:

def get_num_tokens(self, model: str, credentials: dict, texts: list[str]) -> int:
    """
    Approximate number of tokens for given messages using GPT2 tokenizer
    :param model: model name
    :param credentials: model credentials
    :param texts: texts to embed
    :return:
    """
    return sum(self._get_num_tokens_by_gpt2(text) for text in texts)
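
For a rough local check, here is a minimal sketch of the same GPT-2-based counting, using the public Hugging Face "gpt2" tokenizer as a stand-in for the tokenizer files Dify bundles (the sample prompt is illustrative):

from transformers import GPT2Tokenizer

# Stand-in for Dify's bundled GPT-2 tokenizer (assumption).
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# A prompt similar to the one in this issue.
text = "写一篇关于我妈妈的800字作文"
print(len(tokenizer.encode(text)))  # byte-level BPE emits several tokens per Chinese character

Because GPT-2's byte-level BPE typically yields two or more tokens per Chinese character, an 800-character essay should count well into the hundreds, so a report of 18 points at the provider-side calculation rather than the tokenizer itself.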

Additionally, Dify provides a method for token calculation in the BaichuanTokenizer class, which can be used for texts containing both Chinese and English characters. This method estimates the number of tokens by counting Chinese characters and English words, applying a specific formula for the estimation:

import re

class BaichuanTokenizer:
    @classmethod
    def count_chinese_characters(cls, text: str) -> int:
        return len(re.findall(r'[\u4e00-\u9fa5]', text))

    @classmethod
    def count_english_vocabularies(cls, text: str) -> int:
        # keep only alphanumeric characters and whitespace; punctuation is stripped too
        text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
        # count the number of words not characters
        return len(text.split())

    @classmethod
    def _get_num_tokens(cls, text: str) -> int:
        # tokens = number of Chinese characters + number of English words * 1.3 (for estimation only, subject to actual return)
        return int(cls.count_chinese_characters(text) + cls.count_english_vocabularies(text) * 1.3)
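
As a quick sanity check of this formula (assuming the class above, with its re import, is in scope; the sample string is illustrative):

# "我的妈妈 is kind": 4 Chinese characters + 2 English words
# -> int(4 + 2 * 1.3) = int(6.6) = 6
print(BaichuanTokenizer._get_num_tokens("我的妈妈 is kind"))

By the same arithmetic, an essay of roughly 800 Chinese characters comes out to roughly 800 tokens, nowhere near the 18 reported here.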

Note that both methods are approximations rather than exact counts; the authoritative usage comes back from the provider. A correct estimate should still cover every element of the input, including any tool or function calls that may be present [2][3][4].

It seems there might be an issue with the token calculation in your specific case. You might want to check if the text is being processed correctly and if the tokenizer is applied as expected.

To continue talking to Dosu, mention @dosu.

crazywoola commented 1 week ago

This image is broken.

AAEE86 commented 1 week ago

> This image is broken.

(screenshot attachment: QQ截图20240907153253, i.e. "QQ screenshot 2024-09-07")

It is indeed incorrect.

AAEE86 commented 1 week ago

> This image is broken.

There may be an error here. https://github.com/langgenius/dify/blob/0cef25ef8c3892e50fb47395a16b8096e7685511/api/core/model_runtime/model_providers/spark/llm/llm.py#L50
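
For comparison, a hedged sketch of a character-ratio estimate in the style of Spark's own token guidance; the helper name is hypothetical, and the ratios (1 token ≈ 1.5 Chinese characters, 1 token ≈ 0.8 English words) are an assumption taken from commonly cited Spark documentation:

import re

def estimate_spark_tokens(text: str) -> int:
    # Hypothetical helper: apply Spark-style ratios to mixed-language text.
    chinese = len(re.findall(r'[\u4e00-\u9fa5]', text))          # Chinese characters
    english = len(re.sub(r'[^a-zA-Z0-9\s]', ' ', text).split())  # remaining words, incl. digits
    return int(chinese / 1.5 + english / 0.8)

print(estimate_spark_tokens("写一篇关于我妈妈的800字作文"))  # 12 chars + 1 word -> 9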

crazywoola commented 6 days ago

Link https://github.com/langgenius/dify/issues/6887