langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Cohere.tokenize Request Results in JSONDecodeError: Expecting value: line 1 column 1 (char 0) #5748

Closed · XiaoLey closed this issue 1 month ago

XiaoLey commented 3 months ago


Dify version

0.6.12

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

When using Dify to perform "Text Segmentation and Cleaning" on a knowledge base, a JSONDecodeError occurs during a tokenize request to the Cohere API, preventing the proper decoding of the API response. The detailed error message is as follows:

worker-1  | [2024-06-29 11:32:55,999: INFO/MainProcess] HTTP Request: POST https://api.cohere.ai/v1/tokenize "HTTP/1.1 200 OK"
worker-1  | [2024-06-29 11:32:56,130: INFO/MainProcess] HTTP Request: POST https://api.cohere.ai/v1/tokenize "HTTP/1.1 500 Internal Server Error"
worker-1  | [2024-06-29 11:32:56,131: ERROR/MainProcess] consume document failed
worker-1  | Traceback (most recent call last):
worker-1  |   File "/app/api/core/model_runtime/model_providers/cohere/text_embedding/text_embedding.py", line 122, in get_num_tokens
worker-1  |     response = self._tokenize(
worker-1  |   File "/app/api/core/model_runtime/model_providers/cohere/text_embedding/text_embedding.py", line 146, in _tokenize
worker-1  |     response = client.tokenize(
worker-1  |   File "/app/api/.venv/lib/python3.10/site-packages/cohere/client.py", line 218, in tokenize
worker-1  |     return super().tokenize(text=text, model=model, request_options=opts)
worker-1  |   File "/app/api/.venv/lib/python3.10/site-packages/cohere/base_client.py", line 1557, in tokenize
worker-1  |     typing.cast(typing.Any, construct_type(type_=typing.Any, object_=_response.json()))  # type: ignore
worker-1  |   File "/app/api/.venv/lib/python3.10/site-packages/httpx/_models.py", line 764, in json
worker-1  |     return jsonlib.loads(self.content, **kwargs)
worker-1  |   File "/usr/local/lib/python3.10/json/__init__.py", line 346, in loads
worker-1  |     return _default_decoder.decode(s)
worker-1  |   File "/usr/local/lib/python3.10/json/decoder.py", line 337, in decode
worker-1  |     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
worker-1  |   File "/usr/local/lib/python3.10/json/decoder.py", line 355, in raw_decode
worker-1  |     raise JSONDecodeError("Expecting value", s, err.value) from None
worker-1  | json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
worker-1  | 
worker-1  | During handling of the above exception, another exception occurred:
worker-1  | 
worker-1  | Traceback (most recent call last):
worker-1  |   File "/app/api/core/indexing_runner.py", line 73, in run
worker-1  |     self._load_segments(dataset, dataset_document, documents)
worker-1  |   File "/app/api/core/indexing_runner.py", line 848, in _load_segments
worker-1  |     doc_store.add_documents(documents)
worker-1  |   File "/app/api/core/rag/docstore/dataset_docstore.py", line 97, in add_documents
worker-1  |     tokens = embedding_model.get_text_embedding_num_tokens(
worker-1  |   File "/app/api/core/model_manager.py", line 189, in get_text_embedding_num_tokens
worker-1  |     return self._round_robin_invoke(
worker-1  |   File "/app/api/core/model_manager.py", line 303, in _round_robin_invoke
worker-1  |     return function(*args, **kwargs)
worker-1  |   File "/app/api/core/model_runtime/model_providers/cohere/text_embedding/text_embedding.py", line 128, in get_num_tokens
worker-1  |     raise self._transform_invoke_error(e)
worker-1  | core.model_runtime.errors.invoke.InvokeError: [cohere] Error: Expecting value: line 1 column 1 (char 0)

✔️ Expected Behavior

No response

❌ Actual Behavior

The Cohere API returns a 500 Internal Server Error, and decoding its response body then fails with the JSONDecodeError shown in the traceback above.
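For context, the JSONDecodeError in the traceback is the generic error `json.loads` raises when a response body is not JSON at all, which is common when a 500 comes back as an HTML or plain-text error page. A minimal stdlib illustration (the `parse_api_body` helper is hypothetical, not Dify code):

```python
import json


def parse_api_body(body: str):
    """Parse a response body as JSON, surfacing the raw text on failure."""
    try:
        return json.loads(body)
    except json.JSONDecodeError as e:
        # A 500 from an API gateway often carries a non-JSON body, which is
        # exactly what produces "Expecting value: line 1 column 1 (char 0)".
        raise ValueError(f"non-JSON response body: {body[:80]!r}") from e


# json.loads fails at the very first character of an HTML error page
try:
    json.loads("<html>internal server error</html>")
except json.JSONDecodeError as e:
    print(e)  # Expecting value: line 1 column 1 (char 0)
```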

dosubot[bot] commented 3 months ago

To resolve the JSONDecodeError during a tokenize request to the Cohere API in Dify 0.6.12 (self-hosted via Docker), first ensure that the JSON payload you are sending is correctly formatted. Here is an example of a correctly formatted payload:

{
    "name": "Dify",
    "text": "Dify means Do it for you...",
    "indexing_technique": "high_quality",
    "process_rule": {
        "rules": {
            "pre_processing_rules": [
                {
                    "id": "remove_extra_spaces",
                    "enabled": true
                },
                {
                    "id": "remove_urls_emails",
                    "enabled": true
                }
            ],
            "segmentation": {
                "separator": "###",
                "max_tokens": 500
            }
        },
        "mode": "custom"
    }
}

Ensure that the JSON structure is correct and that all necessary fields are included. If the JSON payload is malformed or missing required fields, it can lead to a JSONDecodeError [1].

Additionally, there are known issues and limitations with the Cohere API that can cause a 500 Internal Server Error during a tokenize request. Specifically, cohere.errors.internal_server_error.InternalServerError is mapped to Dify's InvokeServerUnavailableError, so an internal server error from the Cohere API surfaces as a 500 Internal Server Error [2].
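As an aside (this is not Dify's actual code), transient server errors like this are commonly handled with a bounded retry and exponential backoff. A generic stdlib sketch, where the callable and the retriable exception types are placeholders for whatever the caller uses:

```python
import random
import time


def with_retries(call, max_retries=3, base_delay=0.5, retriable=(RuntimeError,)):
    """Retry a zero-argument callable on transient errors.

    Uses exponential backoff with a little jitter; re-raises the last
    exception once max_retries is exhausted.
    """
    for attempt in range(max_retries + 1):
        try:
            return call()
        except retriable:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

Note that Dify's _tokenize call above deliberately passes max_retries=0 to the Cohere client, so any retry policy would have to live in the caller.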

To get more detailed information about the tokenize request, you can configure the logging level to DEBUG:

import logging

# Set the logging level to DEBUG
logging.basicConfig(level=logging.DEBUG)

# Example usage
logger = logging.getLogger(__name__)
logger.debug("This is a debug message")

In the context of the provided code, you can add debug logging statements to the _tokenize method to log detailed information about the tokenize request:

# assumed imports for this snippet (the real module has its own)
import logging

import cohere
from cohere.core import RequestOptions

logger = logging.getLogger(__name__)


def _tokenize(self, model: str, credentials: dict, text: str) -> list[str]:
    """
    Tokenize text
    :param model: model name
    :param credentials: model credentials
    :param text: text to tokenize
    :return: list of token strings
    """
    if not text:
        return []

    # initialize client
    client = cohere.Client(credentials.get('api_key'), base_url=credentials.get('base_url'))

    logger.debug(f"Tokenizing text: {text} with model: {model}")

    response = client.tokenize(
        text=text,
        model=model,
        offline=False,
        request_options=RequestOptions(max_retries=0)
    )

    logger.debug(f"Tokenize response: {response.token_strings}")

    return response.token_strings

By setting the logging level to DEBUG and adding debug logging statements, you can get more detailed information about the tokenize request to the Cohere API [3][2].

To continue talking to Dosu, mention @dosu.

laipz8200 commented 3 months ago

Looks like this issue is occurring within the Cohere client. Could you provide more detailed information? For example, what content are you processing when this error is triggered?

XiaoLey commented 2 months ago

I initiated a high-quality text segmentation and cleaning process on the content extracted from https://swagger.io/specification/ using Firecrawl. After waiting for quite some time, an exception occurred. It seems that a large portion of the content was processed successfully, as I can see some segmentation results, but the results are incomplete.

@laipz8200
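Since the failure happened partway through a large scraped document, one mitigation (a sketch only; `chunk_text` is a hypothetical helper, not part of Dify) is to bound the size of each piece of text sent to the tokenize endpoint, so a single oversized request cannot take down the whole indexing run:

```python
def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Prefers breaking on a space inside the window so words stay intact;
    falls back to a hard cut when a run has no spaces.
    """
    chunks = []
    start = 0
    n = len(text)
    while start < n:
        end = min(start + max_chars, n)
        if end < n:
            # back up to the last space inside the window, if any
            cut = text.rfind(" ", start + 1, end)
            if cut != -1:
                end = cut
        chunks.append(text[start:end])
        start = end
        # skip the separator space(s) between chunks
        while start < n and text[start] == " ":
            start += 1
    return chunks
```

Each chunk could then be tokenized independently, with failures isolated to one chunk instead of aborting the entire document.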