infiniflow / ragflow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
https://ragflow.io
Apache License 2.0

[Bug]: Web App API **ERROR**: An error occurred during streaming #3062

Open · nakroy opened this issue 1 month ago

nakroy commented 1 month ago


Branch name

v0.12.0

Commit ID

na

Other environment information

Hardware parameters: 
GPU: RTX 4090

OS type: Ubuntu 22.04 LTS

LLM Backend: Xinference (v0.15.4)

LLM Model: Mistral-Nemo-Instruct-2407

Actual behavior

I created an agent for testing. When I run a test chat in the agent UI, it answers well (the LLM backend is Xinference and the model I use is Mistral-Nemo-Instruct-2407).

But when I chat through the API by starting a Web App, it fails to answer the question and returns an error response: **ERROR**: An error occurred during streaming

Expected behavior

The Web App served through the API should chat as normally as the basic test chat does.

Steps to reproduce

1. Create an agent using the "General-purpose chatbot" template;
2. Configure it to use an Xinference LLM for answering questions instead of the default DeepSeek model;
3. Launch the agent directly and test chatting (works well);
4. To chat via the API, create an API token and then start a Web App (preview);
5. Test chatting in the Web App; it fails with the error "**ERROR**: An error occurred during streaming".

Additional information

Xinference Worker Error log:

2024-10-27 23:26:43,243 xinference.core.model 6784 DEBUG    After request chat, current serve request count: 0 for the model Mistral-Nemo-Instruct-2407
2024-10-27 23:27:02,947 xinference.core.worker 1 DEBUG    Enter get_model, args: <xinference.core.worker.WorkerActor object at 0x7efcc72464d0>, kwargs: model_uid=Mistral-Nemo-Instruct-2407-1-0
2024-10-27 23:27:02,948 xinference.core.worker 1 DEBUG    Leave get_model, elapsed time: 0 s
2024-10-27 23:27:02,954 xinference.core.worker 1 DEBUG    Enter describe_model, args: <xinference.core.worker.WorkerActor object at 0x7efcc72464d0>, kwargs: model_uid=Mistral-Nemo-Instruct-2407-1-0
2024-10-27 23:27:02,954 xinference.core.worker 1 DEBUG    Leave describe_model, elapsed time: 0 s
2024-10-27 23:27:02,960 xinference.core.model 6784 DEBUG    Request chat, current serve request count: 0, request limit: None for the model Mistral-Nemo-Instruct-2407
2024-10-27 23:27:02,961 xinference.core.model 6784 DEBUG    [request a519daea-94f5-11ef-aade-067b69b26261] Enter chat, args: <xinference.core.model.ModelActor object at 0x7f3a7f75f880>,[{'role': 'system', 'content': "\n        You're a text classifier. You need to categorize the user’...,{'frequency_penalty': 0.7, 'max_tokens': 256, 'presence_penalty': 0.4, 'temperature': 0.28, 'top_p':..., kwargs: raw_params={'frequency_penalty': 0.7, 'max_tokens': 256, 'presence_penalty': 0.4, 'temperature': 0.28, 'top_p':...
2024-10-27 23:27:02,962 xinference.model.llm.vllm.core 6784 DEBUG    Enter generate, prompt: <s>[INST]
        You're a text classifier. You need to categorize the user’s questions into 5 categories, 
        namely: chitchatting/finance/medical/other/weather
        Here's description of each category:
        --------------------
Category: chitchatting
Description: Regarding the issues of small talk, companionship, sharing, and emotional intimacy
--------------------
Category: finance
Description: Question is about finace/economic information, stock market, economic news.
--------------------
Category: medical
Description: Question is about medical issue, health, illness or medicine etc,.
--------------------
Category: weather
Description: Question is about weather.
        You could learn from the following examples:
        Question: What's your name? Category: chitchatting- Question: Are you female o male. Category: chitchatting- Question: What a bad day! Category: chitchatting- Question: Terrible day. Category: chitchatting- Question: How are you today? Category: chitchatting- Question: Do you have boy friend? Category: chitchatting- Question: 昨日涨幅大于5%的军工股? Category: finance- Question: Stocks have MACD buyin signals? Category: finance- Question: When is the next interest rate cut by the Federal Reserve? Category: finance- Question: 国家救市都有哪些举措? Category: finance- Question: How to relieve the headache? Category: medical- Question: Can't sleep, what to do? Category: medical- Question: What the effect of coffee in terms of losing weight? Category: medical- Question: Will it rain tomorrow? Category: weather- Question: Is it sunny next day? Category: weather- Question: What is average temperature next week? Category: weather
        You could learn from the above examples.
        Just mention the category names, no need for any additional words.

Question: hi, who are youCategory: [/INST], generate config: {'frequency_penalty': 0.7, 'max_tokens': 256, 'presence_penalty': 0.4, 'temperature': 0.28, 'top_p': 0.3, 'stop': ['</s>'], 'stop_token_ids': [2]}
INFO 10-27 23:27:02 async_llm_engine.py:206] Added request a519f4da-94f5-11ef-aade-067b69b26261.
INFO 10-27 23:27:03 metrics.py:351] Avg prompt throughput: 21.2 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 10-27 23:27:03 async_llm_engine.py:174] Finished request a519f4da-94f5-11ef-aade-067b69b26261.
INFO 10-27 23:27:03 async_llm_engine.py:174] Finished request a519f4da-94f5-11ef-aade-067b69b26261.
2024-10-27 23:27:03,247 xinference.core.model 6784 DEBUG    [request a519daea-94f5-11ef-aade-067b69b26261] Leave chat, elapsed time: 0 s
2024-10-27 23:27:03,247 xinference.core.model 6784 DEBUG    After request chat, current serve request count: 0 for the model Mistral-Nemo-Instruct-2407
2024-10-27 23:27:03,382 xinference.core.worker 1 DEBUG    Enter get_model, args: <xinference.core.worker.WorkerActor object at 0x7efcc72464d0>, kwargs: model_uid=Mistral-Nemo-Instruct-2407-1-0
2024-10-27 23:27:03,382 xinference.core.worker 1 DEBUG    Leave get_model, elapsed time: 0 s
2024-10-27 23:27:03,388 xinference.core.worker 1 DEBUG    Enter describe_model, args: <xinference.core.worker.WorkerActor object at 0x7efcc72464d0>, kwargs: model_uid=Mistral-Nemo-Instruct-2407-1-0
2024-10-27 23:27:03,389 xinference.core.worker 1 DEBUG    Leave describe_model, elapsed time: 0 s
2024-10-27 23:27:03,394 xinference.core.model 6784 DEBUG    Request chat, current serve request count: 0, request limit: None for the model Mistral-Nemo-Instruct-2407
2024-10-27 23:27:03,395 xinference.core.model 6784 DEBUG    [request a55c2652-94f5-11ef-aade-067b69b26261] Enter chat, args: <xinference.core.model.ModelActor object at 0x7f3a7f75f880>,[{'role': 'system', 'content': 'Role:  You‘re warm-hearted lovely young girl, 22 years old, located ...,{'frequency_penalty': 0.7, 'presence_penalty': 0.4, 'temperature': 0.45, 'top_p': 0.3, 'stream': Tru..., kwargs: raw_params={'frequency_penalty': 0.7, 'presence_penalty': 0.4, 'stream': True, 'temperature': 0.45, 'top_p': 0....
2024-10-27 23:27:03,396 xinference.core.model 6784 ERROR    [request a55c2652-94f5-11ef-aade-067b69b26261] Leave chat, error: After the optional system message, conversation roles must alternate user/assistant/user/assistant/..., elapsed time: 0 s
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/utils.py", line 78, in wrapped
    ret = await func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 709, in chat
    response = await self._call_wrapper_json(
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 517, in _call_wrapper_json
    return await self._call_wrapper("json", fn, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 122, in _async_wrapper
    return await fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 526, in _call_wrapper
    ret = await fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/vllm/utils.py", line 30, in _async_wrapper
    return await fn(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/vllm/core.py", line 690, in async_chat
    full_prompt = self.get_full_context(
  File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/utils.py", line 119, in get_full_context
    return self._build_from_raw_template(messages, chat_template, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/utils.py", line 93, in _build_from_raw_template
    rendered = compiled_template.render(
  File "/usr/local/lib/python3.10/dist-packages/jinja2/environment.py", line 1304, in render
    self.environment.handle_exception()
  File "/usr/local/lib/python3.10/dist-packages/jinja2/environment.py", line 939, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "<template>", line 18, in top-level template code
  File "/usr/local/lib/python3.10/dist-packages/jinja2/sandbox.py", line 394, in call
    return __context.call(__obj, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/utils.py", line 83, in raise_exception
    raise TemplateError(message)
jinja2.exceptions.TemplateError: After the optional system message, conversation roles must alternate user/assistant/user/assistant/...
2024-10-27 23:27:03,397 xinference.core.model 6784 DEBUG    After request chat, current serve request count: 0 for the model Mistral-Nemo-Instruct-2407
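
The traceback shows where the second request dies: the model's Jinja chat template rejects it because the conversation roles do not strictly alternate user/assistant after the optional system message. The truncated log doesn't reveal the exact offending ordering, but two common triggers are an assistant greeting placed before any user turn, or two consecutive user turns. The following is a minimal repro sketch outside RAGFlow, assuming Xinference's OpenAI-compatible endpoint on its default http://localhost:9997/v1 (adjust host/port for your deployment) and the model UID from the log; the consecutive user turns are my assumed trigger, not confirmed from the log:

```python
# Minimal repro sketch against Xinference's OpenAI-compatible API.
# Assumptions: Xinference listens on localhost:9997 (its default) and the
# model UID is Mistral-Nemo-Instruct-2407, as in the log above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-needed")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hi"},
    {"role": "user", "content": "who are you"},  # second user turn in a row
]

# Expected to fail server-side with the same error as the log:
#   jinja2.exceptions.TemplateError: After the optional system message,
#   conversation roles must alternate user/assistant/user/assistant/...
stream = client.chat.completions.create(
    model="Mistral-Nemo-Instruct-2407",
    messages=messages,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```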
KevinHuSh commented 4 weeks ago

Judging by the error message, Xinference does not seem to have adapted its OpenAI-compatible API for 'Mistral-Nemo-Instruct-2407'. You could submit a feature request on the Xinference GitHub.

nakroy commented 4 weeks ago

> Judging by the error message, Xinference does not seem to have adapted its OpenAI-compatible API for 'Mistral-Nemo-Instruct-2407'. You could submit a feature request on the Xinference GitHub.

Alright, I'll test llama-3.1 and see if that model works fine. I'll also dig in and check whether Xinference can fully support the Nemo models...
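
If it turns out the Web App's message history is what breaks the alternation, a client-side normalization pass could work around it until one side fixes it. This is only a sketch; `normalize_messages` is a hypothetical helper, not a RAGFlow or Xinference API:

```python
# Sketch of a normalization pass, assuming messages is a list of
# {"role": ..., "content": ...} dicts. It drops an assistant turn that
# precedes any user turn (e.g. an opening greeting) and merges consecutive
# same-role turns, so roles alternate as the Mistral chat template requires.
def normalize_messages(messages):
    normalized = []
    for msg in messages:
        role = msg["role"]
        # The template rejects an assistant message before the first user
        # message, so skip such turns entirely.
        if role == "assistant" and not any(
            m["role"] == "user" for m in normalized
        ):
            continue
        if normalized and normalized[-1]["role"] == role and role != "system":
            # Merge consecutive same-role turns into one turn.
            normalized[-1]["content"] += "\n" + msg["content"]
        else:
            normalized.append(dict(msg))
    return normalized
```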