langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Q&A format with Chinese selected shows 0 paragraphs; other languages work normally. #5764

Open kissyy16 opened 3 months ago

kissyy16 commented 3 months ago

Dify version

0.6.12-fix1

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Default paragraph preview displays normally, but when I select the Q&A segment format and switch to Q&A mode, it shows nothing.

✔️ Expected Behavior

Correctly splitting Q&A format

❌ Actual Behavior

No response

dosubot[bot] commented 3 months ago

Hey @kissyy16! 👋 I'm here to help you out with any bugs, questions, and even becoming a contributor. Let's tackle this issue together!

In Dify 0.6.12-fix1, the Q&A segment format may show nothing in Q&A mode, even though the default paragraph preview displays normally, if the rate limit has been reached for the Embedding model API key. Hitting that limit can prevent the Q&A pairs from being generated and displayed [1].

To resolve this, check if the rate limit has been reached for the Embedding model API key. You can monitor or adjust these rate limits by referring to the official documentation of the embedding model provider you are using. For example, if you are using OpenAI, you can check their rate limits and adjust your usage accordingly by visiting OpenAI's rate limits documentation [1].

Additionally, ensure that your API key is correctly configured in your self-hosted Docker environment. You can verify or update the API key by navigating to Settings > Model Provider > Hugging Face > Model Type and checking the API Token field [2].

kissyy16 commented 3 months ago

I'm using a local LLM and a local embedding model.

kissyy16 commented 3 months ago

image

kissyy16 commented 3 months ago

English documents are segmented properly, but when I upload a document with Chinese content, the language defaults to the "Chinese" option. The dropdown menu also offers Simplified and Traditional Chinese. Choosing either Simplified Chinese or the default "Chinese" produces no segments.

kissyy16 commented 3 months ago

Documents with Chinese content can also be segmented when English is selected, but not when Chinese is selected. Could this be a bug?

kissyy16 commented 3 months ago

@JohnJyong The result is empty only when Chinese, Chinese Simplified, or Chinese Traditional is selected. The other languages work.

kissyy16 commented 3 months ago

@JohnJyong Changing the request parameter to doc_language="ch" works. You can try it.

JohnJyong commented 3 months ago

When the language is set to Chinese, it seems that your LLM does not perform very well at generating Q&A pairs.

JohnJyong commented 3 months ago

Could you please try another model, such as GPT-4?

sdlszjb commented 3 months ago

Same here. I use qwen:32b as the LLM and bge-large-zh-v1.5 as the embedding model.

About a month ago it still worked correctly, but it doesn't work now.

sdlszjb commented 3 months ago

I now use the qwen2-7b LLM and the bge-large-zh-v1.5 embedding model. In Q&A mode, selecting Chinese Traditional works correctly.

CharlesSong commented 3 months ago

I now use the qwen2-7b LLM and the bge-large-zh-v1.5 embedding model. In Q&A mode, selecting Chinese Traditional works correctly.

Indexing is too slow. How can I speed it up?

silence-pan84 commented 3 months ago

@JohnJyong Changing the request parameter to doc_language="ch" works. You can try it.

How do I set this parameter?

silence-pan84 commented 3 months ago

Has this problem been solved yet? I am also using the "BGE-base-zh-V1.5" embedding model. Segmentation works normally with English but fails with Simplified Chinese.

Mingxiangyu commented 2 months ago

I ran into the same problem: Q&A segmentation does not work with Simplified Chinese. The embedding model is chevalblanc/acge_text_embedding and the Dify version is 0.6.14, but it worked normally a month ago. @CharlesSong @crazywoola

nhha1602 commented 1 month ago

I had the same issue. It is caused by _format_split_text and format_split_text. If the input text contains special characters such as "*" (some LLMs return their result in Markdown format), these functions return an empty list, so qa_preview is empty. One more thing: if the output is not in the format Q1:\nA1:\nQ2:\nA2:..., qa_preview is also empty.
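
To illustrate the failure mode described above, here is a minimal sketch. The regex and the helpers `parse_qa` and `normalize` are assumptions made up for this example; Dify's actual `_format_split_text` may use a different pattern. The point is only that a strict "Q1: ... A1: ..." parser returns an empty list once the LLM wraps the labels in Markdown emphasis, and that stripping the asterisks first lets the same parser succeed.

```python
import re

# Hypothetical strict pattern for "Q1: ... A1: ..." pairs.
# This is an assumption for illustration; Dify's real pattern may differ.
QA_PATTERN = re.compile(r"Q\d+:\s*(.*?)\s*A\d+:\s*([\s\S]*?)(?=Q\d+:|$)")


def parse_qa(text: str) -> list[dict[str, str]]:
    """Return the question/answer pairs found in strictly formatted text."""
    return [
        {"question": q.strip(), "answer": a.strip()}
        for q, a in QA_PATTERN.findall(text)
        if q.strip() and a.strip()
    ]


def normalize(text: str) -> str:
    """Remove every '*' so Markdown emphasis no longer hides the labels.

    This is deliberately blunt: it also removes asterisks that are part of
    the question or answer text.
    """
    return text.replace("*", "")


# Output that follows the expected format exactly.
strict = (
    "Q1: What is Dify?\nA1: An LLM app platform.\n"
    "Q2: Is it open source?\nA2: Yes."
)

# The same kind of output, but with the labels wrapped in Markdown bold.
markdown = "**Q1:** What is Dify?\n**A1:** An LLM app platform."

strict_pairs = parse_qa(strict)                 # two pairs
markdown_pairs = parse_qa(markdown)             # empty list
fixed_pairs = parse_qa(normalize(markdown))     # one pair
```

With the Markdown sample, the question group cannot span the newline and the `**` in front of `A1:` blocks the label match, so nothing is captured. A more careful fix would strip emphasis only around the `Q<n>:` and `A<n>:` labels instead of removing every asterisk.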