langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

The language setting in Segmenting in Question & Answer format doesn't work #9488

Open libukai opened 1 week ago

libukai commented 1 week ago

Dify version

0.9.2

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

[Screenshot: CleanShot 2024-10-18 at 15 47 19@2x]

Even though the language is set to Chinese Simplified, the resulting index segments are in English.

[Screenshot: CleanShot 2024-10-18 at 15 49 46@2x]

✔️ Expected Behavior

No response

❌ Actual Behavior

No response

dosubot[bot] commented 1 week ago

There are similar issues and discussions related to the language setting in the Question & Answer segmentation format not working as expected:

  1. An open issue titled "Q&A format choose Chinese. Shows 0 paragraphs. Other languages work normally." suggests that the problem might be caused by the _format_split_text and format_split_text functions. If the input text contains special characters like "*", these functions may return an empty list, resulting in an empty qa_preview. Additionally, if the output is not in the expected format, the qa_preview will also be empty.

  2. A closed issue titled "Adopt Q&A segmentation mode, segmentation use option: simplified Chinese, or a large number of English segments will be generated." indicates that testing with OpenAI GPT-3.5 worked fine. It was suggested to emphasize the use of language above the prompt and to check out the latest code for improvements.
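The failure mode described in item 1 can be illustrated with a minimal sketch. It assumes a regex-based parser that expects model output in a strict `Q1: ... A1: ...` layout; the pattern and the `split_qa` helper are hypothetical and are not taken from Dify's `_format_split_text` implementation. When the model wraps the labels in markdown emphasis such as `**Q1**:`, the pattern no longer matches and the result is an empty list, which would leave the preview empty.

```python
import re

# Hypothetical pattern (an assumption, not Dify's actual regex):
# expects model output shaped like "Q1: ...\nA1: ...".
QA_PATTERN = re.compile(r"Q\d+:\s*(.*?)\s*A\d+:\s*([\s\S]*?)(?=Q\d+:|$)")


def split_qa(text: str) -> list[dict[str, str]]:
    """Return the question/answer pairs found in text, or [] if none match."""
    pairs = []
    for question, answer in QA_PATTERN.findall(text):
        # Skip pairs where either side is empty.
        if question and answer:
            pairs.append({"question": question, "answer": answer.strip()})
    return pairs


# Output in the expected layout parses into two pairs.
plain = (
    "Q1: What is Dify?\nA1: An LLM app platform.\n"
    "Q2: Is it open source?\nA2: Yes."
)

# Markdown emphasis puts "*" between the label and the colon,
# so the pattern finds nothing and the result is empty.
bold = "**Q1**: What is Dify?\n**A1**: An LLM app platform."

plain_pairs = split_qa(plain)
bold_pairs = split_qa(bold)
```

Under these assumptions, a parser of this kind returns nothing whenever the model's output drifts from the expected labels, regardless of which language was selected.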

These references might provide insights or solutions to the issue you're experiencing with the language setting in the segmentation format [1][2].
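The suggestion from the closed issue, stating the language before the rest of the prompt, might look roughly like the sketch below. The template wording and the `build_qa_prompt` helper are hypothetical and only illustrate the ordering; they are not Dify's actual prompt.

```python
# Hypothetical template; the wording is illustrative, not Dify's actual prompt.
# The language instruction comes first so it is not buried under the source text.
PROMPT_TEMPLATE = (
    "The questions and answers MUST be written in {language}.\n"
    "Generate question and answer pairs from the text below, "
    "formatted as Q1:/A1:, Q2:/A2:, and so on.\n\n"
    "Text:\n{text}"
)


def build_qa_prompt(text: str, language: str) -> str:
    """Fill the template with the chosen output language and the source text."""
    return PROMPT_TEMPLATE.format(language=language, text=text)


source_text = "Dify is an open-source LLM app development platform."
prompt = build_qa_prompt(source_text, "Chinese Simplified")
```

Whether this ordering is enough to keep a given model in the selected language is not established by the issues referenced above.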

libukai commented 6 days ago

Neither of those issues provides useful information. I changed the model and it still generated English segments.

Also, the language setting automatically reverts to English after processing finishes.