langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Chinese chunk begins with a full-stop #7929

Closed: Sumkor closed this issue 1 month ago

Sumkor commented 1 month ago

Dify version

0.7.3

Cloud or Self Hosted

Self Hosted (Docker), Self Hosted (Source)

Steps to reproduce

Upload Chinese documents to a knowledge base and select automatic chunking. Some chunks begin with a full stop (。).

✔️ Expected Behavior

No response

❌ Actual Behavior

[Image]

dosubot[bot] commented 1 month ago

I found a similar unsolved discussion that might be relevant to your issue: Help: CSV format document segmentation problem, which was last updated on 2024-08-08 [1].

The chunks that begin with a full stop when you upload Chinese documents and select automatic chunking in Dify 0.7.3 are likely caused by the segmentation rules in automatic mode. These rules may not handle punctuation correctly, especially in Chinese, where text is not separated by spaces and sentence boundaries are marked only by punctuation such as "。".

The automatic chunking algorithm for Chinese text in Dify is implemented using the EnhanceRecursiveCharacterTextSplitter class. This class uses specific parameters for chunk size, chunk overlap, and separators, which include Chinese punctuation like "。". Here is the relevant code snippet:

character_splitter = EnhanceRecursiveCharacterTextSplitter.from_encoder(
    chunk_size=DatasetProcessRule.AUTOMATIC_RULES['segmentation']['max_tokens'],
    chunk_overlap=DatasetProcessRule.AUTOMATIC_RULES['segmentation']['chunk_overlap'],
    separators=["\n\n", "。", ". ", " ", ""],
    embedding_model_instance=embedding_model_instance
)

This setup ensures that the text is split into chunks based on the specified separators, which include the Chinese period "。", making it suitable for processing Chinese text [2].
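One plausible explanation for the leading full stop can be sketched in a few lines of plain Python. This is not Dify's code. It assumes the splitter keeps each separator and attaches it to the start of the following piece, which is how recursive character splitters of this kind commonly behave by default. The helper names `split_keep_separator_at_start` and `split_keep_separator_at_end` are hypothetical and exist only for this illustration.

```python
import re


def split_keep_separator_at_start(text: str, separator: str) -> list[str]:
    """Split on `separator`, attaching it to the START of the next piece.

    Assumption: this mimics the default behaviour suspected in the issue.
    """
    # The capture group makes re.split keep the separators in the result,
    # so `parts` alternates: text, separator, text, separator, ..., text.
    parts = re.split(f"({re.escape(separator)})", text)
    pieces = [parts[0]]
    for i in range(1, len(parts), 2):
        pieces.append(parts[i] + parts[i + 1])
    return [p for p in pieces if p]


def split_keep_separator_at_end(text: str, separator: str) -> list[str]:
    """Split on `separator`, attaching it to the END of the previous piece."""
    parts = re.split(f"({re.escape(separator)})", text)
    pieces = []
    for i in range(0, len(parts) - 1, 2):
        pieces.append(parts[i] + parts[i + 1])
    if parts[-1]:
        # Trailing text that is not followed by a separator.
        pieces.append(parts[-1])
    return [p for p in pieces if p]


text = "今天天气很好。我们去公园。"

# Separator kept at the start: every piece after the first begins with "。".
print(split_keep_separator_at_start(text, "。"))

# Separator kept at the end: each sentence keeps its own full stop.
print(split_keep_separator_at_end(text, "。"))
```

If the assumption holds, the first variant reproduces the reported symptom and the second shows one way a splitter could avoid it. Whether Dify's splitter exposes an option for this is not confirmed by this thread.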

Additionally, the automatic mode is being phased out, and it is recommended to switch to custom mode for better control over segmentation and cleaning strategies. In custom mode, you can configure segmentation and cleaning strategies according to different document formats and scenario requirements, which may help avoid issues like chunks starting with punctuation marks [3][4][5][6].
