Chinese chunk begins with a full-stop

Self Checks

[X] This is only for bug reports; if you would like to ask a question, please head to Discussions.
[X] I have searched for existing issues, including closed ones.
[X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[X] [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
[X] Please do not modify this template :) and fill in all the required fields.

Dify version

0.7.3

Cloud or Self Hosted

Self Hosted (Docker), Self Hosted (Source)

Steps to reproduce

Upload Chinese documents to a knowledge base and select automatic chunking. Some chunks begin with a full stop (。).

✔️ Expected Behavior

No response

❌ Actual Behavior

I found a similar unsolved discussion that might be relevant to your issue: Help: CSV format document segmentation problem, which was last updated on 2024-08-08 ^[1].

The issue you're experiencing, where some chunks begin with a full stop after uploading Chinese documents and selecting automatic chunking in Dify version 0.7.3, is likely due to the segmentation rules in automatic mode. These rules may not handle punctuation correctly, especially in Chinese, where a sentence ends with "。" and is not followed by a space, unlike the ". " pattern in English.

The automatic chunking algorithm for Chinese text in Dify is implemented using the EnhanceRecursiveCharacterTextSplitter class. This class uses specific parameters for chunk size, chunk overlap, and separators, which include Chinese punctuation like "。". Here is the relevant code snippet:

character_splitter = EnhanceRecursiveCharacterTextSplitter.from_encoder(
    chunk_size=DatasetProcessRule.AUTOMATIC_RULES['segmentation']['max_tokens'],
    chunk_overlap=DatasetProcessRule.AUTOMATIC_RULES['segmentation']['chunk_overlap'],
    separators=["\n\n", "。", ". ", " ", ""],
    embedding_model_instance=embedding_model_instance
)

This setup ensures that the text is split into chunks based on the specified separators, which include the Chinese period "。", making it suitable for processing Chinese text ^[2].
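
One plausible mechanism for the leading full stop can be sketched as follows. This is a minimal illustration, not Dify's actual code: it assumes the splitter keeps each separator and attaches it to the start of the following piece, as some recursive character splitters do when separators are retained. The helpers `split_keep_prefix` and `split_keep_suffix` are hypothetical names introduced only for this example.

```python
import re


def split_keep_prefix(text: str, sep: str) -> list[str]:
    """Split on sep, attaching each separator to the START of the next piece.

    Hypothetical helper: mimics a splitter that retains separators as prefixes.
    """
    # With a capture group, re.split returns: text, sep, text, sep, ..., text
    parts = re.split(f"({re.escape(sep)})", text)
    pieces = [parts[0]]
    for i in range(1, len(parts), 2):
        pieces.append(parts[i] + parts[i + 1])
    return [p for p in pieces if p]


def split_keep_suffix(text: str, sep: str) -> list[str]:
    """Split on sep, attaching each separator to the END of the previous piece."""
    parts = re.split(f"({re.escape(sep)})", text)
    pieces = []
    for i in range(0, len(parts) - 1, 2):
        pieces.append(parts[i] + parts[i + 1])
    pieces.append(parts[-1])  # trailing text after the last separator, if any
    return [p for p in pieces if p]


sample = "今天天气很好。我们去公园。然后回家。"
print(split_keep_prefix(sample, "。"))  # every piece after the first starts with "。"
print(split_keep_suffix(sample, "。"))  # every piece ends with "。"
```

If the prefix behaviour is what happens inside the splitter, any chunk boundary that falls on "。" would yield a chunk starting with the full stop, which matches the report. Attaching the separator to the end of the preceding piece avoids that.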

Additionally, the automatic mode is being phased out, and it is recommended to switch to custom mode for better control over segmentation and cleaning strategies. In custom mode, you can configure segmentation and cleaning strategies according to different document formats and scenario requirements, which may help avoid issues like chunks starting with punctuation marks ^[3]^[4]^[5]^[6].
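
As an illustration of what a cleaning step for this symptom could look like, the sketch below moves any leading punctuation from a chunk onto the end of the previous chunk. It is a hypothetical post-processing pass over a list of chunk strings, not a Dify setting or API; `fix_leading_punctuation` and `LEADING_PUNCT` are names assumed for this example.

```python
# Punctuation that should not open a chunk (assumed set, extend as needed).
LEADING_PUNCT = "。，、；：！？.,;:!?"


def fix_leading_punctuation(chunks: list[str]) -> list[str]:
    """Move leading punctuation of each chunk to the end of the previous chunk."""
    fixed: list[str] = []
    for chunk in chunks:
        stripped = chunk.lstrip(LEADING_PUNCT)
        moved = chunk[: len(chunk) - len(stripped)]
        if moved and fixed:
            # Re-attach the punctuation to the chunk it terminates.
            fixed[-1] += moved
            chunk = stripped
        if chunk:
            fixed.append(chunk)
    return fixed


chunks = ["今天天气很好", "。我们去公园", "。然后回家", "。"]
print(fix_leading_punctuation(chunks))
```

A pass like this lengthens the previous chunk by the moved characters, so it may push a chunk slightly past a strict token limit. The first chunk is left unchanged because there is no earlier chunk to receive its punctuation.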
