Closed Sumkor closed 1 month ago
I found a similar unsolved discussion that might be relevant to your issue: Help: CSV format document segmentation problem, which was last updated on 2024-08-08 [1].
The issue you're seeing in Dify 0.7.3, where some chunks begin with a full stop after you upload Chinese documents and select automatic chunking, is likely caused by the segmentation rules used in automatic mode. These rules may not handle punctuation correctly for Chinese, where sentences end with "。" and no space follows the punctuation mark, unlike the English ". " pattern.
The automatic chunking algorithm for Chinese text in Dify is implemented using the EnhanceRecursiveCharacterTextSplitter class. This class takes parameters for chunk size, chunk overlap, and separators, and the separators include Chinese punctuation such as "。". Here is the relevant code snippet:
character_splitter = EnhanceRecursiveCharacterTextSplitter.from_encoder(
    chunk_size=DatasetProcessRule.AUTOMATIC_RULES['segmentation']['max_tokens'],
    chunk_overlap=DatasetProcessRule.AUTOMATIC_RULES['segmentation']['chunk_overlap'],
    separators=["\n\n", "。", ". ", " ", ""],
    embedding_model_instance=embedding_model_instance
)
This setup ensures that the text is split into chunks based on the specified separators, which include the Chinese period "。", making it suitable for processing Chinese text [2].
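To illustrate how a separator-based split can produce chunks that start with "。", here is a minimal standalone sketch. It does not use Dify's code. It assumes the splitter keeps each separator attached to the start of the piece that follows it, as LangChain-style recursive splitters have done; whether Dify 0.7.3 behaves exactly this way is an assumption. The function names split_keep_separator_at_start and move_leading_separator_back are hypothetical and exist only for this example.

```python
import re


def split_keep_separator_at_start(text: str, separator: str) -> list[str]:
    """Split text, leaving each separator at the START of the following piece.

    This models the assumed behaviour only; it is not Dify's implementation.
    """
    # A capturing group makes re.split return: text, sep, text, sep, ..., text
    parts = re.split(f"({re.escape(separator)})", text)
    pieces = [parts[0]]
    for i in range(1, len(parts), 2):
        # Glue each separator onto the text that comes after it.
        pieces.append(parts[i] + parts[i + 1])
    return [p for p in pieces if p]


def move_leading_separator_back(pieces: list[str], separator: str) -> list[str]:
    """Post-process pieces so a leading separator ends the previous piece instead."""
    fixed: list[str] = []
    for piece in pieces:
        if piece.startswith(separator) and fixed:
            fixed[-1] += separator
            piece = piece[len(separator):]
        if piece:
            fixed.append(piece)
    return fixed


text = "今天天气很好。我们去公园散步。然后回家吃饭。"
raw = split_keep_separator_at_start(text, "。")
fixed = move_leading_separator_back(raw, "。")

print(raw)    # pieces after the first one begin with "。"
print(fixed)  # each piece now ends with "。" instead
```

If this assumption matches what you observe, a post-processing step like the second function, or a custom-mode separator that never leaves "。" at a chunk boundary, should keep chunks from starting with a full stop.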
Additionally, the automatic mode is being phased out, and it is recommended to switch to custom mode for better control over segmentation and cleaning strategies. In custom mode, you can configure segmentation and cleaning strategies according to different document formats and scenario requirements, which may help avoid issues like chunks starting with punctuation marks [3][4][5][6].
Self Checks
Dify version
0.7.3
Cloud or Self Hosted
Self Hosted (Docker), Self Hosted (Source)
Steps to reproduce
Upload Chinese documents to a knowledge base and select automatic chunking. Some chunks begin with a full stop (。).
✔️ Expected Behavior
No response
❌ Actual Behavior