Open Delgerskhn opened 1 month ago
@Delgerskhn Hello there! I'm here to help you with bug reports, answer questions, and guide you on becoming a contributor. Let's squash those bugs together! 🤖
To resolve the issue of your Notion document not splitting into the expected 14 segments when using the custom segment identifier "___" in Dify version 0.7.0, you need to ensure that the custom segment identifier is correctly implemented in the _get_notion_block_data method.
Currently, the code does not show any implementation for custom segment identifiers like "___". You need to add logic to handle this custom segment identifier.
Here is an example of how you can modify the _get_notion_block_data method to include the custom segment identifier "___":
    from typing import Any

    import requests

    # BLOCK_CHILD_URL_TMPL and HEADING_SPLITTER are assumed to be defined in
    # the surrounding module; this is a method of the Notion extractor class.
    def _get_notion_block_data(self, page_id: str) -> list[str]:
        result_lines_arr = []
        start_cursor = None
        block_url = BLOCK_CHILD_URL_TMPL.format(block_id=page_id)
        while True:
            # Request the next page of child blocks, if a cursor is set
            query_dict: dict[str, Any] = {} if not start_cursor else {'start_cursor': start_cursor}
            res = requests.request(
                "GET",
                block_url,
                headers={
                    "Authorization": "Bearer " + self._notion_access_token,
                    "Content-Type": "application/json",
                    "Notion-Version": "2022-06-28",
                },
                params=query_dict
            )
            data = res.json()
            for result in data["results"]:
                result_type = result["type"]
                result_obj = result[result_type]
                cur_result_text_arr = []
                if result_type == 'table':
                    result_block_id = result["id"]
                    text = self._read_table_rows(result_block_id)
                    text += "\n\n"
                    result_lines_arr.append(text)
                else:
                    if "rich_text" in result_obj:
                        for rich_text in result_obj["rich_text"]:
                            # skip if doesn't have text object
                            if "text" in rich_text:
                                text = rich_text["text"]["content"]
                                cur_result_text_arr.append(text)

                    result_block_id = result["id"]
                    has_children = result["has_children"]
                    block_type = result["type"]
                    if has_children and block_type != 'child_page':
                        children_text = self._read_block(
                            result_block_id, num_tabs=1
                        )
                        cur_result_text_arr.append(children_text)

                    cur_result_text = "\n".join(cur_result_text_arr)
                    if result_type in HEADING_SPLITTER:
                        result_lines_arr.append(f"{HEADING_SPLITTER[result_type]}{cur_result_text}")
                    else:
                        result_lines_arr.append(cur_result_text + '\n\n')

            # Stop once the API reports that there are no more pages
            if data["next_cursor"] is None:
                break
            else:
                start_cursor = data["next_cursor"]

        # Split the content by custom segment identifier "___"
        content = "\n".join(result_lines_arr)
        segments = content.split("___")
        return segments
With this modification, the collected block text is joined, split on the custom segment identifier "___", and returned as a list of segments [1].
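As a minimal, self-contained sketch of the splitting step alone (no Notion API calls), the snippet below joins block texts and splits them on "___". The helper name `split_by_identifier` and the sample data are hypothetical. Unlike the method above, it also strips whitespace and drops empty segments, which is an assumption about the desired result rather than Dify's behavior.

```python
def split_by_identifier(lines: list[str], identifier: str = "___") -> list[str]:
    """Join block texts and split them on a custom segment identifier."""
    content = "\n".join(lines)
    # Strip surrounding whitespace and drop segments that end up empty,
    # e.g. when the identifier is the last thing in the document.
    return [part.strip() for part in content.split(identifier) if part.strip()]


# Hypothetical sample: two Q/A pairs, each followed by the identifier.
sample_lines = [
    "Q: What is Dify?\n\n",
    "A: An LLM app platform.\n\n",
    "___",
    "Q: Is it open source?\n\n",
    "A: Yes.\n\n",
    "___",
]
segments = split_by_identifier(sample_lines)
```

Dropping empty segments avoids a trailing empty entry when the document ends with the identifier.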
Additionally, when uploading documents to the knowledge base in Dify, you can configure custom segmentation rules. According to the Dify documentation, you can set a segmentation identifier such as "\n" or any other custom identifier, and the system will segment the text when the identifier appears. This feature is part of the custom mode for segmentation and cleaning, which allows you to configure chunk settings and cleaning settings according to different document formats and scenario requirements [2][3][4].
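To illustrate how a segment identifier and a chunk-size limit could interact, here is a hypothetical sketch. It is not Dify's actual splitter: the function name `segment_text` and the `max_chars` parameter are assumptions made for illustration. Text is first split on the identifier, and any piece longer than the limit is split again.

```python
def segment_text(text: str, identifier: str = "___", max_chars: int = 500) -> list[str]:
    """Split on an identifier, then re-split pieces that exceed max_chars."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    segments: list[str] = []
    for piece in text.split(identifier):
        piece = piece.strip()
        if not piece:
            continue
        # Fixed-width re-split of oversized pieces; a real splitter would
        # likely prefer sentence or paragraph boundaries.
        for start in range(0, len(piece), max_chars):
            segments.append(piece[start:start + max_chars])
    return segments


# Hypothetical data: one short and one long Q/A pair.
short_qa = "Q: short?\nA: yes."
long_qa = "Q: long?\nA: " + "x" * 40
result = segment_text(f"{short_qa}___{long_qa}", max_chars=30)
```

Under a scheme like this, a long Q/A pair yields more segments than there are identifiers, so the chunk-length setting may be worth checking when the segment count is higher than expected.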
For more detailed steps on how to configure and sync your Notion data with Dify, you can refer to the Dify documentation on syncing from Notion, which includes instructions on authentication, data import, segmentation settings, and synchronization [4].
It is also recommended to manually check the segment quality before associating the knowledge base with the application. The quality of document segmentation significantly affects the Q&A performance of the knowledge base application, and manual checking and correction can compensate for the shortcomings of machine segmentation in semantic recognition [5].
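Part of that manual check can be supported by a small script. The sketch below is hypothetical: it assumes the segments are already available as a list of strings (for example, copied from the segmentation preview) and flags segments that do not contain both a question and an answer marker. The `Q:`/`A:` markers and the function name `find_suspect_segments` are illustrative assumptions.

```python
def find_suspect_segments(
    segments: list[str], markers: tuple[str, ...] = ("Q:", "A:")
) -> list[int]:
    """Return the indexes of segments missing any of the expected markers."""
    return [
        index
        for index, segment in enumerate(segments)
        if not all(marker in segment for marker in markers)
    ]


# Hypothetical preview in which the second Q/A pair was split in two.
preview = ["Q: one?\nA: first.", "Q: two?", "A: second."]
suspects = find_suspect_segments(preview)
```

Flagged indexes point at segments worth inspecting by hand; a segment can still be wrong even if it passes this check.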
Self Checks
Dify version
0.7.0
Cloud or Self Hosted
Self Hosted (Docker)
Steps to reproduce
I have integrated a Notion document into the Dify knowledge base and am using the custom segment identifier "___" to separate each Q/A segment. However, the segmentation result is not split by my custom identifier.
✔️ Expected Behavior
It should split into exactly 14 segments, as I saved them separately in my Notion document.
❌ Actual Behavior
Instead, it produces 18 segments. Some of the Q/A pairs are split into 3 separate segments, which might result in poor answers to user queries.