Custom segment settings not splitting as expected

Self Checks

[X] This is only for bug report, if you would like to ask a question, please head to Discussions.
[X] I have searched for existing issues, including closed ones.
[X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[X] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
[X] Please do not modify this template :) and fill in all the required fields.

Dify version

0.7.0

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

I integrated a Notion document into a Dify knowledge base and set the custom segment identifier "___" to separate each Q/A segment. However, the segmentation result is not split on my custom identifier.

✔️ Expected Behavior

It should split into exactly 14 segments, matching how I separated them in my Notion document.

❌ Actual Behavior

Instead, it produces 18 segments: some Q/A pairs are split into 3 separate segments, which might result in poor answers to user queries.

@Delgerskhn Hello there! I'm here to help you with bug reports, answer questions, and guide you on becoming a contributor. Let's squash those bugs together! 🤖

To get your Notion document to split into the expected 14 segments with the custom segment identifier "___" in Dify version 0.7.0, you need to ensure that the identifier is correctly handled in the _get_notion_block_data method.

Currently, the code does not show any implementation for custom segment identifiers like "___". You need to add logic to handle this custom segment identifier.

Here is an example of how you can modify the _get_notion_block_data method to include the custom segment identifier "___":

def _get_notion_block_data(self, page_id: str) -> list[str]:
    result_lines_arr = []
    start_cursor = None
    block_url = BLOCK_CHILD_URL_TMPL.format(block_id=page_id)
    while True:
        query_dict: dict[str, Any] = {} if not start_cursor else {'start_cursor': start_cursor}
        res = requests.request(
            "GET",
            block_url,
            headers={
                "Authorization": "Bearer " + self._notion_access_token,
                "Content-Type": "application/json",
                "Notion-Version": "2022-06-28",
            },
            params=query_dict
        )
        data = res.json()
        for result in data["results"]:
            result_type = result["type"]
            result_obj = result[result_type]
            cur_result_text_arr = []
            if result_type == 'table':
                result_block_id = result["id"]
                text = self._read_table_rows(result_block_id)
                text += "\n\n"
                result_lines_arr.append(text)
            else:
                if "rich_text" in result_obj:
                    for rich_text in result_obj["rich_text"]:
                        # skip if doesn't have text object
                        if "text" in rich_text:
                            text = rich_text["text"]["content"]
                            cur_result_text_arr.append(text)

                result_block_id = result["id"]
                has_children = result["has_children"]
                block_type = result["type"]
                if has_children and block_type != 'child_page':
                    children_text = self._read_block(
                        result_block_id, num_tabs=1
                    )
                    cur_result_text_arr.append(children_text)
            cur_result_text = "\n".join(cur_result_text_arr)
            if result_type in HEADING_SPLITTER:
                result_lines_arr.append(f"{HEADING_SPLITTER[result_type]}{cur_result_text}")
            else:
                result_lines_arr.append(cur_result_text + '\n\n')

        if data["next_cursor"] is None:
            break
        else:
            start_cursor = data["next_cursor"]

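    # Split the content by the custom segment identifier "___",
    # dropping empty pieces (e.g. from a trailing identifier)
    content = "\n".join(result_lines_arr)
    segments = [segment.strip() for segment in content.split("___") if segment.strip()]
    return segments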

This modification ensures that the content is split by the custom segment identifier "___" and returns the segments as a list ^[1].
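The splitting step can also be tried in isolation, outside the Notion extractor. The following is a minimal sketch, not Dify's actual code: split_by_identifier is a hypothetical helper name and the sample text is made up for illustration. It splits on the identifier and drops the empty pieces that a leading or trailing "___" would otherwise produce.

```python
def split_by_identifier(content: str, identifier: str = "___") -> list[str]:
    """Split text on a custom identifier, dropping empty pieces.

    Hypothetical helper for illustration; not part of Dify.
    """
    if not identifier:
        raise ValueError("identifier must be a non-empty string")
    # Strip surrounding whitespace so "___\n" does not leave stray newlines.
    pieces = (piece.strip() for piece in content.split(identifier))
    return [piece for piece in pieces if piece]


# Made-up sample: two Q/A pairs, each followed by the identifier.
sample = (
    "Q: first question\nA: first answer\n___\n"
    "Q: second question\nA: second answer\n___\n"
)
segments = split_by_identifier(sample)
print(len(segments))
```

With the trailing identifier in the sample, a plain content.split("___") would return a third, whitespace-only piece; filtering keeps the count equal to the number of Q/A pairs.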

Additionally, when uploading documents to the knowledge base in Dify, you can configure custom segmentation rules. According to the Dify documentation, you can set a segmentation identifier such as "\n" or any other custom identifier, and the system will segment the text when the identifier appears. This feature is part of the custom mode for segmentation and cleaning, which allows you to configure chunk settings and cleaning settings according to different document formats and scenario requirements ^[2]^[3]^[4].
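One possible way a custom identifier can still yield more segments than expected is a length limit applied after the identifier split. The sketch below illustrates that idea only; it assumes a hypothetical max_chars cap and a simple fixed-window re-split, and it is not a confirmed description of how Dify 0.7.0 behaves or of the cause of this issue.

```python
def segment_with_cap(text: str, identifier: str, max_chars: int) -> list[str]:
    """Split on identifier, then re-split any piece longer than max_chars.

    Illustrative only: the cap and the windowing strategy are assumptions.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    segments: list[str] = []
    for piece in text.split(identifier):
        piece = piece.strip()
        if not piece:
            continue
        # Hypothetical length cap: oversized pieces are cut into fixed windows.
        for start in range(0, len(piece), max_chars):
            segments.append(piece[start:start + max_chars])
    return segments


# Three logical pieces; the middle one is 250 characters long.
doc = "short answer___" + "x" * 250 + "___another short answer"
print(len(segment_with_cap(doc, "___", max_chars=100)))   # middle piece is cut in 3
print(len(segment_with_cap(doc, "___", max_chars=1000)))  # cap never triggers
```

Under these assumptions, a single long Q/A pair ends up as several segments even though the identifier is honored, so comparing the longest Q/A pair against the configured chunk length may be a useful first check.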

For more detailed steps on how to configure and sync your Notion data with Dify, you can refer to the Dify documentation on syncing from Notion, which includes instructions on authentication, data import, segmentation settings, and synchronization ^[4].

It is also recommended to manually check the segment quality before associating the knowledge base with the application. The quality of document segmentation significantly affects the Q&A performance of the knowledge base application, and manual checking and correction can effectively compensate for the shortcomings of machine segmentation in semantic recognition ^[5].
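Part of that manual check can be scripted if the segment texts are copied out of the knowledge base. This is a rough sketch under stated assumptions: check_segments is a hypothetical helper, and the min_chars threshold for flagging fragments is an arbitrary illustrative value.

```python
def check_segments(
    segments: list[str], expected_count: int, min_chars: int = 20
) -> list[str]:
    """Return human-readable warnings about a segmentation result.

    Hypothetical helper; min_chars is an arbitrary illustrative threshold.
    """
    warnings: list[str] = []
    if len(segments) != expected_count:
        warnings.append(
            f"expected {expected_count} segments, got {len(segments)}"
        )
    for index, text in enumerate(segments):
        length = len(text.strip())
        # Very short segments often indicate a Q/A pair that was cut apart.
        if length < min_chars:
            warnings.append(
                f"segment {index} is suspiciously short ({length} chars)"
            )
    return warnings


for warning in check_segments(["Q: example question\nA: example answer", "A:"], 1):
    print(warning)
```

For the case in this issue, expected_count would be 14, and the short-segment warnings would point at the Q/A pairs that were broken into fragments.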
