langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Custom segment settings not splitting as expected #7334

Open Delgerskhn opened 1 month ago

Delgerskhn commented 1 month ago

Dify version

0.7.0

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

I have integrated a Notion document into a Dify knowledge base and am using the custom segment identifier "___" to separate each Q/A segment. However, the segmentation result is not split on my custom identifier.

Screenshot 2024-08-16 at 09 46 52

✔️ Expected Behavior

It should split into exactly 14 segments, since I saved them separately in my Notion document.

Screenshot 2024-08-16 at 09 49 53

❌ Actual Behavior

Instead, it produces 18 segments: some Q/A pairs are split into 3 separate segments, which might result in poor answers to user queries.

Screenshot 2024-08-16 at 09 50 14
dosubot[bot] commented 1 month ago

@Delgerskhn Hello there! I'm here to help you with bug reports, answer questions, and guide you on becoming a contributor. Let's squash those bugs together! 🤖

To get the expected 14 segments from your Notion document with the custom segment identifier "___" in Dify 0.7.0, make sure the identifier is actually handled in the _get_notion_block_data method.

The code as it stands shows no handling for custom segment identifiers such as "___", so you would need to add that logic yourself.

Here is an example of how you could modify the _get_notion_block_data method to split on the custom segment identifier "___":

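# Imports this method relies on (they belong at the top of the module).
from typing import Any

import requests

# BLOCK_CHILD_URL_TMPL and HEADING_SPLITTER are assumed to be module-level
# constants defined in the same file as this method.
def _get_notion_block_data(self, page_id: str) -> list[str]: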
    result_lines_arr = []
    start_cursor = None
    block_url = BLOCK_CHILD_URL_TMPL.format(block_id=page_id)
    while True:
        query_dict: dict[str, Any] = {} if not start_cursor else {'start_cursor': start_cursor}
        res = requests.request(
            "GET",
            block_url,
            headers={
                "Authorization": "Bearer " + self._notion_access_token,
                "Content-Type": "application/json",
                "Notion-Version": "2022-06-28",
            },
            params=query_dict
        )
        data = res.json()
        for result in data["results"]:
            result_type = result["type"]
            result_obj = result[result_type]
            cur_result_text_arr = []
            if result_type == 'table':
                result_block_id = result["id"]
                text = self._read_table_rows(result_block_id)
                text += "\n\n"
                result_lines_arr.append(text)
            else:
                if "rich_text" in result_obj:
                    for rich_text in result_obj["rich_text"]:
                        # skip if doesn't have text object
                        if "text" in rich_text:
                            text = rich_text["text"]["content"]
                            cur_result_text_arr.append(text)

                result_block_id = result["id"]
                has_children = result["has_children"]
                block_type = result["type"]
                if has_children and block_type != 'child_page':
                    children_text = self._read_block(
                        result_block_id, num_tabs=1
                    )
                    cur_result_text_arr.append(children_text)
            cur_result_text = "\n".join(cur_result_text_arr)
            if result_type in HEADING_SPLITTER:
                result_lines_arr.append(f"{HEADING_SPLITTER[result_type]}{cur_result_text}")
            else:
                result_lines_arr.append(cur_result_text + '\n\n')

        if data["next_cursor"] is None:
            break
        else:
            start_cursor = data["next_cursor"]

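    # Join the collected blocks, then split on the custom segment identifier "___".
    # Strip each piece and drop empty ones so that a leading or trailing "___"
    # does not produce blank segments.
    content = "\n".join(result_lines_arr)
    segments = [part.strip() for part in content.split("___") if part.strip()]
    return segments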

With this modification, the method splits the content on the custom segment identifier "___" and returns the segments as a list [1].
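The splitting step can also be checked in isolation, without calling the Notion API. The following is a minimal sketch, assuming the page text has already been joined into a single string; `split_on_identifier` and the sample text are hypothetical and not part of Dify.

```python
def split_on_identifier(content: str, identifier: str = "___") -> list[str]:
    """Split content on the identifier, dropping blank pieces.

    Hypothetical helper for illustration; not part of Dify.
    """
    if not identifier:
        raise ValueError("identifier must be a non-empty string")
    # str.split keeps empty strings around leading/trailing identifiers,
    # so strip each piece and discard the ones that end up empty.
    return [part.strip() for part in content.split(identifier) if part.strip()]


# Made-up sample: two Q/A pairs, with a trailing identifier at the end.
sample = (
    "Q: What is Dify?\nA: An LLM app platform.\n"
    "___\n"
    "Q: Is it open source?\nA: Yes.\n"
    "___\n"
)

segments = split_on_identifier(sample)
print(len(segments))
for segment in segments:
    print(repr(segment))
```

Running the same helper on the exported text of the Notion page is a quick way to confirm how many segments the identifier alone should produce.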

Additionally, when uploading documents to the knowledge base in Dify, you can configure custom segmentation rules. According to the Dify documentation, you can set a segmentation identifier such as "\n" or any other custom identifier, and the system will segment the text when the identifier appears. This feature is part of the custom mode for segmentation and cleaning, which allows you to configure chunk settings and cleaning settings according to different document formats and scenario requirements [2][3][4].
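To illustrate how an identifier and a chunk-length setting could interact, here is a minimal sketch. It assumes a chunker that splits on the identifier first and then breaks up any piece longer than a maximum length. `chunk_text` and `max_chars` are hypothetical names, length is counted in characters rather than tokens, and this is not Dify's actual implementation. Under that assumption, a single long Q/A block would yield several segments, so the total could exceed the number of blocks separated by the identifier.

```python
def chunk_text(content: str, identifier: str, max_chars: int) -> list[str]:
    """Split on identifier, then cut pieces longer than max_chars.

    Hypothetical illustration only; not Dify's chunking code.
    """
    if not identifier:
        raise ValueError("identifier must be a non-empty string")
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks: list[str] = []
    for piece in content.split(identifier):
        piece = piece.strip()
        if not piece:
            continue
        # A piece within the limit stays whole; a longer one is cut
        # into consecutive slices of at most max_chars characters.
        for start in range(0, len(piece), max_chars):
            chunks.append(piece[start:start + max_chars])
    return chunks


# Made-up text: three blocks separated by "___", the middle one is long.
text = "short answer___" + "x" * 25 + "___another short one"

print(len(chunk_text(text, "___", max_chars=100)))  # limit never reached
print(len(chunk_text(text, "___", max_chars=20)))   # middle block is cut in two
```

If the real settings behave similarly, raising the maximum chunk length above the longest Q/A block would be one thing to try.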

For more detailed steps on how to configure and sync your Notion data with Dify, you can refer to the Dify documentation on syncing from Notion, which includes instructions on authentication, data import, segmentation settings, and synchronization [4].

It is also recommended to check the segment quality manually before associating the knowledge base with the application. The quality of document segmentation significantly affects the Q&A performance of the knowledge base application, and manual checking and correction can compensate for the shortcomings of machine segmentation in semantic recognition [5].
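A small script can support that manual review by pointing out where to look first. This is a minimal sketch, assuming the segments are available as a list of strings; `review_segments`, `expected_count`, and `min_chars` are hypothetical names, and the threshold is arbitrary.

```python
def review_segments(
    segments: list[str], expected_count: int, min_chars: int = 20
) -> list[str]:
    """Return human-readable findings for a list of segments.

    Hypothetical helper for illustration; not part of Dify.
    """
    findings: list[str] = []
    if len(segments) != expected_count:
        findings.append(
            f"expected {expected_count} segments, got {len(segments)}"
        )
    # Very short segments often indicate a Q/A pair that was cut apart.
    for index, segment in enumerate(segments, start=1):
        length = len(segment.strip())
        if length < min_chars:
            findings.append(f"segment {index} is only {length} characters long")
    return findings


# Made-up example: one Q/A pair that ended up split into two segments.
example_segments = [
    "Q: Is Dify open source?\nA: Yes, the code is on GitHub.",
    "A: Yes.",
]

for finding in review_segments(example_segments, expected_count=1):
    print(finding)
```

The findings are only hints; the segments they point to still need to be read and corrected by hand.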
