langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

fix: document truncation and loss in notion document sync #5631

Open Aurelius-Huang opened 3 days ago

Aurelius-Huang commented 3 days ago

Description

The Notion extractor retrieves only the first page of blocks when a Notion Page contains many blocks; all subsequent blocks are lost.

According to the Pagination section of the Notion Developers documentation, when a Notion Page contains more than 100 blocks, they must be retrieved page by page to get the complete content of the Notion Page.

However, the retrieval logic in notion_extractor.py only ever fetches the first page of blocks (up to 100) of a Notion Page. The Notion Developers documentation shows why: when calling https://api.notion.com/v1/blocks/{block_id}/children for the next page, the code mistakenly passes the next page's start_cursor as block_id, whereas start_cursor must be passed as a query parameter of the GET request.

In addition, the query parameters of the GET request are passed incorrectly: they are sent through the `json` argument when they should be sent through `params`.
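For illustration, the corrected request pattern could look roughly like the sketch below. It is not the actual code in notion_extractor.py: the function name `fetch_all_children` and the injectable `http_get` argument are hypothetical, chosen so the loop can be exercised without network access. It assumes the response carries `results`, `has_more`, and `next_cursor` fields as described in the Notion pagination documentation.

```python
from typing import Any, Callable, Optional

# Endpoint quoted in the description above; block_id always stays the parent block.
NOTION_CHILDREN_URL = "https://api.notion.com/v1/blocks/{block_id}/children"


def fetch_all_children(
    block_id: str,
    headers: dict[str, str],
    http_get: Callable[..., Any],
) -> list[dict[str, Any]]:
    """Collect every child block of `block_id`, following Notion's cursor pagination.

    `http_get` is a hypothetical injection point: any callable with the same
    keyword interface as `requests.get` (url, headers=..., params=...).
    """
    url = NOTION_CHILDREN_URL.format(block_id=block_id)
    blocks: list[dict[str, Any]] = []
    start_cursor: Optional[str] = None

    while True:
        # The cursor goes into the query string (params), not into the URL
        # path and not into a JSON body.
        params: dict[str, Any] = {"page_size": 100}
        if start_cursor is not None:
            params["start_cursor"] = start_cursor

        data = http_get(url, headers=headers, params=params).json()
        blocks.extend(data.get("results", []))

        # Stop when Notion reports no further pages (or gives no cursor).
        start_cursor = data.get("next_cursor")
        if not data.get("has_more") or start_cursor is None:
            break

    return blocks
```

In real use, `http_get` would typically be `requests.get`, with `headers` carrying the integration token and the `Notion-Version` header.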

Fixes # (issue)

Type of Change

Please delete options that are not relevant.

How Has This Been Tested?

Find a long Notion Page (more than 100 blocks) and run the Sync from Notion operation in Knowledge. With this PR, the complete Notion Page content is synchronized; the previous version retrieves only the first 100 blocks, and the rest of the content is lost.

Suggested Checklist: