langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Notion syncing only some pages #9012

Open BURG3R5 opened 3 weeks ago

BURG3R5 commented 3 weeks ago

Dify version

0.9.1

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Followed the typical steps to connect Dify to Notion using an internal integration.

✔️ Expected Behavior

We have 1228 pages in our Notion database. All of them should be indexed into the Knowledge Base.

❌ Actual Behavior

The selection screen shows 125 pages selected and upon initial sync completion, only 107 documents are indexed.

[Screenshots attached: 2024-10-04 at 3:03:11 PM and 2024-10-04 at 3:01:15 PM]

dosubot[bot] commented 3 weeks ago

Hey @BURG3R5! 👋 I'm here to help you with any bugs, questions, or contributions you have for Dify. I'm a bot designed to assist you while waiting for a human maintainer. Let's tackle this issue together!

I found a similar issue that was previously reported and closed: [Cloud] Missing added pages when trying to add a new page to Knowledge Base with Notion integration. This issue was resolved by a pull request that fixed the problem with adding Notion pages to the knowledge base [1].

To troubleshoot and resolve your issue with incomplete syncing of Notion pages in Dify, you can follow these steps:

  1. Authorization Verification: Ensure that the authorization verification is completed successfully. This involves binding the Notion content to Dify either during dataset creation or through the settings under Data Sources.

  2. Check Integration Configuration: Verify that the integration with Notion is correctly configured. Ensure that the environment variables in the .env file are set correctly for either internal or public integration.

  3. Segmentation and Cleaning: Confirm that the segmentation settings and indexing method are correctly configured. This affects how data is processed and imported.

  4. Synchronization Process: If the Notion content is modified, ensure that the synchronization is triggered by clicking Sync in the Dify dataset Document List Page.

  5. Data Type Limitations: Be aware that images and files are not supported for import, and tabular data will be converted to text display, which might affect the completeness of the sync.

  6. Automatic Sync Setup: Ensure that the automatic sync is set up correctly so that any updates in Notion are automatically reflected in Dify.

If these steps do not resolve the issue, further investigation into the specific configuration and logs may be necessary. Additionally, ensure that you are following the specific steps for internal or public integration as outlined in the Dify documentation [2][3][4].

To continue talking to Dosu, mention @dosu.

BURG3R5 commented 3 weeks ago

@dosu The mentioned issue is not really similar to mine. I need help from a human maintainer.

dosubot[bot] commented 3 weeks ago

This is a tough nut to crack, could really use your help on this, @takatost.

BURG3R5 commented 3 weeks ago

Additionally, I have already attempted the fixes the bot suggested, and none of them helped.

JirehC commented 2 weeks ago

Are there any updates on this issue? I faced the same problem of missing pages during sync, on both cloud and self-hosted.

BURG3R5 commented 2 weeks ago

@JirehC I've moved on to downloading Notion pages as Markdown and uploading them to an empty knowledge base using the API. Can't trust Dify to respond in time or even give proper error messages, so I'm implementing the upload and sync processes myself. At least this way I have more control over the processes and can actually see what's going wrong.

Not gonna close the issue because this is still an unsolved problem in the product.
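
For readers who want to try the same workaround, here is a minimal sketch of uploading one Markdown file as a document through the knowledge base API. It is an illustration, not the code used in this thread. The `create_by_text` path, the `indexing_technique` and `process_rule` fields, and the placeholder base URL, API key, and dataset ID are assumptions based on the knowledge base API docs, so check them against your Dify version. `build_upload_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_upload_request(base_url, api_key, dataset_id, name, markdown_text):
    """Build (but do not send) a request that creates a document from text."""
    # Assumed endpoint path; verify it against the API docs for your Dify version.
    url = f"{base_url.rstrip('/')}/datasets/{dataset_id}/document/create_by_text"
    payload = {
        "name": name,                          # document title shown in the knowledge base
        "text": markdown_text,                 # the exported Markdown content
        "indexing_technique": "high_quality",  # assumed field and value
        "process_rule": {"mode": "automatic"}, # assumed field and value
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To actually send it, pass the returned request to `urllib.request.urlopen(req, timeout=60)` and inspect the response status and body, which is where the error messages missing from the UI would show up.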

JirehC commented 2 weeks ago

@BURG3R5 I've found a potential cause for this, and it's not something we can fix; it might even be a Notion problem rather than a Dify problem. The likely culprit is a timeout. For example, exporting a Notion workspace to Anytype through its integration has the same issue as Dify, with pages going missing. Anytype explained that the files may be too large to transfer before the request times out.

I've also switched to downloading Notion pages as Markdown, but got fed up with uploading them one by one because I couldn't find anything about uploading via the API. Could you send me the link to the documentation on how to do it?

Edit: Is this the documentation you follow? https://docs.dify.ai/guides/knowledge-base/maintain-dataset-via-api

[Screenshot attached: 2024-10-11 at 17:07:30]

BURG3R5 commented 2 weeks ago

@JirehC

  1. Regarding the API: Yes, I am using that alongside the notion-to-md library.
  2. Regarding the issue: Extending the timeout on the server-side and using exponential back-off to avoid rate-limits should be simple enough for Dify maintainers. Notion's "retrieve block children" API is paginated, so even if there are many blocks inside the document, setting the page_size parameter to be small enough should resolve the timeout issues.
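
The pagination and back-off idea in point 2 can be sketched roughly as follows. This is not Dify's code. `fetch_page` is a hypothetical callable standing in for one call to Notion's "retrieve block children" endpoint; it is assumed to return the parsed JSON with `results`, `has_more`, and `next_cursor`, and to raise the hypothetical `RateLimited` exception on an HTTP 429 or a timeout. The default `page_size` of 25 and the delay values are arbitrary.

```python
import random
import time


class RateLimited(Exception):
    """Raised by fetch_page when the API answers HTTP 429 or times out."""


def with_backoff(call, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Retry call() on RateLimited, doubling the wait each attempt plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimited:
            if attempt == max_retries:
                raise  # give up and surface the error instead of dropping the page
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))


def fetch_all_children(fetch_page, block_id, page_size=25, sleep=time.sleep):
    """Collect every child block by following next_cursor until has_more is false."""
    blocks, cursor = [], None
    while True:
        # A small page_size keeps each individual request short.
        page = with_backoff(
            lambda: fetch_page(block_id, page_size, cursor), sleep=sleep
        )
        blocks.extend(page["results"])
        if not page.get("has_more"):
            return blocks
        cursor = page["next_cursor"]
```

Re-raising after the last retry is deliberate: a page that cannot be fetched then fails loudly rather than silently going missing from the index.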

I haven't explored the Dify repository myself to figure out exactly where the issue is, since it is a massive monorepo containing everything from the frontend to client SDKs to the backend. I wish a maintainer would provide some insight into which code is failing.

JirehC commented 2 weeks ago

Thanks @BURG3R5. I hadn't come across that library; this is great. Thanks for sharing!

Yawen-1010 commented 1 week ago

@crazywoola @JohnJyong