langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Error in "Text Preprocessing and Cleaning" stage when adding new files to existing knowledge base #7516

Closed: cht-k closed this issue 1 week ago

cht-k commented 3 weeks ago

Self Checks

Dify version

0.7.1

Cloud or Self Hosted

Cloud

Steps to reproduce

When attempting to add additional documents to an existing Knowledge in Dify Cloud, an error occurs during the "Text Preprocessing and Cleaning" stage. The error message indicates that NLTK (Natural Language Toolkit) is unable to locate a resource named "punkt_tab".

Steps to Reproduce:

  1. Navigate to an existing Knowledge base
  2. Click the "Add file" button
  3. On the "Upload file" page, upload a new document and click "Next"
  4. Error occurs upon entering the "Text Preprocessing and Cleaning" page


{
    "code": "indexing_estimate_error",
    "message": "\n**********************************************************************\n  Resource \u001b[93mpunkt_tab\u001b[0m not found.\n  Please use the NLTK Downloader to obtain the resource:\n\n  \u001b[31m>>> import nltk\n  >>> nltk.download('punkt_tab')\n  \u001b[0m\n  For more information see: https://www.nltk.org/data.html\n\n  Attempted to load \u001b[93mtokenizers/punkt_tab/english/\u001b[0m\n\n  Searched in:\n    - '/root/nltk_data'\n    - '/app/api/.venv/nltk_data'\n    - '/app/api/.venv/share/nltk_data'\n    - '/app/api/.venv/lib/nltk_data'\n    - '/usr/share/nltk_data'\n    - '/usr/local/share/nltk_data'\n    - '/usr/lib/nltk_data'\n    - '/usr/local/lib/nltk_data'\n**********************************************************************\n",
    "status": 500
}
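The directory list at the bottom of that error is NLTK's data search path: directories taken from the NLTK_DATA environment variable first, then a set of fixed defaults. That is why either downloading the resource into any one of those directories, or pointing NLTK_DATA at a mounted data directory, resolves the lookup. A simplified sketch of the resolution (not NLTK's actual implementation):

```python
import os

# Defaults modeled on the paths listed in the error message above.
DEFAULT_DIRS = [
    "/root/nltk_data",
    "/usr/share/nltk_data",
    "/usr/local/share/nltk_data",
    "/usr/lib/nltk_data",
    "/usr/local/lib/nltk_data",
]

def find_resource(resource_path, env=None, defaults=DEFAULT_DIRS):
    """Return the first search directory containing resource_path, else None.

    Mimics (loosely) how nltk.data builds its search path: NLTK_DATA
    entries first, then the fixed defaults.
    """
    env = os.environ if env is None else env
    nltk_data = env.get("NLTK_DATA", "")
    search = nltk_data.split(os.pathsep) if nltk_data else []
    for base in search + defaults:
        candidate = os.path.join(base, resource_path)
        if os.path.isdir(candidate) or os.path.isfile(candidate):
            return base
    return None
```

For a self-hosted deployment the practical upshot is that baking `tokenizers/punkt_tab/english/` into any of the listed directories at image-build time avoids the runtime lookup failure; on Cloud, only the Dify team can do this.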

✔️ Expected Behavior

Ability to embed new documents and add them to an existing Knowledge base.

❌ Actual Behavior

Error message appears when entering the "Text Preprocessing and Cleaning" page.

dosubot[bot] commented 3 weeks ago

Hello @cht-k! I'm here to help you with any bugs, questions, or contributions. Let's work together to solve the issue you're facing.

The error you're encountering indicates that the NLTK resource punkt_tab is missing and needs to be downloaded. The error message provides instructions on how to download the resource using the NLTK Downloader.

To resolve this issue, you can add the following code snippet to your initialization script or setup file to ensure that the punkt_tab resource is downloaded:

import nltk

# Download the tokenizer data named in the error message. In recent NLTK
# releases 'punkt_tab' is a separate resource from 'punkt', so downloading
# 'punkt' alone will not satisfy this lookup.
nltk.download('punkt_tab')

This downloads the punkt_tab tokenizer data that the error reports as missing.

If you have access to the codebase, you can add this snippet to the appropriate initialization section. If not, you may need to contact the maintainers or support team to have this change implemented.

Currently, there is no specific information in the Dify codebase regarding the handling or downloading of NLTK resources, including punkt_tab [1][2].


crazywoola commented 3 weeks ago

Which LLM did you choose for system processing?

cht-k commented 3 weeks ago

I'm not entirely sure what specific information you're requesting here, but I can provide some details that might be relevant:

From my observation, this issue mainly affects the process when the chunk settings are set to "automatic".

It's worth noting that I'm using the Dify Cloud version, and these error messages appear to be generated by Dify Cloud's backend. Therefore, I don't believe my local system information is relevant to this issue.

Could you please clarify if you need any other specific information about the LLM or system settings? I'd be happy to provide more details if I can locate them in the Dify Cloud interface.

tamanobi commented 3 weeks ago

I am also facing the same error on Dify Cloud. Whenever I create an empty knowledge base, the error occurs.

I was able to avoid the punkt_tab error by following these steps:

If you use the HTTP API to create a document, choose the "custom" mode for process_rule.
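The workaround above can be sketched as a request payload for the knowledge (dataset) HTTP API, where `process_rule.mode` is set to `"custom"` so the "automatic" preprocessing pipeline that triggers the punkt_tab lookup is bypassed. The field names and rule values below are illustrative; check the Dify API documentation for your version before relying on them:

```python
import json

def build_create_by_text_payload(name, text):
    """Build a hypothetical document-creation payload with custom process rules."""
    return {
        "name": name,
        "text": text,
        "indexing_technique": "high_quality",
        "process_rule": {
            "mode": "custom",  # the key part of the workaround: avoid "automatic"
            "rules": {
                "pre_processing_rules": [
                    {"id": "remove_extra_spaces", "enabled": True},
                    {"id": "remove_urls_emails", "enabled": False},
                ],
                "segmentation": {"separator": "\n", "max_tokens": 500},
            },
        },
    }

payload = build_create_by_text_payload("example.txt", "Some document text.")
print(json.dumps(payload, indent=2))
```

POST this body (with your API key) to the dataset's document-creation endpoint; the same `process_rule` shape applies when creating by file upload.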

crazywoola commented 1 week ago

This should be resolved in #7582