langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Unstructured text extractor missing nltk/data punkt #4659

Closed: AndyMik90 closed this issue 1 week ago

AndyMik90 commented 1 month ago

Dify version

0.6.8

Cloud or Self Hosted

Self Hosted (Docker), Self Hosted (Source)

Steps to reproduce

Upload a text file to a knowledge base with the ETL type set to Unstructured.

✔️ Expected Behavior

The NLTK punkt resource is downloaded (or otherwise made available) before it is used.

❌ Actual Behavior

nltk is installed as a dependency of unstructured, but the punkt data is not. Maybe it needs to be downloaded in the unstructured extractors?

Error: raised by the indexing estimate API, in core/rag/extractor/unstructured/unstructured_text_extractor.py, in extract at line 29:

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')
  
  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/root/nltk_data'
    - '/usr/local/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************
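As a diagnostic aid, here is a minimal sketch that reports which of the searched directories actually contain the punkt data. It is not part of Dify or NLTK: `dirs_with_resource` and `SEARCHED_DIRS` are hypothetical names, and the check for a sibling `.zip` archive is an assumption about how the data may be stored on disk.

```python
from pathlib import Path
from typing import Iterable, List

# Directories copied from the "Searched in" list in the error above (duplicates removed).
SEARCHED_DIRS = [
    "/root/nltk_data",
    "/usr/local/nltk_data",
    "/usr/local/share/nltk_data",
    "/usr/local/lib/nltk_data",
    "/usr/share/nltk_data",
    "/usr/lib/nltk_data",
]


def dirs_with_resource(search_dirs: Iterable[str], relative: str = "tokenizers/punkt") -> List[str]:
    """Return the search directories that contain the resource, unpacked or zipped."""
    found = []
    for directory in search_dirs:
        if not directory:
            continue  # the error output also lists an empty entry; skip it
        base = Path(directory) / relative
        # Assumption: the data is either an unpacked directory or a sibling .zip archive.
        if base.exists() or base.with_name(base.name + ".zip").exists():
            found.append(directory)
    return found


if __name__ == "__main__":
    print(dirs_with_resource(SEARCHED_DIRS))
```

Run inside the container that raises the error. An empty list would be consistent with the LookupError above.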

Suggestion:

import nltk

# Download the 'punkt' resource
nltk.download('punkt')

dosubot[bot] commented 1 month ago

To resolve the missing NLTK 'punkt' resource in Dify 0.6.8, make sure the resource is downloaded before the extractor runs.

Here is an updated UnstructuredTextExtractor class that downloads 'punkt' if it is not already available:

import logging
import nltk

from core.rag.extractor.extractor_base import BaseExtractor
from core.rag.models.document import Document

logger = logging.getLogger(__name__)

class UnstructuredTextExtractor(BaseExtractor):
    """Load text files.

    Args:
        file_path: Path to the file to load.
    """

    def __init__(
        self,
        file_path: str,
        api_url: str
    ):
        """Initialize with file path."""
        self._file_path = file_path
        self._api_url = api_url

        # Ensure the 'punkt' resource is downloaded
        try:
            nltk.data.find('tokenizers/punkt')
        except LookupError:
            nltk.download('punkt')

    def extract(self) -> list[Document]:
        # Imported lazily so that unstructured is only loaded when extraction runs.
        from unstructured.chunking.title import chunk_by_title
        from unstructured.partition.text import partition_text

        # Partition the file into elements, then group them into title-based chunks.
        elements = partition_text(filename=self._file_path)
        chunks = chunk_by_title(elements, max_characters=2000, combine_text_under_n_chars=2000)

        documents = []
        for chunk in chunks:
            text = chunk.text.strip()
            documents.append(Document(page_content=text))

        return documents

This code checks whether the 'punkt' tokenizer is available when the extractor is constructed and downloads it if it is not, so the resource is present before extract() runs [1].
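The check-then-download step could also be pulled out into a small helper, so that a failed download is reported instead of surfacing later as a LookupError. The following is only a sketch under stated assumptions: `ensure_resource` and its `find`/`download` parameters are hypothetical and not part of Dify. By default they fall back to `nltk.data.find` and `nltk.download`, and the injectable parameters exist only so the logic can be exercised without NLTK or network access.

```python
from typing import Callable, Optional


def ensure_resource(
    resource_path: str,
    package: str,
    find: Optional[Callable[[str], object]] = None,
    download: Optional[Callable[[str], bool]] = None,
) -> bool:
    """Return True if the resource is available once this call finishes."""
    if find is None or download is None:
        import nltk  # imported lazily so the helper can be tested without NLTK

        find = find or nltk.data.find
        download = download or nltk.download

    try:
        find(resource_path)
        return True  # already present, nothing to download
    except LookupError:
        pass

    # nltk.download reports failure through its return value rather than raising.
    if not download(package):
        return False

    # Look again, in case the data landed outside the search path.
    try:
        find(resource_path)
        return True
    except LookupError:
        return False
```

With this helper, the constructor above could call `ensure_resource('tokenizers/punkt', 'punkt')` and log a warning when it returns False.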
