Unstructured text extractor missing nltk/data punkt

langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.


Self Checks

[X] This is only for bug report, if you would like to ask a question, please head to Discussions.
[X] I have searched for existing issues, including closed ones.
[X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[X] Please do not modify this template :) and fill in all the required fields.

Dify version

0.6.8

Cloud or Self Hosted

Self Hosted (Docker), Self Hosted (Source)

Steps to reproduce

Upload a text file to Knowledge with the ETL type set to Unstructured.

✔️ Expected Behavior

The NLTK punkt resource should be available (downloaded if missing) before the extractor uses it.

❌ Actual Behavior

nltk is installed together with unstructured, but the punkt resource is not downloaded. Maybe it needs to be imported and downloaded in the unstructured extractors?

Error raised by the indexing estimate API, in core/rag/extractor/unstructured/unstructured_text_extractor.py, in extract at line 29:

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/root/nltk_data'
    - '/usr/local/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************

Suggestion:

import nltk

# Download the 'punkt' resource
nltk.download('punkt')

import logging

import nltk

from core.rag.extractor.extractor_base import BaseExtractor
from core.rag.models.document import Document

logger = logging.getLogger(__name__)


class UnstructuredTextExtractor(BaseExtractor):
    """Load msg files.

    Args:
        file_path: Path to the file to load.
    """

    def __init__(
        self,
        file_path: str,
        api_url: str
    ):
        """Initialize with file path."""
        self._file_path = file_path
        self._api_url = api_url

        # Ensure the 'punkt' resource is downloaded
        try:
            nltk.data.find('tokenizers/punkt')
        except LookupError:
            nltk.download('punkt')

    def extract(self) -> list[Document]:
        from unstructured.partition.text import partition_text
        elements = partition_text(filename=self._file_path)
        from unstructured.chunking.title import chunk_by_title
        chunks = chunk_by_title(elements, max_characters=2000, combine_text_under_n_chars=2000)
        documents = []
        for chunk in chunks:
            text = chunk.text.strip()
            documents.append(Document(page_content=text))

        return documents
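
The find-then-download check in the suggestion above could also live in a small helper, so it is not repeated in every extractor. The following is only a sketch, not Dify's actual code: `ensure_resource` and `ensure_punkt` are hypothetical names, and the lookup and download functions are passed in as parameters so the logic can be exercised without network access.

```python
from typing import Callable


def ensure_resource(
    resource_path: str,
    package_id: str,
    find: Callable[[str], object],
    download: Callable[[str], bool],
) -> bool:
    """Return True if the resource is available once this call finishes.

    `find` must raise LookupError when the resource is missing
    (as nltk.data.find does); `download` must return True on success.
    """
    try:
        find(resource_path)
        return True  # already present, nothing to download
    except LookupError:
        pass

    if not download(package_id):
        return False

    # Look again so a "successful" download that did not produce the
    # resource is reported as a failure.
    try:
        find(resource_path)
        return True
    except LookupError:
        return False


def ensure_punkt() -> bool:
    """Hypothetical wrapper that wires the helper to NLTK."""
    import nltk  # imported lazily so the helper above has no hard dependency

    return ensure_resource(
        "tokenizers/punkt",
        "punkt",
        nltk.data.find,
        lambda package: nltk.download(package, quiet=True),
    )
```

Calling `ensure_punkt()` once at startup, or pre-downloading punkt when the Docker image is built, would avoid a network request inside each extractor's `__init__`. Which of these fits Dify best is left open here.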