Closed AndyMik90 closed 1 week ago
To resolve the NLTK 'punkt' resource not being found in Dify version 0.6.8, make sure the resource is downloaded before it is used. You can do this by adding code that downloads 'punkt' if it is not already available.
Here is the updated UnstructuredTextExtractor class with the necessary code to download the 'punkt' resource:
import logging

import nltk

from core.rag.extractor.extractor_base import BaseExtractor
from core.rag.models.document import Document

logger = logging.getLogger(__name__)


class UnstructuredTextExtractor(BaseExtractor):
    """Load text files.

    Args:
        file_path: Path to the file to load.
        api_url: URL of the Unstructured API.
    """

    def __init__(
        self,
        file_path: str,
        api_url: str
    ):
        """Initialize with file path."""
        self._file_path = file_path
        self._api_url = api_url

        # Ensure the 'punkt' resource is downloaded
        try:
            nltk.data.find('tokenizers/punkt')
        except LookupError:
            nltk.download('punkt')

    def extract(self) -> list[Document]:
        # Imported lazily so unstructured is only loaded when extraction runs
        from unstructured.partition.text import partition_text
        elements = partition_text(filename=self._file_path)

        from unstructured.chunking.title import chunk_by_title
        chunks = chunk_by_title(elements, max_characters=2000, combine_text_under_n_chars=2000)

        # Wrap each chunk's stripped text in a Document
        documents = []
        for chunk in chunks:
            text = chunk.text.strip()
            documents.append(Document(page_content=text))

        return documents
This code checks whether the 'punkt' tokenizer is available when the extractor is constructed and downloads it if it is not, so the resource is present by the time extract() runs [1].
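If several extractors need the same check, the find-then-download pattern could be pulled into a small helper. The following is only a sketch under assumptions: the name `ensure_nltk_resource` and its `find`/`download` parameters are hypothetical and not part of Dify or NLTK. They default to `nltk.data.find` and `nltk.download`, and exist mainly so the logic can be exercised without network access.

```python
from typing import Any, Callable, Optional


def ensure_nltk_resource(
    resource_path: str,
    package: str,
    find: Optional[Callable[[str], Any]] = None,
    download: Optional[Callable[[str], Any]] = None,
) -> bool:
    """Download an NLTK resource only if it cannot be found locally.

    Hypothetical helper: returns True if a download was triggered,
    False if the resource was already available.
    """
    if find is None or download is None:
        # Import nltk lazily so the helper can be tested with fakes.
        import nltk

        find = find or nltk.data.find
        download = download or nltk.download

    try:
        # nltk.data.find raises LookupError when the resource is missing.
        find(resource_path)
        return False
    except LookupError:
        download(package)
        return True
```

In the constructor above, the try/except block would then reduce to `ensure_nltk_resource('tokenizers/punkt', 'punkt')`. Note that the first call on a machine without the resource still needs network access, so for Docker deployments it may be preferable to download the resource at image build time instead.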
Self Checks
Dify version
0.6.8
Cloud or Self Hosted
Self Hosted (Docker), Self Hosted (Source)
Steps to reproduce
Upload a text file to Knowledge using the Unstructured ETL type.
✔️ Expected Behavior
The NLTK 'punkt' resource is imported (downloaded) before it is used.
❌ Actual Behavior
nltk is installed with unstructured, but maybe it needs to be imported in the unstructured extractors?
Error with the indexing estimate API: core/rag/extractor/unstructured/unstructured_text_extractor.py, in extract, at line 29
Suggestion: