Working with Langchain I get nlkt errors telling me: Package "tokenizers" not found in index and Packaage "taggers" not found in index

venturaEffect commented 1 year ago

Issue you'd like to raise.

I'm trying to load some documents, powerpoints and text to train my custom LLm using Langchain.

When I run it I come to a weird error message where it tells I don't have "tokenizers" and "taggers" packages (folders).

I've read the docs, asked Langchain chatbot, pip install nltk, uninstall, pip install nltk without dependencies, added them with nltk.download(), nltk.download("punkt"), nltk.download("all"),... Did also manually put on the path: nltk.data.path = ['C:\Users\zaesa\AppData\Roaming\nltk_data'] and added all the folders. Added the tokenizers folder and taggers folder from the github repo: . Everything. Also asked on the Github repo. Nothing, no success.

Here the code of the file I try to run:

` from nltk.tokenize import sent_tokenize from langchain.document_loaders import UnstructuredPowerPointLoader, TextLoader, UnstructuredWordDocumentLoader from dotenv import load_dotenv, find_dotenv import os import openai import sys import nltk nltk.data.path = ['C:\Users\zaesa\AppData\Roaming\nltk_data'] nltk.download( 'punkt', download_dir='C:\Users\zaesa\AppData\Roaming\nltk_data')

sys.path.append('../..')

_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

folder_path_docx = "DB\ DB VARIADO\DOCS" folder_path_txt = "DB\BLOG-POSTS" folder_path_pptx_1 = "DB\PPT DAY JUNIO" folder_path_pptx_2 = "DB\DB VARIADO\PPTX"

loaded_content = []

for file in os.listdir(folder_path_docx): if file.endswith(".docx"): file_path = os.path.join(folder_path_docx, file) loader = UnstructuredWordDocumentLoader(file_path) docx = loader.load() loaded_content.extend(docx)

for file in os.listdir(folder_path_txt): if file.endswith(".txt"): file_path = os.path.join(folder_path_txt, file) loader = TextLoader(file_path, encoding='utf-8') text = loader.load() loaded_content.extend(text)

for file in os.listdir(folder_path_pptx_1): if file.endswith(".pptx"): file_path = os.path.join(folder_path_pptx_1, file) loader = UnstructuredPowerPointLoader(file_path) slides_1 = loader.load() loaded_content.extend(slides_1)

for file in os.listdir(folder_path_pptx_2): if file.endswith(".pptx"): file_path = os.path.join(folder_path_pptx_2, file) loader = UnstructuredPowerPointLoader(file_path) slides_2 = loader.load() loaded_content.extend(slides_2)

print(loaded_content[0].page_content) print(nltk.data.path)

installed_packages = nltk.downloader.Downloader( download_dir='C:\Users\zaesa\AppData\Roaming\nltk_data').packages()

print(installed_packages)

sent_tokenize("Hello. How are you? I'm well.") `

When running the file I get:

` [nltk_data] in index [nltk_data] Error loading tokenizers: Package 'tokenizers' not found [nltk_data] in index [nltk_data] Error loading taggers: Package 'taggers' not found in [nltk_data] index [nltk_data] Error loading tokenizers: Package 'tokenizers' not found [nltk_data] in index [nltk_data] Error loading tokenizers: Package 'tokenizers' not found [nltk_data] in index [nltk_data] Error loading tokenizers: Package 'tokenizers' not found [nltk_data] in index [nltk_data] Error loading taggers: Package 'taggers' not found in [nltk_data] index [nltk_data] Error loading tokenizers: Package 'tokenizers' not found [nltk_data] in index [nltk_data] Error loading tokenizers: Package 'tokenizers' not found [nltk_data] in index [nltk_data] Error loading tokenizers: Package 'tokenizers' not found [nltk_data] in index [nltk_data] Error loading tokenizers: Package 'tokenizers' not found [nltk_data] in index [nltk_data] Error loading taggers: Package 'taggers' not found in [nltk_data] index [nltk_data] Error loading tokenizers: Package 'tokenizers' not found [nltk_data] in index [nltk_data] Error loading tokenizers: Package 'tokenizers' not found [nltk_data] in index [nltk_data] Error loading tokenizers: Package 'tokenizers' not found [nltk_data] in index [nltk_data] Error loading tokenizers: Package 'tokenizers' not found [nltk_data] in index [nltk_data] Error loading taggers: Package 'taggers' not found in [nltk_data] index [nltk_data] Error loading tokenizers: Package 'tokenizers' not found [nltk_data] in index [nltk_data] Error loading tokenizers: Package 'tokenizers' not found [nltk_data] in index [nltk_data] Error loading tokenizers: Package 'tokenizers' not found [nltk_data] in index [nltk_data] Error loading tokenizers: Package 'tokenizers' not found [nltk_data] in index [nltk_data] Error loading taggers: Package 'taggers' not found in [nltk_data] index

HERE SOME TEXT -

['C:\Users\zaesa\AppData\Roaming\nltk_data'] dict_values([, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ]) `

And here is how my folders structure from nltk_data looks like:

Suggestion:

I have fresh installed nltk with no dependencies. The version is the latest. The support team from NLTK doesn't know what is wrong. It seems everything is fine. So, it has to be a bug or something coming from Langchain that I'm not able to see. Really appreciate any help. Need to make this work! Thank you.

y-f-a commented 1 year ago

I encounter a similar issue using Google Colab. I am using the qachain and running code I have run many times before, but now receiving this today:

[nltk_data] Error loading tokenizers: Package 'tokenizers' not found [nltk_data] in index [nltk_data] Error loading tokenizers: Package 'tokenizers' not found [nltk_data] in index [nltk_data] Error loading tokenizers: Package 'tokenizers' not found [nltk_data] in index [nltk_data] Error loading tokenizers: Package 'tokenizers' not found [nltk_data] in index [nltk_data] Error loading tokenizers: Package 'tokenizers' not found [nltk_data] in index [nltk_data] Error loading tokenizers: Package 'tokenizers' not found [nltk_data] in index [nltk_data] Error loading tokenizers: Package 'tokenizers' not found [nltk_data] in index [nltk_data] Error loading tokenizers: Package 'tokenizers' not found [nltk_data] in index [nltk_data] Error loading taggers: Package 'taggers' not found in [nltk_data] index

LookupError Traceback (most recent call last) in <cell line: 10>() 8 9 # get the langchain qa chain for the given company ---> 10 qa = get_company_qa_chain(text_dir_path)

17 frames /usr/local/lib/python3.10/dist-packages/nltk/data.py in find(resource_name, paths) 581 sep = "" 70 582 resource_not_found = f"\n{sep}\n{msg}\n{sep}\n" --> 583 raise LookupError(resource_not_found) 584 585

LookupError:

Resource averaged_perceptron_tagger not found. Please use the NLTK Downloader to obtain the resource:

import nltk nltk.download('averaged_perceptron_tagger')

For more information see: https://www.nltk.org/data.html

Attempted to load taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle

Searched in:

'/root/nltk_data'
'/usr/nltk_data'
'/usr/share/nltk_data'
'/usr/lib/nltk_data'
'/usr/share/nltk_data'
'/usr/local/share/nltk_data'
'/usr/lib/nltk_data'
'/usr/local/lib/nltk_data'

venturaEffect commented 1 year ago

Thanks for sharing.

So have you come to a solution or are you facing the same problem.

Appreciate

y-f-a commented 1 year ago

Sadly, I am facing the same problem and still trying to find the solution. I was just adding some further context in the hope that it could help us. If I find a solution, I will definitely share it here!

y-f-a commented 1 year ago

I found a workaround, for my situation anyway. The problem seems to be with the Directory loader. I went from:

loader = DirectoryLoader(text_dir, glob="*.txt", recursive=True, silent_errors=True)

to

loader = DirectoryLoader(text_dir, glob="*.txt", recursive=True, silent_errors=True, loader_cls=TextLoader)

This also required:

from langchain.document_loaders import DirectoryLoader, TextLoader

Hopefully that works for you too.

dosubot[bot] commented 1 year ago

Answer generated by a 🤖

Answer

I understand that you're encountering issues with NLTK while working with LangChain. Specifically, you're seeing errors related to the 'tokenizers' and 'taggers' packages not being found. I should note that as an AI, I can only process text and cannot view images or links.

The issue you're experiencing seems to be related to NLTK, a Python library used by LangChain for text processing, rather than LangChain itself. The error messages suggest that NLTK can't locate the 'tokenizers' and 'taggers' packages. However, these are not standard packages in NLTK. The standard tokenizer in NLTK is 'punkt', which you've correctly downloaded with nltk.download('punkt').

This problem might be due to a misconfiguration or misunderstanding. Here are a few suggestions:

Verify that NLTK is correctly installed and up-to-date by running pip install --upgrade nltk.
Confirm that the 'punkt' tokenizer is correctly installed. You can do this by running the following code:

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

print(word_tokenize("This is a test."))

If the 'punkt' tokenizer is correctly installed, this should output ['This', 'is', 'a', 'test', '.'].

If you're still encountering issues, you might find it helpful to completely uninstall NLTK, delete your nltk_data directory, and then reinstall NLTK. You can uninstall NLTK by running pip uninstall nltk, and you can reinstall it by running pip install nltk.
If none of the above solutions work, you might want to consider using a different text processing library, such as Spacy. LangChain also supports Spacy, and you can use it by changing NLTKTextSplitter to SpacyTextSplitter in your code.

Please note that this is just one potential solution to your problem. If you believe this is a bug that could impact other users and you've found a solution, we encourage you to make a pull request. If this answer contradicts your experience, you might want to consider upgrading to the latest version of LangChain.

Sources

libs/langchain/langchain/text_splitter.py

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

akowalsk commented 1 year ago

Upgrading to the latest unstructured fixed for me pip install -U unstructured. See https://github.com/Unstructured-IO/unstructured/commit/e017e99b5ba3226249b34d6c0920bd7c4301da4e

akowalsk commented 1 year ago

Oh sorry I didn't notice you're the same one that posted on the nltk issue.

dosubot[bot] commented 1 year ago

Hi, @venturaEffect! I'm Dosu, and I'm helping the LangChain team manage our backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding, you encountered an error message related to the "tokenizers" and "taggers" packages not being found when working with Langchain. Another user, @reyfar, also encountered a similar issue and suggested a workaround by modifying the DirectoryLoader and importing TextLoader. In response, I provided potential solutions, including verifying the installation of NLTK and considering using a different text processing library like Spacy. Additionally, @akowalsk suggested upgrading the "unstructured" package as a solution.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project!

langchain-ai / langchain