Closed: mballav closed this issue 1 month ago
To resolve the "No such file or directory: '/root/nltk_data/tokenizers/punkt/PY3_tab'" error after upgrading llama-index and llama-index-core to version 0.11.8, you need to ensure that the NLTK data, specifically the punkt tokenizer, is properly downloaded and accessible. The GlobalsHelper class in the code handles this by downloading the necessary NLTK data if it is not found.
Here is the relevant part of the code that ensures the punkt tokenizer is downloaded:
import os
from typing import List, Optional

class GlobalsHelper:
    """Helper to retrieve globals.

    Helpful for global caching of certain variables that can be expensive to load.
    (e.g. tokenization)
    """

    _stopwords: Optional[List[str]] = None
    _nltk_data_dir: Optional[str] = None

    def __init__(self) -> None:
        """Initialize NLTK stopwords and punkt."""
        import nltk

        # Use NLTK_DATA if set, otherwise a cache directory next to this module
        self._nltk_data_dir = os.environ.get(
            "NLTK_DATA",
            os.path.join(
                os.path.dirname(os.path.abspath(__file__)),
                "_static/nltk_cache",
            ),
        )

        if self._nltk_data_dir not in nltk.data.path:
            nltk.data.path.append(self._nltk_data_dir)

        # ensure access to data is there
        try:
            nltk.data.find("corpora/stopwords", paths=[self._nltk_data_dir])
        except LookupError:
            nltk.download("stopwords", download_dir=self._nltk_data_dir)

        try:
            nltk.data.find("tokenizers/punkt", paths=[self._nltk_data_dir])
        except LookupError:
            nltk.download("punkt", download_dir=self._nltk_data_dir)
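The find-then-download pattern in the excerpt above can be sketched as a small standalone helper. This is only an illustration, not part of llama-index: the name ensure_resource and its find/download parameters are hypothetical, added so the logic can be tried with stand-in functions when nltk or network access is unavailable. By default it falls back to nltk.data.find and nltk.download, the same calls used in the excerpt.

```python
from typing import Callable, Optional


def ensure_resource(
    resource_path: str,
    package: str,
    data_dir: str,
    find: Optional[Callable[..., object]] = None,
    download: Optional[Callable[..., object]] = None,
) -> bool:
    """Look up an NLTK resource in data_dir and download it if it is missing.

    Returns True if a download was triggered, False if the resource was found.
    The find/download parameters are hypothetical hooks for testing.
    """
    if find is None or download is None:
        # Imported lazily so the helper can be exercised with stand-ins.
        import nltk

        find = find or nltk.data.find
        download = download or nltk.download

    try:
        find(resource_path, paths=[data_dir])
        return False
    except LookupError:
        download(package, download_dir=data_dir)
        return True
```

Under these assumptions, ensure_resource("tokenizers/punkt", "punkt", data_dir) mirrors the second try/except block in the excerpt.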
To resolve the error, you can manually download the punkt tokenizer by running the following Python code:
import nltk
nltk.download('punkt')
Alternatively, you can set the NLTK_DATA environment variable to a directory where NLTK data can be downloaded and accessed:
import os
# Set the NLTK_DATA environment variable before importing nltk,
# because nltk builds its search path from it at import time
os.environ['NLTK_DATA'] = '/path/to/your/nltk_data'
# Ensure the directory exists
os.makedirs(os.environ['NLTK_DATA'], exist_ok=True)
import nltk
# Download the punkt tokenizer
nltk.download('punkt', download_dir=os.environ['NLTK_DATA'])
Replace '/path/to/your/nltk_data' with the actual path where you want to store the NLTK data. This should resolve the "No such file or directory" error [1].
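Before re-downloading anything, it can help to confirm what is actually present in the data directory. The sketch below is a plain filesystem check, not NLTK's own lookup (that is nltk.data.find); the function name check_nltk_resources is hypothetical. It assumes a resource may be stored either as an unpacked directory or as a .zip archive, and it takes /root/nltk_data from the error message as the fallback location.

```python
import os
from typing import Dict, Iterable


def check_nltk_resources(data_dir: str, resources: Iterable[str]) -> Dict[str, bool]:
    """Map each resource path to True if a directory or .zip for it exists under data_dir."""
    found = {}
    for resource in resources:
        base = os.path.join(data_dir, *resource.split("/"))
        found[resource] = os.path.isdir(base) or os.path.isfile(base + ".zip")
    return found


if __name__ == "__main__":
    # Fallback path taken from the error message; adjust for your image.
    data_dir = os.environ.get("NLTK_DATA", "/root/nltk_data")
    wanted = ["tokenizers/punkt", "tokenizers/punkt_tab"]
    for name, present in check_nltk_resources(data_dir, wanted).items():
        print(f"{name}: {'found' if present else 'MISSING'}")
```

Running this inside the container shows whether the files are missing or whether the lookup itself is the problem.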
@mballav what does your Dockerfile look like? How are you installing things? I've been running the latest in Docker for a while and have never seen this.
Here you go @logan-markewich. Thanks a lot for the prompt reply.
FROM quaeris/docsearch-ma:multiarch2 as local-development
COPY src/requirements.txt requirements.txt
RUN pip install -r requirements.txt
COPY ./src/ssi.txt ssi.txt
RUN pip install -r ssi.txt
RUN rm ssi.txt
RUN pip install PyJWT[crypto]==2.8.0
COPY ./nltk_data /root/nltk_data
WORKDIR /home
COPY ./src/run_production_server.sh ./run_production_server.sh
RUN chmod +x /home/run_production_server.sh
I have also downloaded the nltk data for 'punkt' and 'punkt_tab' using the latest nltk, 3.9.1.
And what does your requirements.txt look like?
My apologies. I had a package inside requirements.txt that installs a version of nltk older than 3.9.1! Found it. Thanks again!
Closing it now.
Great! Glad you got it
Bug Description
I upgraded llama-index and llama-index-core to 0.11.8 and after the upgrade, when I ran the docker image, it threw the following error.
Version
0.11.8
Steps to Reproduce
Upgrade the version to 0.11.8. Make sure all requirements are installed correctly. Run the application docker image. The following error is thrown.
Relevant Logs/Tracebacks