run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: llama-index-core cannot find nltk punkt_tab data #15944

Closed mballav closed 1 month ago

mballav commented 1 month ago

Bug Description

I upgraded llama-index and llama-index-core to 0.11.8, and after the upgrade, running the Docker image threw the following error.

Version

0.11.8

Steps to Reproduce

1. Upgrade llama-index and llama-index-core to 0.11.8.
2. Make sure all requirements are installed correctly.
3. Run the application Docker image. The error below is thrown.

Relevant Logs/Tracebacks

docsearch-worker-1  |   File "/usr/local/lib/python3.11/site-packages/llama_index/core/schema.py", line 27, in <module>
docsearch-worker-1  |     from llama_index.core.utils import SAMPLE_TEXT, truncate_text
docsearch-worker-1  |   File "/usr/local/lib/python3.11/site-packages/llama_index/core/utils.py", line 89, in <module>
docsearch-worker-1  |     globals_helper = GlobalsHelper()
docsearch-worker-1  |                      ^^^^^^^^^^^^^^^
docsearch-worker-1  |   File "/usr/local/lib/python3.11/site-packages/llama_index/core/utils.py", line 65, in __init__
docsearch-worker-1  |     nltk.data.find("tokenizers/punkt_tab")
docsearch-worker-1  |   File "/usr/local/lib/python3.11/site-packages/nltk/data.py", line 539, in find
docsearch-worker-1  |     return FileSystemPathPointer(p)
docsearch-worker-1  |            ^^^^^^^^^^^^^^^^^^^^^^^^
docsearch-worker-1  |   File "/usr/local/lib/python3.11/site-packages/nltk/compat.py", line 41, in _decorator
docsearch-worker-1  |     return init_func(*args, **kwargs)
docsearch-worker-1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^
docsearch-worker-1  |   File "/usr/local/lib/python3.11/site-packages/nltk/data.py", line 315, in __init__
docsearch-worker-1  |     raise IOError("No such file or directory: %r" % _path)
docsearch-worker-1  | OSError: No such file or directory: '/root/nltk_data/tokenizers/punkt/PY3_tab'
dosubot[bot] commented 1 month ago

To resolve the "No such file or directory: '/root/nltk_data/tokenizers/punkt/PY3_tab'" error after upgrading llama-index and llama-index-core to version 0.11.8, ensure that the NLTK data, specifically the punkt tokenizer, is properly downloaded and accessible. The GlobalsHelper class handles this by downloading the necessary NLTK data if it is not found.

Here is the relevant part of the code that ensures the punkt tokenizer is downloaded:

import os
from typing import List, Optional


class GlobalsHelper:
    """Helper to retrieve globals.

    Helpful for global caching of certain variables that can be expensive to load.
    (e.g. tokenization)

    """

    _stopwords: Optional[List[str]] = None
    _nltk_data_dir: Optional[str] = None

    def __init__(self) -> None:
        """Initialize NLTK stopwords and punkt."""
        import nltk

        self._nltk_data_dir = os.environ.get(
            "NLTK_DATA",
            os.path.join(
                os.path.dirname(os.path.abspath(__file__)),
                "_static/nltk_cache",
            ),
        )

        if self._nltk_data_dir not in nltk.data.path:
            nltk.data.path.append(self._nltk_data_dir)

        # ensure access to data is there
        try:
            nltk.data.find("corpora/stopwords", paths=[self._nltk_data_dir])
        except LookupError:
            nltk.download("stopwords", download_dir=self._nltk_data_dir)

        try:
            nltk.data.find("tokenizers/punkt", paths=[self._nltk_data_dir])
        except LookupError:
            nltk.download("punkt", download_dir=self._nltk_data_dir)

To resolve the error, you can manually download the punkt tokenizer by running the following Python code:

import nltk

nltk.download('punkt')

Alternatively, you can set the NLTK_DATA environment variable to a directory where NLTK data can be downloaded and accessed:

import os
import nltk

# Set the NLTK_DATA environment variable
os.environ['NLTK_DATA'] = '/path/to/your/nltk_data'

# Ensure the directory exists
if not os.path.exists(os.environ['NLTK_DATA']):
    os.makedirs(os.environ['NLTK_DATA'])

# Download the punkt tokenizer
nltk.download('punkt', download_dir=os.environ['NLTK_DATA'])

Replace '/path/to/your/nltk_data' with the actual path where you want to store the NLTK data. This should resolve the "No such file or directory" error [1].
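The lookup that fails in the traceback can be sketched in plain Python. Below is a simplified stand-in for nltk.data.find, not the real implementation (the real function also searches zip archives and every entry in nltk.data.path): it walks each candidate data directory and raises LookupError when the resource is missing, which is exactly the condition GlobalsHelper catches before downloading.

```python
import os


def find_nltk_resource(resource, data_dirs):
    """Simplified stand-in for nltk.data.find: return the first path
    under data_dirs that contains `resource`, else raise LookupError.
    (The real function also searches zip files and nltk.data.path.)"""
    for data_dir in data_dirs:
        candidate = os.path.join(data_dir, resource)
        if os.path.exists(candidate):
            return candidate
    raise LookupError("%s not found in %s" % (resource, data_dirs))
```

This is why setting NLTK_DATA works: it simply prepends another directory to the list being searched.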

To continue talking to Dosu, mention @dosu.

logan-markewich commented 1 month ago

@mballav what does your Dockerfile look like? How are you installing things? I've been running the latest in Docker for a while and have never seen this

mballav commented 1 month ago

Here you go @logan-markewich. Thanks a lot for the prompt reply.

FROM quaeris/docsearch-ma:multiarch2 as local-development

COPY src/requirements.txt requirements.txt
RUN pip install -r requirements.txt

COPY ./src/ssi.txt ssi.txt
RUN pip install -r ssi.txt
RUN rm ssi.txt

RUN pip install PyJWT[crypto]==2.8.0
COPY ./nltk_data /root/nltk_data

WORKDIR /home
COPY ./src/run_production_server.sh ./run_production_server.sh
RUN chmod +x /home/run_production_server.sh

And I have downloaded the nltk data for 'punkt' and 'punkt_tab' using the latest nltk (3.9.1).
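For context (an inference from this thread, not an official nltk statement): punkt_tab is only resolvable by newer nltk releases, while older releases go through the py3_data compatibility shim visible in the traceback (nltk/compat.py) and end up looking for the pre-3.9 punkt/PY3 layout, producing the odd punkt/PY3_tab path. A tiny sketch of choosing the resource name by installed version, with the 3.9 cutoff as the assumed boundary:

```python
def punkt_resource_for(nltk_version):
    """Pick the Punkt resource name for a given nltk version string.
    The 3.9 cutoff is an assumption inferred from this issue thread."""
    major, minor = (int(part) for part in nltk_version.split(".")[:2])
    return "punkt_tab" if (major, minor) >= (3, 9) else "punkt"
```

So data downloaded with nltk 3.9.1 (punkt_tab) is invisible to an older nltk at runtime, even if the files are copied into the image correctly.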

logan-markewich commented 1 month ago

And what does your requirements.txt look like?

mballav commented 1 month ago

My apologies. I had a package inside requirements.txt that installs a version of nltk older than 3.9.1! Found it. Thanks again!
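One way to hunt for a pin like this is to grep the requirement files the Dockerfile installs from (a sketch with hypothetical file contents standing in for the real ones; a transitive pin pulled in by another package would need a dependency-tree tool such as pipdeptree instead):

```shell
# Demo setup: hypothetical contents standing in for the real files.
printf 'some-package==1.0\nnltk<3.9.1\n' > requirements.txt
printf 'PyJWT[crypto]==2.8.0\n' > ssi.txt

# The audit: print any line mentioning nltk, with file name and
# line number, across the files the Dockerfile installs from.
grep -Hn "nltk" requirements.txt ssi.txt
```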

mballav commented 1 month ago

Closing it now.

logan-markewich commented 1 month ago

Great! Glad you got it