run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Not respecting LLAMA_INDEX_CACHE_DIR environment variable #10720

Closed Mazzesy closed 5 months ago

Mazzesy commented 9 months ago

Bug Description

According to the documentation, I can control where additional data is downloaded by setting the LLAMA_INDEX_CACHE_DIR environment variable. However, even with this variable set, LlamaIndex seems to ignore it and continues to store data in a different location.

Version

0.10.4

Steps to Reproduce

Here's how I'm setting the environment variable in my Python script:

import os
os.environ["LLAMA_INDEX_CACHE_DIR"] = "/path/to/my/cache/directory"

When creating the index storage (see code below), nltk_data gets downloaded to /Users/user/nltk_data instead of the path I set as the environment variable.

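# Import paths assume the llama-index 0.10 package layout.
from pathlib import Path
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.readers.file import UnstructuredReader

loader = UnstructuredReader()
doc = loader.load_data(file=Path(file), split_documents=False)  # `file` is the path to the input document
storage_context = StorageContext.from_defaults()
cur_index = VectorStoreIndex.from_documents(doc, storage_context=storage_context)
storage_context.persist(persist_dir="./storage/name")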

Relevant Logs/Tracebacks

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/user/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/user/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.

semoal commented 9 months ago

https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/utils.py#L48

Looks like it has been replaced by NLTK_DATA

@logan-markewich is it expected?
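
If NLTK_DATA is indeed the variable that matters here, a workaround might look like the sketch below. It assumes the NLTK downloader is what writes to /Users/user/nltk_data (as the logs suggest) and relies on NLTK reading NLTK_DATA when it is first imported, so the variable has to be set before nltk, unstructured, or llama_index are imported. The helper name configure_nltk_data_dir is hypothetical and is not part of LlamaIndex or NLTK; whether this fully redirects the downloads in 0.10.4 is not confirmed in this thread.

```python
import os
from pathlib import Path


def configure_nltk_data_dir(target: str) -> Path:
    """Create `target` if needed and point the NLTK_DATA variable at it.

    Hypothetical helper for illustration; not a LlamaIndex or NLTK API.
    """
    data_dir = Path(target).expanduser().resolve()
    # The directory should exist and be writable before NLTK looks at it.
    data_dir.mkdir(parents=True, exist_ok=True)
    os.environ["NLTK_DATA"] = str(data_dir)
    return data_dir


# Usage (call this first, then import the libraries that use NLTK):
#   configure_nltk_data_dir("/path/to/my/cache/directory")
#   from llama_index.core import VectorStoreIndex
```

Setting the variable in the shell before starting Python would avoid the import-order concern altogether.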

dosubot[bot] commented 6 months ago

Hi, @Mazzesy,

I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale. The reported issue involved the LLAMA_INDEX_CACHE_DIR environment variable not being respected by LlamaIndex, despite following the documentation. User semoal suggested that the variable has been replaced by NLTK_DATA, and the maintainers, including logan-markewich, were asked if this replacement is expected.

Could you please confirm if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, please let the LlamaIndex team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you for your understanding and cooperation. If you have any further questions or need assistance, feel free to reach out to me or the LlamaIndex team.
