run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

StorageContext can't init with persist_dir #3734

Closed. madawei2699 closed this issue 12 months ago

madawei2699 commented 1 year ago

This is my code:

from pathlib import Path

from llama_index import StorageContext, load_index_from_storage

index_cache_web_dir = Path('/tmp/cache_web/')

if not index_cache_web_dir.is_dir():
    index_cache_web_dir.mkdir(parents=True, exist_ok=True)

web_storage_context = StorageContext.from_defaults(persist_dir=str(index_cache_web_dir))

Then it raises an error; the traceback is:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/miniconda3/envs/py310/lib/python3.10/site-packages/llama_index/storage/storage_context.py", line 61, in from_defaults
    docstore = docstore or SimpleDocumentStore.from_persist_dir(
  File "/opt/miniconda3/envs/py310/lib/python3.10/site-packages/llama_index/storage/docstore/simple_docstore.py", line 51, in from_persist_dir
    return cls.from_persist_path(persist_path, namespace=namespace, fs=fs)
  File "/opt/miniconda3/envs/py310/lib/python3.10/site-packages/llama_index/storage/docstore/simple_docstore.py", line 69, in from_persist_path
    simple_kvstore = SimpleKVStore.from_persist_path(persist_path, fs=fs)
  File "/opt/miniconda3/envs/py310/lib/python3.10/site-packages/llama_index/storage/kvstore/simple_kvstore.py", line 75, in from_persist_path
    with fs.open(persist_path, "rb") as f:
  File "/opt/miniconda3/envs/py310/lib/python3.10/site-packages/fsspec/spec.py", line 1199, in open
    f = self._open(
  File "/opt/miniconda3/envs/py310/lib/python3.10/site-packages/fsspec/implementations/local.py", line 183, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
  File "/opt/miniconda3/envs/py310/lib/python3.10/site-packages/fsspec/implementations/local.py", line 314, in __init__
    self._open()
  File "/opt/miniconda3/envs/py310/lib/python3.10/site-packages/fsspec/implementations/local.py", line 319, in _open
    self.f = open(self.path, mode=self.mode)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/cache_web/docstore.json'

The Python version is 3.10, the llama-index version is 0.6.9, and the langchain version is 0.0.154.
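The traceback explains the failure: StorageContext.from_defaults(persist_dir=...) tries to load existing stores (docstore.json and friends) from the directory instead of initializing empty ones, so pointing it at a freshly created empty directory cannot work. A minimal sketch of one way around this, assuming it is acceptable to persist an empty StorageContext once so the expected JSON files exist (the path is illustrative):

from llama_index import StorageContext

# Persist an empty StorageContext once so that the JSON files the loader
# expects (docstore.json, index_store.json, vector_store.json) exist on disk.
storage_context = StorageContext.from_defaults()
storage_context.persist(persist_dir='/tmp/cache_web')

# Loading with persist_dir now finds the files it expects.
web_storage_context = StorageContext.from_defaults(persist_dir='/tmp/cache_web')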

bot403 commented 1 year ago

I had the same issue. It seems to be a problem of missing documentation and full examples. That said, I'd second a motion to make this easier and more intuitive by allowing StorageContext to initialize a blank persist directory.

I solved it by first creating the index without the persist_dir argument or storage context. Then I called index.storage_context.persist(persist_dir=persist_directory), which created the required JSON files.

Once created, I could load it with index = load_index_from_storage(storage_context)
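A short sketch of that workaround, assuming documents come from SimpleDirectoryReader and that the ./data and ./storage paths are placeholders of your choosing (neither is from the original comment):

from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)

persist_directory = './storage'  # illustrative path

# First run: build the index without persist_dir, then persist it once.
documents = SimpleDirectoryReader('./data').load_data()
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir=persist_directory)

# Later runs: load the persisted index from disk.
storage_context = StorageContext.from_defaults(persist_dir=persist_directory)
index = load_index_from_storage(storage_context)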

dosubot[bot] commented 1 year ago

Hi, @madawei2699! I'm Dosu, and I'm here to help the LlamaIndex team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you reported an issue where initializing StorageContext with a persist directory raises a FileNotFoundError because the directory does not exist. The user bot403 suggested a workaround by first creating the index without the persist directory argument and then calling index.storage_context.persist(persist_dir=persist_directory) to create the required JSON files. This solution was well-received by you, norbert-liki, and wilmerhenao.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LlamaIndex repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your contribution to LlamaIndex!

lakinmindfire commented 11 months ago

Found the same issue.

jscastanoc commented 11 months ago

@bot403 thanks!

@lakinmindfire here is my current implementation of the workaround:


from llama_index import StorageContext, VectorStoreIndex, load_index_from_storage
from llama_index.storage.index_store.simple_index_store import DEFAULT_PERSIST_FNAME as INDEX_STORE_FNAME

# STORAGE_PERSIST_DIR is assumed to be a pathlib.Path pointing at the persist
# directory; `documents` is the list of documents to index.

if (STORAGE_PERSIST_DIR / INDEX_STORE_FNAME).exists():
    # The index was persisted before: load it from disk.
    storage_context = StorageContext.from_defaults(
        persist_dir=STORAGE_PERSIST_DIR
    )
    index = load_index_from_storage(storage_context=storage_context)
else:
    if not STORAGE_PERSIST_DIR.exists():
        STORAGE_PERSIST_DIR.mkdir(parents=True)

    # First run: build the index in memory, then persist it.
    storage_context = StorageContext.from_defaults()
    index = VectorStoreIndex.from_documents(documents,
                                            storage_context=storage_context,
                                            show_progress=True)

    storage_context.persist(persist_dir=STORAGE_PERSIST_DIR)

hope that helps

HenryCWong commented 10 months ago

@jscastanoc Where is STORAGE_PERSIST_DIR coming from?

jdcaballerov commented 9 months ago

This should be reopened.

sil2193 commented 8 months ago

Pasting this here in case it helps someone:

import os

from llama_index.llms import OpenAILike
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext, load_index_from_storage, StorageContext

TOGETHER_API_KEY = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

os.makedirs("storage", exist_ok=True)
os.makedirs("input_data", exist_ok=True)

# for OpenRouter see old file
llm = OpenAILike(
    model                     = "mistralai/Mixtral-8x7B-Instruct-v0.1",
    api_base                  = "https://api.together.xyz/v1",
    api_key                   = TOGETHER_API_KEY,
    is_chat_model             = True,
    is_function_calling_model = True,
    temperature               = 0.1,
    api_version               = "v1",
    max_tokens                = 1164,
)

service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")

# input_dir is the directory containing the files we want to read, and
# store_dir is where we want to persist the embeddings.
# What I was missing was passing service_context while loading.

def some_function(input_dir, store_dir, question):
    if os.path.exists(store_dir):
        # load the existing index
        storage_context = StorageContext.from_defaults(persist_dir=store_dir)
        index = load_index_from_storage(
            storage_context,
            service_context=service_context,
        )
    else:
        # load the documents and create the index
        documents = SimpleDirectoryReader(input_dir).load_data()
        index = VectorStoreIndex.from_documents(
            documents,
            service_context=service_context,
        )
        # store it for later
        index.storage_context.persist(persist_dir=store_dir)

    # either way we can now query the index
    query_engine = index.as_query_engine()
    response = query_engine.query(question)
    return response

What everyone was missing was passing the service_context when loading the persisted index.

Hope this helps someone get some sleep; it's 6 in the morning and I haven't slept today :(