langchain-ai / langchain

Calling Chroma.from_documents() returns sqlite3.OperationalError: attempt to write a readonly database, but only sometimes #14872

Open gracewzhang opened 9 months ago

gracewzhang commented 9 months ago

System Info

Platform: Ubuntu 22.04
Python: 3.11.6
LangChain: 0.0.351

Reproduction

When the program is first initialized with __setup_client() and __should_reingest() returns True, __get_new_client() works as intended. However, if reingest() is called afterward, __get_new_client() raises the error below.

Relevant code:

def __setup_client(self) -> None:
    # Rebuild the store if it is stale; otherwise reuse the one on disk.
    if self.__should_reingest():
        self.db = self.__get_new_client()
    else:
        self.db = self.__get_existing_client()

def reingest(self) -> None:
    self.db = self.__get_new_client()

def __get_new_client(self):
    # Remove any existing store before rebuilding it from scratch.
    if os.path.exists(self.persist_directory):
        shutil.rmtree(self.persist_directory)

    docs = self.__get_docs()
    client = Chroma.from_documents(
        docs, self.embedding_function, persist_directory=self.persist_directory)

    # Record the ingestion date next to the store.
    with open(f'{self.persist_directory}/date.txt', 'w') as f:
        f.write(f'{datetime.date.today()}')
    return client

Error:

Traceback (most recent call last):
  ...
  File line 26, in reingest
    self.cauldron.reingest()
  File line 19, in reingest
    self.db = self.__get_new_client()
              ^^^^^^^^^^^^^^^^^^^^^^^
  File line 51, in __get_new_client
    client = Chroma.from_documents(
             ^^^^^^^^^^^^^^^^^^^^^^
  File ".../.pyenv/versions/3.11.6/lib/python3.11/site-packages/langchain_community/vectorstores/chroma.py", line 771, in from_documents
    return cls.from_texts(
           ^^^^^^^^^^^^^^^
  File ".../.pyenv/versions/3.11.6/lib/python3.11/site-packages/langchain_community/vectorstores/chroma.py", line 729, in from_texts
    chroma_collection.add_texts(
  File ".../.pyenv/versions/3.11.6/lib/python3.11/site-packages/langchain_community/vectorstores/chroma.py", line 297, in add_texts
    self._collection.upsert(
  File ".../.pyenv/versions/3.11.6/lib/python3.11/site-packages/chromadb/api/models/Collection.py", line 459, in upsert
    self._client._upsert(
  File ".../.pyenv/versions/3.11.6/lib/python3.11/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 127, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File ".../.pyenv/versions/3.11.6/lib/python3.11/site-packages/chromadb/api/segment.py", line 446, in _upsert
    self._producer.submit_embeddings(coll["topic"], records_to_submit)
  File ".../.pyenv/versions/3.11.6/lib/python3.11/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 127, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File ".../.pyenv/versions/3.11.6/lib/python3.11/site-packages/chromadb/db/mixins/embeddings_queue.py", line 172, in submit_embeddings
    results = cur.execute(sql, params).fetchall()
              ^^^^^^^^^^^^^^^^^^^^^^^^
sqlite3.OperationalError: attempt to write a readonly database

Expected behavior

No error returned

dosubot[bot] commented 9 months ago

The error you're encountering, sqlite3.OperationalError: attempt to write a readonly database, typically occurs when the application does not have write permissions to the database file or directory. In the context of LangChain and the Chroma vector store, this could happen if the persist_directory specified during the initialization of the Chroma instance is not writable by the application.

In your code, you're removing the persist_directory and then immediately trying to write to it. This could potentially lead to a race condition where the directory is not fully set up before you're trying to write to it.

if os.path.exists(self.persist_directory):
    shutil.rmtree(self.persist_directory)

You might want to add a small delay or a check to ensure the directory is fully deleted before proceeding. Alternatively, you could change the permissions of the persist_directory to ensure it's writable by your application.

If you're still encountering the issue after checking these potential causes, it would be helpful to know more about the environment in which you're running your application. Are there any other processes that might be accessing the same persist_directory? Are you able to consistently reproduce this error, or does it only occur intermittently?
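To rule out the permissions explanation, a quick check along these lines may help. This is only a minimal sketch: `describe_dir_access` is a hypothetical helper name, and it reports what the current process is allowed to do, not what Chroma does internally.

```python
import os


def describe_dir_access(path: str) -> dict:
    """Report whether `path` exists and whether this process can write to it."""
    parent = os.path.dirname(os.path.abspath(path))
    is_dir = os.path.isdir(path)
    return {
        "exists": is_dir,
        # Only meaningful when the directory is actually there.
        "writable": is_dir and os.access(path, os.W_OK),
        # Needed so that the directory can be deleted and recreated.
        "parent_writable": os.access(parent, os.W_OK),
    }


if __name__ == "__main__":
    # "./chroma_db" is a placeholder; substitute your own persist_directory.
    print(describe_dir_access("./chroma_db"))
```

If all three values look fine right before the failing call, permissions on the directory are probably not the cause.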

As for the __get_new_client() function, based on the provided code, it seems to be a helper function used to create a new Chroma client. It first checks if the persist_directory exists and removes it if it does. Then it retrieves the documents, creates a new Chroma client with these documents and the specified embedding_function, and finally writes the current date to a file in the persist_directory. This function is used in the __setup_client() and reingest() methods to set up or refresh the Chroma client.

Please let me know if you need further assistance.

gracewzhang commented 9 months ago

I tried adding a check to see whether self.persist_directory still exists before calling Chroma.from_documents() (it doesn't). I also created the directory and manually set its permissions, but neither fixed the issue.

umair313 commented 9 months ago

@gracewzhang what's your ChromaDB version?

gracewzhang commented 9 months ago

@umair313 0.4.18

umair313 commented 9 months ago

Try 0.4.17 or 0.4.15. Make sure your internet connection is stable. Change the name of the persistence directory. Are you using a notebook?

gracewzhang commented 9 months ago

Just tried with both 0.4.17 and 0.4.15, and also changed the name of the persistence directory, but I'm still running into the same issue. I'm not using a notebook either. It's weird because calling __get_new_client() from __setup_client() works, but calling it from reingest() doesn't. I printed the directory at the start of each call and it's the same too.

Bengt commented 8 months ago

This issue popped up out of nowhere and to fix it, I had to downgrade ChromaDB to 0.4.14:

venv/bin/python -m pip install --upgrade chromadb==0.4.14

This is weird, because I was definitely using ChromaDB later than the version from October the 10th for some time and the issue only occurred recently:

https://pypi.org/project/chromadb/#history

Maybe this is caused by a transitive dependency of ChromaDB being upgraded in the meantime?
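One way to test the transitive-dependency theory is to record the installed versions before and after a downgrade and compare them. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `declared_dependencies` are made up for illustration.

```python
from importlib import metadata
from typing import List, Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def declared_dependencies(package: str) -> List[str]:
    """Return the requirement strings `package` declares, or [] if unavailable."""
    try:
        return list(metadata.requires(package) or [])
    except metadata.PackageNotFoundError:
        return []


if __name__ == "__main__":
    print("chromadb:", installed_version("chromadb"))
    for requirement in declared_dependencies("chromadb"):
        print("  requires:", requirement)
```

Saving this output in a working and a failing environment and diffing the two would show whether anything other than chromadb itself changed.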

gracewzhang commented 8 months ago

@Bengt This did the trick, thanks!

tadeodonegana commented 7 months ago

Just ran into the same error, and downgrading ChromaDB to 0.4.14 as @Bengt suggested solved the issue. Thanks!

varunvohra94 commented 7 months ago

Just ran into this issue, and the downgrade to 0.4.14 suggested by @Bengt fixed it.

carlos-chinchilla commented 6 months ago

Just ran into this issue and installing the 0.4.14 version didn't fix it...

Saniya327 commented 6 months ago

Ran into this issue and installing the 0.4.14 version didn't fix it. Any more ideas?

pseudotensor commented 6 months ago

Seeing this too sometimes.

fahaisouxun commented 5 months ago

I found that when I create a Chroma database with a new name the first time, it works. But if I delete the database directory from my Google Drive filesystem and try to recreate a database with the same name, I get this error when I try to add documents to it. Did anyone else notice the same kind of behaviour? For context, I am using Google Colab notebook which writes to Chroma database saved on my Google Drive.

moaaztaha commented 5 months ago

I encountered the same issue locally on version 0.4.24, but downgrading to 0.4.14 resolved it.

Fuehnix commented 5 months ago

I traced this issue down to some odd behavior in the sqlite3 backend. It seems to occur when you reuse a persist directory to recreate a stored vector store and run the code multiple times in the same session.

tl;dr: restart your Jupyter notebook and do what you can to clear anything that might be holding an "active data connection".

Chroma version 0.4.24 works for me; the version was not the problem (at least in my case).

my code:

import shutil
from pathlib import Path
from typing import List

# Import paths assumed from the langchain_community traceback earlier in this thread.
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document


def create_index_from_documents(collection_name, embedding_model, persist_directory, all_docs: List[Document], clear_persist_folder: bool = True):
    if clear_persist_folder:
        # Start from an empty persist directory.
        pf = Path(persist_directory)
        if pf.exists() and pf.is_dir():
            print(f"Deleting the content of: {pf}")
            shutil.rmtree(pf)
        pf.mkdir(parents=True, exist_ok=True)
        print(f"Recreated the directory at: {pf}")
    print("Generating and persisting the embeddings..")
    print(persist_directory)
    vectordb = Chroma.from_documents(
        collection_name=collection_name,
        documents=all_docs,
        embedding=embedding_model,
        persist_directory=persist_directory  # type: ignore
    )
    vectordb.persist()
    return vectordb

some simpler code which should also work:

import time

# collection_name, docs, and embed_model are assumed to be defined earlier.
recreate_db = False
persist_directory = "./chroma_db"
t1_start = time.perf_counter()
if recreate_db:
    # Build a new store from the documents and write it to disk.
    vectorstore = Chroma.from_documents(
        collection_name=collection_name, documents=docs, embedding=embed_model, persist_directory=persist_directory)
    vectorstore.persist()
else:
    # Reopen the store that already exists in persist_directory.
    vectorstore = Chroma(collection_name=collection_name, persist_directory=persist_directory, embedding_function=embed_model)
t1_stop = time.perf_counter()
print("elapsed time:", t1_stop - t1_start)

Be sure to restart your Jupyter notebook each time you experiment with a fix. I believe the error may come from trying to write to the database while there is still an active connection to it (even if you deleted the DB, something in your Jupyter notebook may still be pointing at it), which is what leaves you with read-only access.
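The "stale connection" theory can be explored outside Chroma with the standard sqlite3 module alone. The sketch below is not Chroma's code: it keeps one connection open, moves the database directory out from under it, and then attempts a write. It assumes a POSIX filesystem and SQLite's default rollback-journal mode; what it returns on your platform is something to observe rather than take for granted. The name `write_after_move` is hypothetical.

```python
import os
import sqlite3
import tempfile
from typing import Optional


def write_after_move() -> Optional[str]:
    """Return the SQLite error raised when writing after the file moved, or None."""
    with tempfile.TemporaryDirectory() as root:
        old_dir = os.path.join(root, "store")
        os.mkdir(old_dir)
        conn = sqlite3.connect(os.path.join(old_dir, "demo.sqlite3"))
        try:
            conn.execute("CREATE TABLE t (x INTEGER)")
            conn.commit()  # make sure the file has real content on disk
            # Move the directory while the connection is still open.
            os.rename(old_dir, os.path.join(root, "store_moved"))
            try:
                conn.execute("INSERT INTO t VALUES (1)")
                conn.commit()
            except sqlite3.OperationalError as exc:
                return str(exc)
            return None
        finally:
            conn.close()


if __name__ == "__main__":
    print(write_after_move())
```

If this prints the same "attempt to write a readonly database" message, that would be consistent with a cached connection still pointing at a deleted or renamed directory, which would explain why restarting the process helps.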

jtlz2 commented 3 months ago

@pseudotensor Only sometimes?

jtlz2 commented 3 months ago

@Fuehnix You are right. Thank you!

Worked for me. Do you think one could close all active connections programmatically ahead of attempting the persist?

Thanks again

jtlz2 commented 3 months ago

@Fuehnix Except now I run into trying to append to a partially-populated db... (entirely separate issue)

boulbi777 commented 3 months ago

I just updated Chroma to the latest version and the error disappeared for me (0.4.22 -> 0.5.0).

$ poetry add chromadb@latest

Feel free to update with pip if you're not using poetry.

jonz-tech commented 3 months ago

After half a day of reading the code :( here is the correct way to rebuild a Chroma database!

Version: chromadb==0.5.0

  1. Create the Chroma client and collection:

import chromadb
from chromadb.config import Settings

def create_client(self):
    # Create the client only once and reuse it afterwards.
    if self.db_client is None:
        local_db_path = self.get_db_file_path()
        settings = Settings()
        settings.persist_directory = local_db_path
        settings.is_persistent = True
        settings.allow_reset = True  # needed for db_client.reset() in step 2
        self.db_client = chromadb.Client(settings=settings)
        print(f"create db_client: {self.db_client}")

def create_collection(self):
    self.db_collection = self.db_client.get_or_create_collection(name=self.db_name)
    print(f"create db_collection: {self.db_collection}")

  2. Delete the folder and cache in the correct way:

if self.db_collection:
    self.db_client.delete_collection(name=self.db_name)
    self.db_collection = None
    print('delete_collection success')

if self.db_client:
    result = self.db_client.reset()
    self.db_client.clear_system_cache()  # very important
    self.db_client = None
    print(f"remove and reset db_client success: {result}")

# delete your persist_directory and create it again

Then repeat step 1 to rebuild chroma.sqlite3. Adding a small delay in step 1 may work even better.
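The teardown order described above (reset the client, clear the cache, and only then touch the directory) can be wrapped in one helper. This is a minimal sketch, not an official Chroma recipe: `reset_persist_directory` is a hypothetical name, and `client` is only assumed to expose the reset() and clear_system_cache() calls used in step 2.

```python
import os
import shutil


def reset_persist_directory(client, persist_directory: str) -> None:
    """Release a client's state, then recreate an empty persist directory.

    `client` is assumed to provide reset() and clear_system_cache(), as in
    the steps above; reset() only works when allow_reset was enabled.
    """
    if client is not None:
        client.reset()
        client.clear_system_cache()
    # Only touch the directory after the client has let go of it.
    shutil.rmtree(persist_directory, ignore_errors=True)
    os.makedirs(persist_directory, exist_ok=True)
```

After calling it, the old client object should be discarded (for example by setting self.db_client to None) and a new one created as in step 1.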

niraj-khatiwada commented 3 months ago

I'm getting this issue as well, but it happens randomly. If I try to create the dataset in a persistent directory that previously existed but was removed manually via shutil in Python, I get this error. But if I keep trying, it works after a few tries.

EDIT: Downgrading to ==0.4.14 seems to fix it.

gnumoksha commented 2 months ago

I got the same error in a Jupyter notebook and renaming the persistence directory did work.

chrispy-snps commented 1 month ago

I hit this bug when writing a vector store to a temporary directory before renaming it to its final directory. Here's the code:

#!/usr/bin/env python
import chromadb
import numpy as np
import os

def write_vs(
    documents: list[str],
    output_dir: str,
):
    print(f"Getting ready to create vector store at '{output_dir}'...")

    # create the new vector store at a temporary path
    # (so nobody accesses it until it's ready)
    new_temp_dir = "vs_temp"
    print(f"Creating temporary vector store at '{new_temp_dir}' (to be renamed to '{output_dir}')...")
    vs_db = chromadb.PersistentClient(new_temp_dir)
    collection = vs_db.get_or_create_collection(name="test")

    # write "documents" to vector store
    ids = documents
    embeddings = [list(np.random.normal(size=16)) for doc in documents]
    print(f"Adding {len(ids)} documents to '{new_temp_dir}'...")
    collection.upsert(ids=ids, documents=ids, embeddings=embeddings)

    # rename the temporary vector store to its final path
    os.rename(new_temp_dir, output_dir)
    print(f"Renamed '{new_temp_dir}' to '{output_dir}'.")
    print("")

# write vector store #1 to vs_1/
write_vs(documents=["abc"], output_dir="vs_1")

# write vector store #2 to vs_2/
write_vs(documents=["xyz"], output_dir="vs_2")

When the script attempts to write the second vector store to the same temporary directory vs_temp (even though that directory no longer exists), the following error occurs:

$ rm -rf vs_*; bug.py
Getting ready to create vector store at 'vs_1'...
Creating temporary vector store at 'vs_temp' (to be renamed to 'vs_1')...
Adding 1 documents to 'vs_temp'...
Renamed 'vs_temp' to 'vs_1'.

Getting ready to create vector store at 'vs_2'...
Creating temporary vector store at 'vs_temp' (to be renamed to 'vs_2')...
Adding 1 documents to 'vs_temp'...
Traceback (most recent call last):
  File "/path/to/bug.py", line 34, in <module>
    write_vs(documents=["xyz"], output_dir="vs_2")
  ...omitted...
  File "/path/to/python3.12/site-packages/chromadb/db/mixins/embeddings_queue.py", line 180, in submit_embeddings
    results = cur.execute(sql, params).fetchall()
              ^^^^^^^^^^^^^^^^^^^^^^^^
sqlite3.OperationalError: attempt to write a readonly database
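Since several reports in this thread say that a fresh directory name avoids the error, one possible workaround for the write-then-rename pattern is to give every run its own staging directory. The sketch below assumes that the failure is tied to reusing the same path within one process, which the thread suggests but does not confirm. `build_then_publish` and its `build` callback are hypothetical names; `build` stands for whatever creates the store, for example a function that opens chromadb.PersistentClient on the path it receives.

```python
import os
import tempfile
from typing import Callable


def build_then_publish(build: Callable[[str], None], output_dir: str) -> str:
    """Run `build` in a uniquely named staging directory, then rename it to `output_dir`."""
    parent = os.path.dirname(os.path.abspath(output_dir))
    # mkdtemp creates a directory with a random, currently unused name.
    staging_dir = tempfile.mkdtemp(prefix="vs_temp_", dir=parent)
    build(staging_dir)
    # Fails if output_dir already exists and is not empty.
    os.rename(staging_dir, output_dir)
    return output_dir
```

Staging next to the final location keeps the rename on the same filesystem. Any client opened inside `build` would still hold a connection to the renamed files, so it should not be reused for further writes.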