langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License
93.1k stars 14.97k forks

Issue: Chromadb document deletion not working #4880

Open anant-patankar opened 1 year ago

anant-patankar commented 1 year ago

Issue you'd like to raise.

I am trying to delete a single document from Chroma db using the following code:

chroma_db = Chroma(persist_directory = embeddings_save_path, embedding_function = OpenAIEmbeddings(model = os.getenv("EMBEDDING_MODEL_NAME"), chunk_size = 1, max_retries = 5) )

chroma_db._collection.delete(ids = list_of_ids)
chroma_db.persist()

However, the document is not actually being deleted. After loading/re-loading the chroma db from local, it is still showing the document in it.

I have tried the following things to fix the issue:

I have made sure that the list of ids is correct. I have tried deleting the document multiple times. I have tried restarting the Chroma db server. None of these things have worked.

I am not sure why the document is not being deleted. I would appreciate any help in resolving this issue.

Thanks, Anant Patankar

Suggestion:

No response

votrumar commented 1 year ago

🤖 AI-generated response by Steercode - chat with Langchain codebase Disclaimer: SteerCode Chat may provide inaccurate information about the Langchain codebase.

It seems like you are trying to delete a document from the Chroma collection using the _collection.delete() method. However, the proper method to delete a document from the Chroma collection is delete_document(). You can use this method as follows:

chroma_db = Chroma(persist_directory=embeddings_save_path,
                  embedding_function=OpenAIEmbeddings(model=os.getenv("EMBEDDING_MODEL_NAME"),
                                                      chunk_size=1,
                                                      max_retries=5)
                  )

# Delete the document
for document_id in list_of_ids:
    chroma_db.delete_document(document_id)

# Persist the changes
chroma_db.persist()

Make sure to replace list_of_ids with the list of document IDs you want to delete. This should properly delete the documents from the Chroma collection.

You can continue the conversation at https://steercode.com/

qingtian1771 commented 1 year ago

I have the same question. @anant-patankar, did you solve the problem?

emorling commented 1 year ago

Ensure the ids in the list are strings. The signature of the delete method is:

def delete(self, ids: Optional[List[str]] = None, **kwargs: Any) -> None:
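A minimal sketch of that check, assuming the ids were collected upstream as integers or other non-string values. `normalize_ids` is a hypothetical helper, not part of LangChain or Chroma; it only converts each id to a string before the list is handed to `delete`. Note that a converted id still has to match the string the document was stored under.

```python
from typing import Any, Iterable, List


def normalize_ids(ids: Iterable[Any]) -> List[str]:
    """Return the ids as a list of strings, dropping None entries."""
    return [str(i) for i in ids if i is not None]


# Hypothetical usage with the store from the original post:
# chroma_db.delete(ids=normalize_ids(list_of_ids))
print(normalize_ids([1, "2", None, 3]))  # prints ['1', '2', '3']
```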

corinnamk commented 11 months ago

The steercode solution does not work. It gives me the following error: AttributeError: 'Chroma' object has no attribute 'delete_document'

MahibArnob commented 9 months ago

Just replace the function name delete_document with delete; there is no delete_document function in the Chroma class. This is the code of the delete function inside the Chroma class:

def delete(self, ids: Optional[List[str]] = None, **kwargs: Any) -> None:
    """Delete by vector IDs.

    Args:
        ids: List of ids to delete.
    """
    self._collection.delete(ids=ids)

ernkoch commented 9 months ago

Hi everyone, chiming in on this,

I tried what you suggested and used collection.delete( ids=collection.get().get('ids') ), but when looking at the SQLite database, I can still see the entries in the table 'embedding_fulltext_search_data'.

I also tried chroma_client.delete_collection("test"). The data is no longer returned by collection.get(), but it still seems to remain in the database (table 'embedding_fulltext_search_data'), which eats up a lot of memory when deleting frequently.

Is there a way to completely remove ids and their corresponding data from the database, or to completely remove an entire collection?

Thank you, Cornelius

jeromejosephraj commented 8 months ago

Hi - has anyone found a solution yet? I'm facing the same issue.

giacomochiarella commented 7 months ago

Same issue here. I'm calling the endpoint api/v1/collections/ with the DELETE method, but it only deletes the entry in the collections table; all the documents, metadata and embedding_fulltext_search* rows are still in the SQLite database.
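To check which tables still hold rows after a delete, one option is to count rows per table directly in the SQLite file. The sketch below uses only the Python standard library; `table_row_counts` is a hypothetical helper, and the path in the usage comment is an assumption about where the persist directory lives. The file is opened read-only, and tables that the local SQLite build cannot query (for example full-text tables when the matching extension is missing) are reported as None.

```python
import sqlite3
from pathlib import Path
from typing import Dict, Optional


def table_row_counts(db_path: str) -> Dict[str, Optional[int]]:
    """Map each table name in the SQLite file to its row count."""
    # Open read-only so that inspecting the file can never modify it.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        counts: Dict[str, Optional[int]] = {}
        for name in names:
            quoted = '"' + name.replace('"', '""') + '"'
            try:
                counts[name] = conn.execute(
                    f"SELECT COUNT(*) FROM {quoted}"
                ).fetchone()[0]
            except sqlite3.Error:
                counts[name] = None  # table not readable with this SQLite build
        return counts
    finally:
        conn.close()


# Hypothetical usage; the path is an assumption:
# for table, rows in table_row_counts("./chroma_db/chroma.sqlite3").items():
#     print(table, rows)
```

Running it before and after a delete shows which tables actually shrink and which ones keep their rows.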

MrAnayDongre commented 6 months ago

See the code below. It works after the Chroma update that switched from DuckDB to SQLite. Deletion is based on the file name.

import chromadb
import os

from langchain.vectorstores import Chroma

persist_directory = "Database\\chroma_db\\"+"test3"
if not os.path.exists(persist_directory):
    os.makedirs(persist_directory)

# Get the Chroma DB object
chroma_db = chromadb.PersistentClient(path=persist_directory)
collection = chroma_db.get_collection(name="langchain")

# Get the metadata list
metadata_list = collection.get()['metadatas']
print(metadata_list)

# To get the filenames from the metadata, where the filename is stored inside the dictionary as the index - 'source'
file_names = []
for metadata in metadata_list:
    filename = metadata['source'].split('\\')[-1]
    if filename not in file_names:
        file_names.append(filename)

print(file_names)

# the below will fetch the ids of the all the document chunks from this doc1 file name
print(collection.get(where={"source": "uploads\\temp_files\\doc1.pdf"})['ids'])

# the below deletes all the chunks from the doc1 file
collection.delete(
    where={"source": "uploads\\temp_files\\doc1.pdf"}
)

giacomochiarella commented 6 months ago

I strongly suspect that this will not delete the documents stored in the embedding_fulltext_search* tables, because these tables contain no ids that would allow filtering by collection id or document id. For example, embedding_fulltext_search has just one column, which is the document itself, without any ids.

MrAnayDongre commented 6 months ago

If you want to delete documents by IDs, consider the following code, which worked perfectly for me. Provide a document/file name; based on that, it collects the IDs associated with that file name and deletes them.

import os

import chromadb
from chromadb.config import Settings

persist_directory = "./testing"
if not os.path.exists(persist_directory):
    os.makedirs(persist_directory)

# Get the Chroma DB object
chroma_db = chromadb.PersistentClient(path=persist_directory, settings=Settings(allow_reset=True))
collection = chroma_db.get_collection(name="docs_store_v2")

# Function to delete documents by IDs
def delete_documents(ids):
    if ids:
        # Delete the documents by IDs
        collection.delete(ids=ids)
        print("Documents have been deleted from the collection.")
    else:
        print("No documents found with the given filename.")

# Function to prompt user for resetting the database
def reset_database(client):
    confirm = input("Do you want to reset the database? (y/n): ")
    if confirm.lower() == 'y':
        client.reset()
        print("Database has been reset.")
    else:
        print("Database reset cancelled.")

# Function to prompt user for deleting the collection
def delete_collection(client):
    confirm = input("Do you want to delete the collection? (y/n): ")
    if confirm.lower() == 'y':
        collection_name = collection.name
        client.delete_collection(name=collection_name)
        print(f"Collection '{collection_name}' has been deleted.")
    else:
        print("Collection deletion cancelled.")

# Get all documents in the collection
db_data = collection.get()

# Extract metadata
metadatas = db_data['metadatas']
ids = db_data['ids']

# Display all source file names present in the collection
print("Source file names present inside the collection:")
source_file_names = set(metadata.get('source') for metadata in metadatas)
for source_file_name in source_file_names:
    print("- " + source_file_name)

# Get the filename from the user
filename = input("\nEnter the filename you want to delete (e.g., 'example.txt'): ")

# Find document IDs with matching filename
ids_to_delete = [id for id, metadata in zip(ids, metadatas) if metadata.get('source') == filename]

# Delete the documents with matching IDs
delete_documents(ids_to_delete)

# Print the updated list of files in the collection
print("\nUpdated files in the collection:")
updated_metadata_list = collection.get()['metadatas']
updated_file_names = set(metadata.get('source') for metadata in updated_metadata_list)
for updated_file_name in updated_file_names:
    print("- " + updated_file_name)

# Ask the user if they want to delete the collection
delete_collection(chroma_db)

# Ask the user if they want to reset the database
reset_database(chroma_db)

dieharders commented 5 months ago

Hello, I am also worried about this bug. I followed the above and removed my documents with collection.delete(ids=ids), but I am still seeing db data, as well as the folder that is created when embeddings are made (it has files like data_level0.bin and link_lists.bin; I don't know what these are).

I am noticing embedding data left behind as plain-text records in embedding_fulltext_search and related tables. All the other document data has been removed, but this still remains. The folder with db files also remains.

Everything else in the db seems to be removed successfully except these two things.

Edit: Forgot to mention I am on v0.4.24. I am also using this in conjunction with Llama-index, if that makes any difference.

Edit: Sorry, I just realized this is the langchain repo and not the ChromaDB one. I will seek assistance there.

ALIYoussef commented 5 months ago

Any update? It seems that Chroma DB still includes deleted data, or keeps it as None values!

maurovitaleBH commented 5 months ago

@ALIYoussef same problem here. After I delete a document and then retrieve relevant documents, some of the returned documents are None, and they correspond to the old deleted documents.

chrispy-snps commented 2 months ago

This is a serious issue for us. We are trying to delete outdated documents and replace them with updated documents in an active vector store.

The following bug.py script deletes and adds 5000 documents to a vector store:

#!/usr/bin/env python
import chromadb
import numpy as np
import subprocess
import time

NUM_DOCS = 5_000
EMBEDDING_SIZE = 1000
VS_PATH = "./vs_test"

# disk usage in human readable format (e.g. '2,1GB')
du = lambda path: subprocess.check_output(["du", "-sh", path]).split()[0].decode("utf-8")

# create/open the vector store
client = chromadb.PersistentClient(VS_PATH)
collection = client.get_or_create_collection(name="test")

# delete existing documents
ids = collection.get()["ids"]
print(f"Deleting {len(ids)} existing documents...")
start_time = time.time()

if ids:
    collection.delete(ids=ids)

print(f"{collection.count()} documents after deletion.")
end_time = time.time()
print(f"Document deletion runtime: {round(end_time - start_time)} seconds")

# add new documents
ids = [str(id) for id in range(NUM_DOCS)]
embeddings = [list(np.random.normal(size=EMBEDDING_SIZE)) for id in ids]
print(f"Adding {len(ids)} documents...")
start_time = time.time()

collection.upsert(ids=ids, documents=ids, embeddings=embeddings)

end_time = time.time()
print(f"{collection.count()} documents after addition.")
print(f"Document addition runtime: {round(end_time - start_time)} seconds")

# print on-disk size
print(f"Vector store size: {du(VS_PATH)}")
print("")

When I run this five times:

rm -rf ./vs_test/ && \
  ./bug.py && \
  ./bug.py && \
  ./bug.py && \
  ./bug.py && \
  ./bug.py

I get increasing addition runtimes and on-disk vector store sizes:

Deleting 0 existing documents...
0 documents after deletion.
Document deletion runtime: 0 seconds
Adding 5000 documents...
5000 documents after addition.
Document addition runtime: 8 seconds
Vector store size: 44M

Deleting 5000 existing documents...
0 documents after deletion.
Document deletion runtime: 2 seconds
Adding 5000 documents...
5000 documents after addition.
Document addition runtime: 9 seconds
Vector store size: 86M

Deleting 5000 existing documents...
0 documents after deletion.
Document deletion runtime: 2 seconds
Adding 5000 documents...
5000 documents after addition.
Document addition runtime: 19 seconds
Vector store size: 129M

Deleting 5000 existing documents...
0 documents after deletion.
Document deletion runtime: 2 seconds
Adding 5000 documents...
5000 documents after addition.
Document addition runtime: 27 seconds
Vector store size: 171M

Deleting 5000 existing documents...
0 documents after deletion.
Document deletion runtime: 2 seconds
Adding 5000 documents...
5000 documents after addition.
Document addition runtime: 30 seconds
Vector store size: 214M

On our production vector store with ~55k documents, the document addition time grew to 11 minutes and the on-disk size grew to 4.2 GB after several deletion/addition cycles.

We're using Chroma 0.5.4 and SQLite3 3.39.4.
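One general SQLite behaviour may be relevant to the on-disk growth: with the default settings, deleting rows does not shrink the database file. Freed pages are kept for reuse until a VACUUM rebuilds the file. The sketch below demonstrates this on a throwaway SQLite file using only the standard library. It does not touch Chroma, and whether it accounts for the growth reported above is an assumption, since the vector index is stored in separate files and row deletion itself may be incomplete. `file_sizes_around_vacuum` is a hypothetical name.

```python
import os
import sqlite3
import tempfile


def file_sizes_around_vacuum(n_rows: int = 2000) -> dict:
    """Fill a scratch SQLite file, delete every row, then VACUUM it."""
    path = os.path.join(tempfile.mkdtemp(), "scratch.sqlite3")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO docs VALUES (?, ?)",
        ((str(i), "x" * 1000) for i in range(n_rows)),
    )
    conn.commit()
    sizes = {"filled": os.path.getsize(path)}

    # Deleting rows frees pages inside the file but keeps the file itself.
    conn.execute("DELETE FROM docs")
    conn.commit()
    sizes["after_delete"] = os.path.getsize(path)

    # VACUUM rewrites the database without the free pages.
    conn.execute("VACUUM")
    sizes["after_vacuum"] = os.path.getsize(path)
    conn.close()
    return sizes


print(file_sizes_around_vacuum())
```

Running VACUUM against a live chroma.sqlite3 has not been tested here; anyone trying it should work on a copy of the persist directory with no client connected.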

jczic commented 1 month ago

Hello all!

I had the same problem in production and it was very serious for our company!

We add collections with many vectors/documents and update them very often. The problem is that, if you take a closer look at the SQLite3 database, all the deleted information with deleted links (foreign keys) keeps adding up, so the DB keeps getting bigger and bigger. In a short space of time we reached over 13 GB in the ChromaDB database folder and the server memory was exploding!

After testing numerous approaches, I found a strange, and temporary, solution. Here it is:

ids = chromaColl.get()['ids']
if ids:
    chromaColl.delete(ids)
del chromaColl
_chromadb.delete_collection(collectionName)

Why is it absolutely necessary to call these 2 deletions in order to empty the data correctly?

Thank you !

chrispy-snps commented 1 month ago

@jczic - thanks very much for sharing this! Does this allow the documents to be deleted and refreshed while there are active connections (with the understanding that those connections have a brief window of reduced data)?

jczic commented 1 month ago

@chrispy-snps I don't know, but in any case it allows you to delete a collection correctly. You need to recreate it afterwards if you want to update everything.
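The delete-then-recreate sequence can be wrapped in one function. This is a sketch under assumptions: `drop_and_recreate` is a hypothetical helper, `client` is expected to be a chromadb client (for example a `PersistentClient`), and the collection must already exist. Collection metadata and any custom embedding function are not carried over to the new collection, and behaviour with other active connections is not addressed.

```python
from typing import Any


def drop_and_recreate(client: Any, name: str) -> Any:
    """Empty a collection, drop it, and return a fresh one with the same name."""
    collection = client.get_collection(name=name)
    ids = collection.get()["ids"]
    if ids:
        collection.delete(ids=ids)       # first deletion: the records
    client.delete_collection(name=name)  # second deletion: the collection itself
    return client.get_or_create_collection(name=name)


# Hypothetical usage; the path and collection name are assumptions:
# import chromadb
# client = chromadb.PersistentClient(path="./vs_test")
# collection = drop_and_recreate(client, "test")
```

Taking the client as a parameter keeps the helper independent of how the client was created, so it can also be exercised against a stub in tests.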

You can also try opening the chroma.sqlite3 files with this great open-source software: SQLiteBrowser.ORG to browse the data organised by ChromaDB :)

jczic commented 1 month ago

This also seems to stem from the way Chroma is used, particularly in multithreaded/asynchronous mode. (see https://github.com/chroma-core/chroma/issues/1908)

For my part, I don't use asyncio or FastAPI in Python, but I'm in production with MicroWebSrv2 🙂