langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Deletion issue in Chroma Vectorstore & its info on seeing doc id and embeddings stored #4519

Closed prateekaffine closed 9 months ago

prateekaffine commented 1 year ago

System Info

I was able to fetch the document chunk IDs from Chroma DB, but I could not work out how to delete a specific document using its document name or document ID. I have gone through all the references and did not find a solution. Insertion and update are covered, but not deletion. Can you please help with this?

Extract id of document chunks in vector database

chroma_db.get()['ids']

Also, when we pass embeddings to Chroma, why do we see embeddings = None when I run chroma_db.get()? I could only see the document chunks and their IDs when I displayed the contents, so why are the embeddings for each document chunk not shown?

@hwchase17, it would be great if you could help with this. Thanks in advance.

Who can help?

@hwchase17

Information

Related Components

Reproduction

from langchain.document_loaders import DirectoryLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

documents_pdf = DirectoryLoader(directory_path, glob="*/.txt").load()

# doc_chunks: the loaded documents after splitting (that step is not shown in this report)
openai_embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
chroma_db = Chroma.from_documents(doc_chunks, embedding=openai_embeddings)
chroma_db.get()

All IDs in Chroma DB:

list_of_chunk_ids = chroma_db.get()['ids']
print(len(list_of_chunk_ids))

Expected behavior

  1. Solution on deletion
  2. What all elements are stored in Chroma DB

9akashnp8 commented 1 year ago

hey @prateekaffine , the Collection object does support deletion by ids. You could implement a custom Chroma vectorstore that has this functionality.

Eg:

from langchain.vectorstores import Chroma

class CustomChroma(Chroma):

    def delete(self, ids=None, where=None, where_document=None):
        # self._collection is the underlying chromadb Collection; its own
        # delete() validates the arguments and accepts ids and/or filters.
        self._collection.delete(ids=ids, where=where, where_document=where_document)

Reference: https://github.com/chroma-core/chroma/blob/main/chromadb/api/models/Collection.py#L306

prateekaffine commented 1 year ago

@9akashnp8 Thanks for sharing this, I will try this.

@hwchase17 Also, I checked and the embeddings are None in the vector store when using this operation. Any idea why? Or is something wrong with the way I am doing it?

vectorstore = Chroma.from_documents(
    doc_chunks,
    embeddings=OpenAIEmbeddings(model="text-embedding-ada-002", chunk_size=1),
    collection_name="my_collection",
    persist_directory="my_embedding_path",
)

Screenshot 2023-05-11 220328
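
On the embeddings = None question: chromadb's Collection.get() returns only documents and metadatas by default, and embeddings have to be requested through its include parameter. A minimal sketch, assuming the collection object behaves like chromadb's Collection; `fetch_with_embeddings` is a hypothetical helper name, not part of LangChain or Chroma:

```python
def fetch_with_embeddings(collection, ids=None):
    """Return documents and embeddings from a chromadb-style collection.

    `collection` is assumed to behave like chromadb's Collection, whose
    get() leaves embeddings out unless they are listed in `include`.
    """
    return collection.get(ids=ids, include=["documents", "embeddings"])
```

With the vector store above this would likely be called as fetch_with_embeddings(vectorstore._collection); `_collection` is a private attribute of the LangChain wrapper, so it may change between versions. Separately, Chroma.from_documents names its parameter `embedding` (as in the reproduction code in the original post), so the `embeddings=` keyword in the call above may not be doing what was intended.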

matardy commented 1 year ago

Suppose you have a different collection for each of your users, say collection-1 and collection-2.

In the end, each collection holds different chunks with unique IDs.

Now user 2 wants to delete doc3.pdf, and of course the embeddings from doc3.pdf must be deleted too. Chroma does not let you delete embeddings by document ID, so to work around this you need to store those IDs in a database: the ID for doc3.pdf together with the IDs of its chunks.

If the user deletes doc3.pdf, you know its ID; with that ID you can look up the chunk IDs, and with the chunk IDs you can delete those chunks from your collection.

That's the way I solved that problem.
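
The bookkeeping described above could be sketched roughly like this, using SQLite from the standard library. The table layout and the names `ChunkRegistry`, `register`, and `pop_chunk_ids` are made up for illustration and are not taken from the original setup:

```python
import sqlite3


class ChunkRegistry:
    """Remembers which chunk IDs belong to which uploaded document."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS chunks ("
            "collection TEXT, doc_id TEXT, chunk_id TEXT, "
            "PRIMARY KEY (collection, chunk_id))"
        )

    def register(self, collection, doc_id, chunk_ids):
        """Record the chunk IDs produced when a document was inserted."""
        self._db.executemany(
            "INSERT OR REPLACE INTO chunks VALUES (?, ?, ?)",
            [(collection, doc_id, chunk_id) for chunk_id in chunk_ids],
        )
        self._db.commit()

    def pop_chunk_ids(self, collection, doc_id):
        """Return the document's chunk IDs and forget them."""
        rows = self._db.execute(
            "SELECT chunk_id FROM chunks WHERE collection = ? AND doc_id = ? "
            "ORDER BY chunk_id",
            (collection, doc_id),
        ).fetchall()
        self._db.execute(
            "DELETE FROM chunks WHERE collection = ? AND doc_id = ?",
            (collection, doc_id),
        )
        self._db.commit()
        return [row[0] for row in rows]
```

At insert time you would call register("collection-2", "doc3.pdf", ids) with the IDs assigned to the chunks. When the user deletes doc3.pdf, the list returned by pop_chunk_ids("collection-2", "doc3.pdf") can be passed to the collection's delete(ids=...); skip the call if the list is empty.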

ankitku92 commented 1 year ago

I used the code below to delete from Chroma by document name, since the document path is stored in the metadata. I was testing different Chroma functions, so the code contains other things as well; take what you need. Note: I had used Chroma.from_documents to load the chunks into Chroma. This works after the Chroma update that switched from DuckDB to SQLite; I have not tested it on the old version. I checked that the document was deleted by fetching all the filenames with the method below, and the chunks of that file had indeed been removed, so I know it works.

import chromadb
import os

from langchain.vectorstores import Chroma

persist_directory = "Database\\chroma_db\\"+"test3"
if not os.path.exists(persist_directory):
    os.makedirs(persist_directory)

# Get the Chroma DB object
chroma_db = chromadb.PersistentClient(path=persist_directory)
collection = chroma_db.get_collection(name="langchain")

# Get the metadata list
metadata_list = collection.get()['metadatas']
print(metadata_list)

# To get the filenames from the metadata, where the filename is stored inside the dictionary as the index - 'source'
file_names = []
for metadata in metadata_list:
    filename = metadata['source'].split('\\')[-1]
    if filename not in file_names:
        file_names.append(filename)

print(file_names)

# the below will fetch the ids of the all the document chunks from this doc1 file name
print(collection.get(where={"source": "uploads\\temp_files\\doc1.pdf"})['ids'])

# the below deletes all the chunks from the doc1 file
collection.delete(
    where={"source": "uploads\\temp_files\\doc1.pdf"}
)

raghujhts13 commented 1 year ago

@ankitku92 thank you. This worked for me too.

virdi16 commented 1 year ago

@ankitku92 thanks for sharing, it works

keiru517 commented 1 year ago

Thanks, @ankitku92. The code is working very well. Let's keep in touch.

deepak-habilelabs commented 1 year ago

def delete_document_embeddings_by_filename(file_path, persist_directory):
    chroma_db = chromadb.PersistentClient(path=persist_directory)
    print(chroma_db)
    collection = chroma_db.get_collection(name="langchain")
    print(collection)
    collection.delete(where={"source": file_path})

The output of the above code is:

<chromadb.api.segment.SegmentAPI object at 0x7f4948165280>
name='langchain' id=UUID('8a5e8fff-93a4-49f3-8be7-5aac47cb3902') metadata=None

And I am calling it like this:

persist_directory = f'/home/hs/CustomBot/chroma-databases/{formatted_project_name}'
file = '/home/hs/CustomBot/media/project/Code_of_Conduct_Policy.pdf'
delete_document_embeddings_by_filename(file, persist_directory)

I am still not able to delete the embeddings of a PDF from the embeddings folder.

ankitku92 commented 1 year ago

When you inserted the file into the vector DB, you would have used some logic and passed the file path. This file path gets stored in the metadata of the vector embeddings, under the 'source' key.

You need to use that exact file path. Otherwise the delete method will not work.

You can check the exact metadata of the files you have uploaded with the logic I showed above, below the comment - # To get the filenames from the metadata, where the filename is stored inside the dictionary as the index - 'source'

Once you have this, figure out which path format you are using and pass the same path to your function. I'm sure it will work.
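
A small sketch of that check: given the 'metadatas' list returned by collection.get(), it lists the exact stored 'source' strings for a given file name, so the value passed to delete() can be copied verbatim. `stored_sources_for` is a hypothetical helper, not part of Chroma or LangChain:

```python
import ntpath


def stored_sources_for(metadatas, filename):
    """Return the exact 'source' strings whose file name equals `filename`.

    `metadatas` is the list found under the 'metadatas' key of
    collection.get(). ntpath.basename splits on both '/' and '\\', so
    paths stored from Windows or Linux are handled alike.
    """
    matches = []
    for metadata in metadatas:
        source = (metadata or {}).get("source")
        if source and ntpath.basename(source) == filename and source not in matches:
            matches.append(source)
    return matches
```

For example, stored_sources_for(collection.get()['metadatas'], "Code_of_Conduct_Policy.pdf") shows whether the stored path is absolute, relative, or uses different separators than the one being passed to the delete function.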

deepak-habilelabs commented 1 year ago

def delete(self, ids: Optional[List[str]] = None, **kwargs: Any) -> None:
    """Delete by vector IDs.

    Args:
        ids: List of ids to delete.
    """
    self._collection.delete(ids=ids)

The delete method of the Chroma vector store in the LangChain framework (shown above) is designed to delete vectors by their IDs, not by the source file path.

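One way to bridge the two, sketched under the assumption that the underlying chromadb collection is reachable as `_collection` (as in the method above): look up the chunk IDs with a metadata filter first, then delete by ID. `delete_chunks_by_source` is a hypothetical helper, not part of LangChain or Chroma:

```python
def delete_chunks_by_source(collection, source):
    """Delete every chunk whose metadata 'source' equals `source`.

    Looks the chunk IDs up with a metadata filter first, then deletes by ID.
    Returns the list of deleted IDs (empty if nothing matched).
    """
    ids = collection.get(where={"source": source})["ids"]
    if ids:
        collection.delete(ids=ids)
    return ids
```

An empty return value means no stored 'source' matched the string exactly, which usually points to a path mismatch rather than a failed delete.
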
dosubot[bot] commented 9 months ago

Hi, @prateekaffine,

I'm helping the LangChain team manage their backlog and am marking this issue as stale. The issue you raised was regarding the inability to delete a specific document using its name or ID in Chroma Vectorstore, along with a question about why embeddings are not visible when using chroma_db.get(). The issue has been resolved with the help of code snippets and explanations provided by users @9akashnp8, @ankitku92, and @deepak-habilelabs. The discussion also included insights on how to delete embeddings from a document and the importance of using the exact file path stored in the metadata for successful deletion.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you for your understanding and contributions!