Closed prateekaffine closed 9 months ago
Hey @prateekaffine, the Collection object does support deletion by IDs. You could implement a custom Chroma vectorstore that has this functionality.
E.g.:
from langchain.vectorstores import Chroma

class CustomChroma(Chroma):
    def delete(self, ids=None, where=None, where_document=None):
        # Delegate to the underlying chromadb collection, which validates
        # ids, where and where_document itself before deleting.
        return self._collection.delete(ids=ids, where=where, where_document=where_document)
Reference: https://github.com/chroma-core/chroma/blob/main/chromadb/api/models/Collection.py#L306
@9akashnp8 Thanks for sharing this, I will try this.
@hwchase17 Also, I noticed that the embeddings are None in the vectorstore after the operation below. Any idea why, or is there something wrong with the way I am doing it?
vectorstore = Chroma.from_documents(doc_chunks, embeddings=OpenAIEmbeddings(model="text-embedding-ada-002", chunk_size=1), collection_name="my_collection", persist_directory="my_embedding_path")
Suppose you have a different collection for each of your users, say collection-1 and collection-2.
In each collection you will end up with different chunks, each with a unique ID.
Now user 2 wants to delete doc3.pdf, and of course the embeddings from doc3.pdf must be deleted too. Chroma does not let you delete embeddings by document ID, so to work around this you need to store the IDs in a database: the ID for doc3.pdf together with the IDs of its chunks.
If the user deletes doc3.pdf, you know its ID; with that ID you can look up the chunk IDs, and with the chunk IDs you can delete those chunks from your collection.
That's the way I solved that problem.
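The bookkeeping described above could be sketched roughly as follows. This is only an illustration under assumptions: the use of SQLite, the table name doc_chunks, and the helper names are hypothetical, and collection is assumed to be any object exposing delete(ids=...), as a chromadb collection does.

```python
import sqlite3


def init_db(conn):
    """Create the (hypothetical) table mapping a document to its chunk IDs."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS doc_chunks ("
        "doc_id TEXT NOT NULL, chunk_id TEXT NOT NULL, "
        "PRIMARY KEY (doc_id, chunk_id))"
    )


def record_chunks(conn, doc_id, chunk_ids):
    """Remember which chunk IDs were generated for a document at insert time."""
    conn.executemany(
        "INSERT OR IGNORE INTO doc_chunks (doc_id, chunk_id) VALUES (?, ?)",
        [(doc_id, chunk_id) for chunk_id in chunk_ids],
    )
    conn.commit()


def delete_document(conn, collection, doc_id):
    """Delete every chunk of doc_id from the collection and from the mapping."""
    rows = conn.execute(
        "SELECT chunk_id FROM doc_chunks WHERE doc_id = ? ORDER BY chunk_id",
        (doc_id,),
    ).fetchall()
    chunk_ids = [row[0] for row in rows]
    if chunk_ids:
        # Assumption: the collection accepts a list of IDs via delete(ids=...).
        collection.delete(ids=chunk_ids)
        conn.execute("DELETE FROM doc_chunks WHERE doc_id = ?", (doc_id,))
        conn.commit()
    return chunk_ids
```

record_chunks would be called right after the chunks are added to the collection, and delete_document when the user removes the file.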
I used the code below to delete from Chroma by document name, since the document path was being stored in the metadata. I was testing different Chroma functions, so you will find other things in the code as well; take what you need.
Note: I used Chroma.from_documents to load the chunks into Chroma. This works after the Chroma update that switched from DuckDB to SQLite; I have not tested it on the old version. I checked that the document was deleted by fetching all the filenames (shown below), and the chunks of that file had indeed been removed, so I know it works.
import chromadb
import os
from langchain.vectorstores import Chroma

persist_directory = "Database\\chroma_db\\"+"test3"
if not os.path.exists(persist_directory):
    os.makedirs(persist_directory)

# Get the Chroma DB object
chroma_db = chromadb.PersistentClient(path=persist_directory)
collection = chroma_db.get_collection(name="langchain")

# Get the metadata list
metadata_list = collection.get()['metadatas']
print(metadata_list)

# To get the filenames from the metadata, where the filename is stored inside the dictionary as the index - 'source'
file_names = []
for metadata in metadata_list:
    filename = metadata['source'].split('\\')[-1]
    if filename not in file_names:
        file_names.append(filename)
print(file_names)

# the below will fetch the ids of the all the document chunks from this doc1 file name
print(collection.get(where={"source": "uploads\\temp_files\\doc1.pdf"})['ids'])

# the below deletes all the chunks from the doc1 file
collection.delete(
    where={"source": "uploads\\temp_files\\doc1.pdf"}
)
@ankitku92 thank you. This worked for me too.
@ankitku92 thanks for sharing, it works
Thanks, @ankitku92. The code is working very well. Let's keep in touch.
def delete_document_embeddings_by_filename(file_path, persist_directory):
    chroma_db = chromadb.PersistentClient(path=persist_directory)
    print(chroma_db)
    collection = chroma_db.get_collection(name="langchain")
    print(collection)
    collection.delete(where={"source": file_path})

The output of the above code is:
<chromadb.api.segment.SegmentAPI object at 0x7f4948165280>
name='langchain' id=UUID('8a5e8fff-93a4-49f3-8be7-5aac47cb3902') metadata=None

And I am calling it like this:
persist_directory = f'/home/hs/CustomBot/chroma-databases/{formatted_project_name}'
file = '/home/hs/CustomBot/media/project/Code_of_Conduct_Policy.pdf'
delete_document_embeddings_by_filename(file, persist_directory)

I am still not able to delete the embeddings of a PDF from the embeddings folder.
When you inserted the file into the vector DB, you passed a file path through whatever loading logic you used. That file path is stored in the metadata of the vector embeddings, in the dictionary under the 'source' key.
You need to use that exact file path; otherwise the delete method will not match anything.
You can check the exact metadata of the files you have uploaded with the logic I showed above, below the comment - # To get the filenames from the metadata, where the filename is stored inside the dictionary as the index - 'source'
Once you have it, work out which path format you are storing and pass that same path to your function. I'm sure it will work.
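A small helper can make that check less error-prone. The sketch below is a hypothetical illustration (the function names are not from chromadb or LangChain): it takes the list returned under 'metadatas' by collection.get() and reports the exact stored 'source' strings whose filename matches the one you expect.

```python
import ntpath


def stored_sources(metadatas):
    """Return the distinct 'source' values found in a list of metadata dicts."""
    # Entries may be None or lack a 'source' key; those are skipped.
    return sorted({m["source"] for m in metadatas if m and "source" in m})


def matching_sources(metadatas, filename):
    """Return stored 'source' paths whose final path component equals filename."""
    # ntpath.basename splits on both '\\' and '/', so Windows-style and
    # POSIX-style stored paths are both handled on any platform.
    return [s for s in stored_sources(metadatas) if ntpath.basename(s) == filename]
```

For example, matching_sources(collection.get()['metadatas'], "Code_of_Conduct_Policy.pdf") would list the exact strings to pass in where={"source": ...}; an empty list means no chunk was stored under that filename.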
def delete(self, ids: Optional[List[str]] = None, **kwargs: Any) -> None:
    """Delete by vector IDs.

    Args:
        ids: List of ids to delete.
    """
    self._collection.delete(ids=ids)
The delete method of the Chroma vector store in the LangChain framework (shown above) is designed to delete vectors by their IDs, not by the source file path.
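One way to bridge the two approaches is to resolve the source path to chunk IDs first and then delete by ID. The following is a minimal sketch, not LangChain's own API: delete_by_source is a hypothetical helper, and store is assumed to be any object whose get(where=...) returns a dict with an 'ids' key and whose delete accepts ids=..., as a chromadb collection does. Whether the LangChain wrapper's get accepts where depends on the installed version.

```python
def delete_by_source(store, source):
    """Delete all chunks whose metadata 'source' equals source; return their IDs."""
    # Step 1: resolve the metadata filter to concrete chunk IDs.
    ids = store.get(where={"source": source})["ids"]
    # Step 2: delete by ID, which is the form an ID-only delete method accepts.
    if ids:
        store.delete(ids=ids)
    return ids
```

The returned list doubles as a check: if it is empty, nothing matched the given path and nothing was deleted.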
Hi, @prateekaffine,
I'm helping the LangChain team manage their backlog and am marking this issue as stale. The issue you raised was regarding the inability to delete a specific document using its name or ID in Chroma Vectorstore, along with a question about why embeddings are not visible when using chroma_db.get(). The issue has been resolved with the help of code snippets and explanations provided by users @9akashnp8, @ankitku92, and @deepak-habilelabs. The discussion also included insights on how to delete embeddings from a document and the importance of using the exact file path stored in the metadata for successful deletion.
Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.
Thank you for your understanding and contributions!
System Info
I was able to fetch the document chunk IDs from the Chroma DB, but I could not work out how to delete a specific document using its document name or document ID. I have gone through all the references and did not find a solution. Insertion and update are covered, but not deletion. Can you please help with this?
Extract the IDs of the document chunks in the vector database:
chroma_db.get()['ids']
I also have a question: when we pass embeddings to Chroma, why do we see embeddings = None when running chroma_db.get()? I could only see the document chunks and their IDs when I displayed the contents, so why is the embedding for each document chunk not shown?
@hwchase17, it would be great if you can help on this, Thanks in advance.
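On the embeddings = None question: chromadb's get() returns only documents and metadata by default, and embeddings have to be requested explicitly through its include argument. A minimal sketch follows, assuming collection is a chromadb collection (or another object whose get accepts include and limit); get_with_embeddings is a hypothetical helper name, and whether the LangChain wrapper forwards include depends on the installed version.

```python
def get_with_embeddings(collection, limit=None):
    """Fetch records, explicitly requesting embeddings with documents and metadata."""
    # IDs are always returned; everything else must be named in `include`.
    include = ["embeddings", "documents", "metadatas"]
    if limit is None:
        return collection.get(include=include)
    return collection.get(include=include, limit=limit)
```

The result's 'embeddings' entry should then hold one vector per chunk instead of None.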
Who can help?
@hwchase17
Reproduction
from langchain.document_loaders import DirectoryLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

documents_pdf = DirectoryLoader(directory_path, glob="*/.txt").load()

openai_embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
chroma_db = Chroma.from_documents(doc_chunks, embedding=openai_embeddings)
chroma_db.get()

All IDs in the Chroma DB:
list_of_chunk_ids = chroma_db.get()['ids']
print(len(list_of_chunk_ids))
Expected behavior