langchain-ai / langchain

Langchain Chroma doesn't handle duplicate entry properly #24005

Open mou23 opened 4 months ago

mou23 commented 4 months ago

Example Code

import chromadb
from langchain_chroma.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.documents import Document

# Create the collection up front so the distance metric can be set.
client = chromadb.Client()
collection = client.create_collection(
    name="my_collection",
    metadata={"hnsw:space": "cosine"},
)
embedding_function = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2",
    model_kwargs={"device": "cuda"},
)

# Wrap the existing collection with the LangChain vector store.
vector_store = Chroma(
    client=client,
    collection_name="my_collection",
    embedding_function=embedding_function,
)

documents = [
    Document(
        id="1",
        page_content="This is a document about fruit",
        metadata={"title": "First Doc"},
    ),
    Document(
        id="2",
        page_content="This is a document about oranges",
        metadata={"title": "Second Doc"},
    ),
    Document(
        id="3",
        page_content="I saw a lady wearing red dress",
        metadata={"title": "Third Doc"},
    ),
    Document(
        id="4",
        page_content="Apples are red",
        metadata={"title": "Fourth Doc"},
    ),
]

vector_store.add_documents(documents)

print(vector_store._collection.get(include=["documents"]))
print("db size ", vector_store._collection.count())

# Add a document whose id and content already exist in the store.
duplicate_document = [
    Document(
        id="1",
        page_content="This is a document about fruit",
        metadata={"title": "First Doc"},
    )
]
vector_store.add_documents(duplicate_document)

print(vector_store._collection.get(include=["documents"]))
print("db size ", vector_store._collection.count())

Error Message and Stack Trace (if applicable)

No response

Description

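langchain-chroma adds a duplicate entry to the db when a document with an already existing id is added again, whereas chromadb doesn't add the duplicate entry. So the behavior isn't the same for langchain-chroma and chromadb. For comparison, the same steps using chromadb directly: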

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.Client()
embedder = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2",
    device="cuda",
)
collection = client.create_collection(
    name="my_collection",
    embedding_function=embedder,
    metadata={"hnsw:space": "cosine"},
)

collection.add(
    documents=[
        "This is a document about fruit",
        "This is a document about oranges",
        "I saw a lady wearing red dress",
        "Apples are red",
    ],
    ids=["1", "2", "3", "4"],
    metadatas=[
        {"title": "First Doc"},
        {"title": "Second Doc"},
        {"title": "Third Doc"},
        {"title": "Fourth Doc"},
    ],
)

print(collection.get(include=["documents"]))
print("db size ", collection.count())

# Add an entry whose id already exists in the collection.
collection.add(
    documents=[
        "This is a document about fruit",
    ],
    ids=["1"],
    metadatas=[{"title": "First Doc"}],
)

print(collection.get(include=["documents"]))
print("db size ", collection.count())
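To make the reported difference concrete without needing a GPU or either library, here is a toy model of it. Everything in it is hypothetical: `ToyCollection`, `add_forwarding_ids` and `add_generating_ids` are illustration names, not part of chromadb or langchain-chroma. It only assumes what the description above states (adding an existing id does not create a second entry) and shows one way a wrapper could still end up with a duplicate, namely by not forwarding the caller's ids and generating fresh ones instead. Whether that is the actual cause here is an assumption, not something this issue confirms.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ToyCollection:
    """Hypothetical in-memory stand-in for a collection keyed by id."""

    rows: dict = field(default_factory=dict)

    def add(self, ids, documents):
        # An id that is already present is skipped, so no duplicate appears.
        for doc_id, text in zip(ids, documents):
            self.rows.setdefault(doc_id, text)

    def count(self):
        return len(self.rows)


def add_forwarding_ids(collection, docs):
    """Hypothetical wrapper that passes each document's own id through."""
    collection.add([d["id"] for d in docs], [d["text"] for d in docs])


def add_generating_ids(collection, docs):
    """Hypothetical wrapper that ignores the given ids and invents new ones."""
    collection.add([str(uuid.uuid4()) for _ in docs], [d["text"] for d in docs])


docs = [
    {"id": "1", "text": "This is a document about fruit"},
    {"id": "2", "text": "This is a document about oranges"},
    {"id": "3", "text": "I saw a lady wearing red dress"},
    {"id": "4", "text": "Apples are red"},
]
duplicate = [{"id": "1", "text": "This is a document about fruit"}]

forwarding = ToyCollection()
add_forwarding_ids(forwarding, docs)
add_forwarding_ids(forwarding, duplicate)
forwarding_size = forwarding.count()  # the repeated id "1" is skipped

generating = ToyCollection()
add_generating_ids(generating, docs)
add_generating_ids(generating, duplicate)
generating_size = generating.count()  # the same text is stored under a new id

print("forwarding ids, db size ", forwarding_size)
print("generating ids, db size ", generating_size)
```

In this sketch the store with forwarded ids stays at four entries, while the one with generated ids grows to five, which matches the two outcomes being compared in this report.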

System Info

Python version: 3.10.10

RuofanChen03 commented 1 month ago

Unable to replicate issue; seems fixed now. Please take a look if you still have this issue!
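For anyone still seeing duplicates on an older install, a possible workaround is to pass the ids explicitly instead of relying on `Document.id` being picked up. The helper below is a minimal sketch: `explicit_ids` is a hypothetical name, and the commented-out call assumes the installed version accepts an `ids` keyword on `add_documents`, which should be checked against that version.

```python
from types import SimpleNamespace


def explicit_ids(documents):
    """Return the ids of the documents, failing on missing or repeated ids."""
    ids = [getattr(doc, "id", None) for doc in documents]
    if any(doc_id is None for doc_id in ids):
        raise ValueError("every document needs an id")
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique within one batch")
    return ids


# Stand-ins for Document objects so the sketch runs without langchain.
sample_documents = [
    SimpleNamespace(id="1", page_content="This is a document about fruit"),
    SimpleNamespace(id="2", page_content="This is a document about oranges"),
]
sample_ids = explicit_ids(sample_documents)
print(sample_ids)

# With the vector store from the example code this would look like
# (assumption: the ids keyword is supported by the installed version):
# vector_store.add_documents(documents, ids=explicit_ids(documents))
```

Comparing `vector_store._collection.count()` before and after the second `add_documents` call, as in the example code, shows whether the duplicate is still being added.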