[X] I added a very descriptive title to this issue.
[X] I searched the LangChain documentation with the integrated search.
[X] I used the GitHub search to find a similar question and didn't find it.
[X] I am sure that this is a bug in LangChain rather than my code.
[X] The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
Checked other resources
Example Code
import chromadb
from langchain_chroma.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.documents import Document

client = chromadb.Client()
collection = client.create_collection(name="my_collection", metadata={"hnsw:space": "cosine"})
embedding_function = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2", model_kwargs={'device': 'cuda'})

vector_store = Chroma(
    client=client,
    collection_name="my_collection",
    embedding_function=embedding_function,
)

documents = [
    Document(
        id='1',
        page_content='This is a document about fruit',
        metadata={'title': 'First Doc'},
    ),
    Document(
        id='2',
        page_content='This is a document about oranges',
        metadata={'title': 'Second Doc'},
    ),
    Document(
        id='3',
        page_content='I saw a lady wearing red dress',
        metadata={'title': 'Third Doc'},
    ),
    Document(
        id='4',
        page_content='Apples are red',
        metadata={'title': 'Fourth Doc'},
    ),
]

vector_store.add_documents(documents)

print(vector_store._collection.get(include=["documents"]))
print("db size ", vector_store._collection.count())

duplicate_document = [
    Document(
        id='1',
        page_content='This is a document about fruit',
        metadata={'title': 'First Doc'},
    )
]
vector_store.add_documents(duplicate_document)

print(vector_store._collection.get(include=["documents"]))
print("db size ", vector_store._collection.count())
Error Message and Stack Trace (if applicable)
No response
Description
langchain-chroma adds a duplicate entry to the database when a document with an existing ID is added again, whereas ChromaDB does not. The two therefore behave differently for the same input. For comparison, here are the same steps using ChromaDB directly:
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.Client()
embedder = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2", device='cuda')
collection = client.create_collection(
    name="my_collection",
    embedding_function=embedder,
    metadata={"hnsw:space": "cosine"},
)

collection.add(
    documents=[
        "This is a document about fruit",
        "This is a document about oranges",
        "I saw a lady wearing red dress",
        "Apples are red",
    ],
    ids=["1", "2", "3", "4"],
    metadatas=[
        {'title': 'First Doc'},
        {'title': 'Second Doc'},
        {'title': 'Third Doc'},
        {'title': 'Fourth Doc'},
    ],
)

print(collection.get(include=['documents']))
print("db size ", collection.count())

collection.add(
    documents=[
        "This is a document about fruit",
    ],
    ids=["1"],
    metadatas=[{'title': 'First Doc'}],
)

print(collection.get(include=['documents']))
print("db size ", collection.count())
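As a possible stopgap until the behavior is aligned, the incoming documents could be filtered against the IDs already stored before calling add_documents. The following is only a minimal sketch of that idea, not how either library works internally. The Doc class and the drop_existing helper are hypothetical names introduced for illustration, and the fifth document is made-up sample data. The sketch runs without chromadb installed, so the list of existing IDs is hard-coded here.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for langchain_core.documents.Document."""
    id: str
    page_content: str
    metadata: dict = field(default_factory=dict)


def drop_existing(docs, existing_ids):
    """Return only the documents whose id is not already stored."""
    seen = set(existing_ids)
    fresh = []
    for doc in docs:
        if doc.id not in seen:
            fresh.append(doc)
            seen.add(doc.id)  # also skip repeated ids within the same batch
    return fresh


# Assumption: in the real setup the stored ids could be looked up with
#   vector_store._collection.get(ids=[d.id for d in incoming])["ids"]
# which relies on the private _collection attribute used in the example above.
existing_ids = ["1", "2", "3", "4"]

incoming = [
    Doc(id="1", page_content="This is a document about fruit", metadata={"title": "First Doc"}),
    Doc(id="5", page_content="Bananas are yellow", metadata={"title": "Fifth Doc"}),  # made-up sample
]

new_docs = drop_existing(incoming, existing_ids)
print([d.id for d in new_docs])
```

With the real vector store, only new_docs would then be passed to vector_store.add_documents, so an ID that is already present would never be sent a second time.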
System Info
Python version: 3.10.10