langchain-ai / langchain


Chroma Update document in Langchain forces unwanted behaviour #25392

Open ichigo97 opened 2 months ago

ichigo97 commented 2 months ago

Example Code

from langchain_core.documents import Document
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.")

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.")

document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!")

documents = [
    document_1,
    document_2,
    document_3
    ]

ids = [str(i) for i in range(len(documents))]

embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
vectorstore = Chroma(collection_name='test', embedding_function=embeddings, persist_directory='testdb')
vectorstore.add_documents(documents=documents, ids=ids)

# IDs were stored as strings above, so the ID to update must be a string too.
id_to_be_updated = "2"
# No metadata is passed here, which is what triggers the error below.
updated_doc = Document(page_content="This is a test document.")
vectorstore.update_documents(ids=[id_to_be_updated], documents=[updated_doc])

Error Message and Stack Trace (if applicable)

Traceback (most recent call last):
  File "C:\Users\1956750\PycharmProjects\vectordb_crud\adhoc.py", line 62, in <module>
    vectorstore.update_documents(ids=[replace_id], documents=[Document(page_content="Raspberry pi is a microprocessor",)])
  File "C:\Users\1956750\PycharmProjects\vectordb_crud.venv\lib\site-packages\langchain_community\vectorstores\chroma.py", line 774, in update_documents
    self._collection.update(
  File "C:\Users\1956750\PycharmProjects\vectordb_crud.venv\lib\site-packages\chromadb\api\models\Collection.py", line 259, in update
    ) = self._validate_and_prepare_update_request(
  File "C:\Users\1956750\PycharmProjects\vectordb_crud.venv\lib\site-packages\chromadb\api\models\CollectionCommon.py", line 480, in _validate_and_prepare_update_request
    ) = self._validate_embedding_set(
  File "C:\Users\1956750\PycharmProjects\vectordb_crud.venv\lib\site-packages\chromadb\api\models\CollectionCommon.py", line 182, in _validate_embedding_set
    validate_metadatas(maybe_cast_one_to_many_metadata(metadatas))
  File "C:\Users\1956750\PycharmProjects\vectordb_crud.venv\lib\site-packages\chromadb\api\types.py", line 336, in validate_metadatas
    validate_metadata(metadata)
  File "C:\Users\1956750\PycharmProjects\vectordb_crud.venv\lib\site-packages\chromadb\api\types.py", line 288, in validate_metadata
    raise ValueError(
ValueError: Expected metadata to be a non-empty dict, got 0 metadata attributes

Description

When I call update_documents on an already existing Chroma collection, I get a ValueError if the Document has no metadata, even though the metadata parameter is optional. Following the stack trace, I found that update_documents builds a metadata list from the documents even when no metadata was supplied, so a document without metadata contributes an empty dict. That empty value fails the validate_metadata check in the chromadb library: the check accepts None when metadata is omitted, but rejects an empty dict.

As a workaround I attached some arbitrary metadata to the document, and the update then worked as expected. I don't believe the current behaviour is intended, because the Chroma developers deliberately allow metadata to be omitted in their checks.
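The failure and the workaround described above can be reproduced without either library. The sketch below is a simplified stand-in, not the actual chromadb or LangChain code: `Doc`, `check_metadata`, and `with_placeholder_metadata` are hypothetical names, and the `"source": "unknown"` placeholder is an arbitrary choice.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Doc:
    """Stand-in for a document whose metadata defaults to an empty dict."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def check_metadata(metadata: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Simplified version of the check described in the issue:
    None is accepted, an empty dict is rejected."""
    if metadata is None:
        return None
    if not isinstance(metadata, dict):
        raise ValueError(
            f"Expected metadata to be a dict or None, got {type(metadata).__name__}"
        )
    if len(metadata) == 0:
        raise ValueError(
            f"Expected metadata to be a non-empty dict, got {len(metadata)} metadata attributes"
        )
    return metadata


def with_placeholder_metadata(doc: Doc) -> Doc:
    """Workaround: give a document without metadata one arbitrary key."""
    if doc.metadata:
        return doc
    return Doc(page_content=doc.page_content, metadata={"source": "unknown"})


updated_doc = Doc(page_content="This is a test document.")

try:
    check_metadata(updated_doc.metadata)  # metadata is {} here
    failed = False
except ValueError:
    failed = True

patched_doc = with_placeholder_metadata(updated_doc)
check_metadata(patched_doc.metadata)  # passes: the dict is no longer empty
```

The placeholder approach matches the workaround in the report, but it writes a meaningless key into the collection, which is why a fix on the library side seems preferable.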

System Info

System Information

OS: Windows
OS Version: 10.0.22631
Python Version: 3.10.10 (tags/v3.10.10:aad5f6a, Feb 7 2023, 17:20:36) [MSC v.1929 64 bit (AMD64)]

Package Information

langchain_core: 0.2.30
langchain: 0.2.13
langchain_community: 0.2.12
langsmith: 0.1.99
langchain_chroma: 0.1.2
langchain_text_splitters: 0.2.2

Optional packages not installed

langgraph
langserve

Other Dependencies

aiohttp: 3.10.3
async-timeout: 4.0.3
chromadb: 0.5.5
dataclasses-json: 0.6.7
fastapi: 0.112.0
jsonpatch: 1.33
numpy: 1.26.3
orjson: 3.10.7
packaging: 24.1
pydantic: 2.8.2
PyYAML: 6.0.2
requests: 2.32.3
SQLAlchemy: 2.0.32
tenacity: 8.5.0
typing-extensions: 4.9.0

ichigo97 commented 2 months ago

Reference link to the line causing the error: https://github.com/langchain-ai/langchain/blob/f4196f1fb8d31950ca42fb068a21519d1aee1970/libs/community/langchain_community/vectorstores/chroma.py#L748
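One way the call site could avoid the problem is to normalise empty metadata before handing it to the collection. This is only a sketch of the idea, not the maintainers' fix: `normalize_metadatas` is a hypothetical helper, and it assumes, as stated in the description, that the check accepts None in place of missing metadata. Whether a list mixing None and dict entries is accepted end to end has not been verified here.

```python
from typing import Any, Dict, List, Optional

Metadata = Dict[str, Any]


def normalize_metadatas(
    metadatas: List[Optional[Metadata]],
) -> Optional[List[Optional[Metadata]]]:
    """Replace empty metadata dicts with None.

    Returns None outright when no document carries any metadata, so the
    caller can omit the argument entirely.
    """
    cleaned = [m if m else None for m in metadatas]
    if all(m is None for m in cleaned):
        return None
    return cleaned


# Every document lacks metadata: the whole argument collapses to None.
no_metadata = normalize_metadatas([{}, {}])

# Only some documents lack metadata: those entries become None.
mixed = normalize_metadatas([{"topic": "weather"}, {}])
```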