langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

When using Chroma, vector stores newly created with `from_texts()` do not delete previous documents #20866

Open chrispy-snps opened 2 months ago

chrispy-snps commented 2 months ago

Checked other resources

Example Code

When using FAISS, each vector store newly created with from_texts() is initialized with zero documents:

import langchain_community.embeddings
embeddings = langchain_community.embeddings.FakeEmbeddings(size=4)

import langchain_community.vectorstores
def make_vs(texts):
    return langchain_community.vectorstores.FAISS.from_texts(
        texts=texts,
        embedding=embeddings)

vs = make_vs(['a', 'b', 'c'])
print([d.page_content for d in vs.similarity_search('z', k=100)])
# >>> returns ['a', 'b', 'c']

vs = make_vs(['d', 'e', 'f'])
print([d.page_content for d in vs.similarity_search('z', k=100)])
# >>> returns ['d', 'e', 'f']

But when using Chroma, subsequent vector stores newly created with from_texts() still have documents from previous vector stores:

import langchain_community.embeddings
embeddings = langchain_community.embeddings.FakeEmbeddings(size=4)

import langchain_community.vectorstores
def make_vs(texts):
    return langchain_community.vectorstores.Chroma.from_texts(
        texts=texts,
        embedding=embeddings)

vs = make_vs(['a', 'b', 'c'])
print([d.page_content for d in vs.similarity_search('z', k=100)])
# >>> returns ['a', 'b', 'c']

vs = make_vs(['d', 'e', 'f'])
print([d.page_content for d in vs.similarity_search('z', k=100)])
# >>> returns ['a', 'b', 'c', 'd', 'e', 'f'] - INCORRECT

Error Message and Stack Trace (if applicable)

No response

Description

When I create a new Chroma vector store object with from_texts(), documents from previous vector stores are not deleted. The code above shows an example.

This occurs regardless of whether I assign to the same variable:

vs = make_vs(['a', 'b', 'c'])
vs = make_vs(['d', 'e', 'f'])

or to different variables:

vs1 = make_vs(['a', 'b', 'c'])
vs2 = make_vs(['d', 'e', 'f'])

or if I try to manually force object destruction:

vs = make_vs(['a', 'b', 'c'])
del vs
vs = make_vs(['d', 'e', 'f'])

Only an explicit delete_collection() will delete the documents:

vs = make_vs(['a', 'b', 'c'])
vs.delete_collection()
vs = make_vs(['d', 'e', 'f'])

but this is a workaround - the vector store is still incorrectly "sticky"; we're just deleting the documents. In addition, this is not an easy workaround for complex real-world cases where vector store operations are decentralized and called from different scopes.

If I want to add content to a vector store, I would use add_texts(). If I want to create a new vector store, then I would use from_texts(), and any previous vector store content should be disregarded by construction.
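If the current behavior stands, one interim way to avoid the collision (a minimal sketch, assuming the collection_name parameter of Chroma.from_texts() acts as a per-store namespace; the naming scheme here is made up) would be to give each new store its own collection name:

import uuid

import langchain_community.embeddings
import langchain_community.vectorstores

embeddings = langchain_community.embeddings.FakeEmbeddings(size=4)

def make_vs(texts):
    # Give every store its own Chroma collection so from_texts() cannot
    # "get" a collection left over from an earlier call.
    return langchain_community.vectorstores.Chroma.from_texts(
        texts=texts,
        embedding=embeddings,
        collection_name=f"vs-{uuid.uuid4().hex}")  # unique per store (naming scheme is mine)

vs = make_vs(['a', 'b', 'c'])
vs = make_vs(['d', 'e', 'f'])
print([d.page_content for d in vs.similarity_search('z', k=100)])
# expected: only ['d', 'e', 'f'], because the second store has its own collection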

System Info

System Information

OS: Linux
OS Version: #1 SMP Wed Aug 10 16:21:17 UTC 2022
Python Version: 3.11.0 (main, Nov 10 2022, 08:24:18) [GCC 8.2.0]

Package Information

langchain_core: 0.1.45
langchain: 0.1.16
langchain_community: 0.0.34
langsmith: 0.1.50
langchain_openai: 0.0.6
langchain_text_splitters: 0.0.1

Packages not installed (Not Necessarily a Problem)

The following packages were not found:

langgraph
langserve

AndresAlgaba commented 2 months ago

I believe the issue is that in Chroma the collections have the same default name and therefore you "get" the existing collection: https://github.com/langchain-ai/langchain/blob/6ccecf23639ef5cbebcbc4eaeda99eb1f7b84deb/libs/community/langchain_community/vectorstores/chroma.py#L126

whereas in FAISS, you always start from scratch (as far as I understand, no expert here): https://github.com/langchain-ai/langchain/blob/6ccecf23639ef5cbebcbc4eaeda99eb1f7b84deb/libs/community/langchain_community/vectorstores/faiss.py#L889

A potential fix would be to always remove the collection with the same name first?

chroma_collection.delete_collection()

If this is desired behaviour and an appropriate solution, I will try to implement it :)
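A rough sketch of what that could look like, going through the chromadb client directly (illustrative only; it hardcodes LangChain's default "langchain" collection name and is not the actual LangChain code):

import chromadb

client = chromadb.Client()
try:
    # Drop any collection left over from a previous from_texts() call
    # so the newly created store starts out empty.
    client.delete_collection(name="langchain")  # LangChain's default collection name
except Exception:
    pass  # nothing to delete on the first run
collection = client.get_or_create_collection(name="langchain")
print(collection.count())  # 0 - the collection is fresh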

chrispy-snps commented 2 months ago

@AndresAlgaba - I think you are right.

In the __init__() method of Langchain's class Chroma(VectorStore), the Chroma _client value is always set to the value returned by chromadb.Client(_client_settings). And this value does seem to be unique across calls:

>>> vs1 = make_vs(['a', 'b', 'c'])
>>> print([d.page_content for d in vs1.similarity_search('z', k=100)])
['a', 'b', 'c']

>>> vs2 = make_vs(['d', 'e', 'f'])
>>> print([d.page_content for d in vs2.similarity_search('z', k=100)])
['a', 'b', 'c', 'd', 'e', 'f']

>>> print(vs1._client is vs2._client)
False
>>> print(vs1._client == vs2._client)
False

so I guess this stickiness is inside the chromadb package itself somewhere?
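One way to check that guess (a diagnostic sketch; _client and _collection are private attributes of the LangChain wrapper, so this is not a stable API) is to compare the underlying collections instead of the wrapper clients:

# Even though the wrapper objects differ, both clients appear to be backed
# by the same in-process Chroma system, so they resolve the same default
# "langchain" collection.
print(vs1._client.list_collections())
print(vs2._client.list_collections())
print(vs1._collection.id == vs2._collection.id)  # True would mean one shared collection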

Langchain's purpose is to provide a common API with consistent behavior across the underlying components. We have a vector store manipulation utility that can process multiple stores in a run. When we switched from FAISS to Chroma, this utility stopped working because of this "stickiness" problem.

So, I think it would be good to hear from the Langchain devs on whether they define the reference behavior of the following methods to return a vector store with only the provided documents:

<any VectorStore class>.from_texts(texts=...)
<any VectorStore class>.from_documents(documents=...)

chrispy-snps commented 2 months ago

I tried simulating your delete_collection() fix:

import langchain_community.vectorstores
def make_vs(texts):
    vs = langchain_community.vectorstores.Chroma(embedding_function=embeddings)
    vs.delete_collection()  # <-----
    vs.from_texts(texts, embedding=embeddings)
    return vs

but that throws an exception:

InvalidCollectionException: Collection 3b310aa8-9b60-4222-8639-7dac1d3a29e8 does not exist.

So again, I think we're at the point where the Langchain devs need to answer whether the from_texts() and from_documents() methods should create a fresh vector store with only the provided documents.
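For what it's worth, the exception is probably because from_texts() is a classmethod that builds and returns a new Chroma instance; the value returned by vs.from_texts(...) is discarded, and vs keeps pointing at the collection that was just deleted. A corrected version of the simulated workaround might look like this (still just a sketch built around the explicit delete):

import langchain_community.vectorstores

def make_vs(texts):
    # Clear out the default collection first, then return the new store
    # that from_texts() actually creates.
    stale = langchain_community.vectorstores.Chroma(embedding_function=embeddings)
    stale.delete_collection()
    return langchain_community.vectorstores.Chroma.from_texts(
        texts=texts,
        embedding=embeddings)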

AndresAlgaba commented 2 months ago

I am very new to both chromadb and langchain, but it seems to me that the problem is that the collection UUID is kept across calls:

import langchain_community.embeddings
embeddings = langchain_community.embeddings.FakeEmbeddings(size=4)

import langchain_community.vectorstores
def make_vs(texts):
    return langchain_community.vectorstores.Chroma.from_texts(
        texts=texts,
        embedding=embeddings,
    )

vs1 = make_vs(['a', 'b', 'c'])
print(vs1._collection)

vs2 = make_vs(['d', 'e', 'f'])
print(vs2._collection)

results in

name='langchain' id=UUID('ef558cb7-d310-4807-8e53-c8c5076c8cb0') metadata=None tenant='default_tenant' database='default_database'
name='langchain' id=UUID('ef558cb7-d310-4807-8e53-c8c5076c8cb0') metadata=None tenant='default_tenant' database='default_database'

This seems to solve the issue:

import langchain_community.embeddings
embeddings = langchain_community.embeddings.FakeEmbeddings(size=4)

import langchain_community.vectorstores
def make_vs(texts):
    return langchain_community.vectorstores.Chroma.from_texts(
        texts=texts,
        embedding=embeddings,
    )

vs = make_vs(['a', 'b', 'c'])
print(vs._collection)
print([d.page_content for d in vs.similarity_search('z', k=100)])
vs.delete_collection()

vs = make_vs(['d', 'e', 'f'])
print(vs._collection)
print([d.page_content for d in vs.similarity_search('z', k=100)])
vs.delete_collection()

results in

name='langchain' id=UUID('c5fe2af1-aa5b-4f13-b59a-c1ffc9e0bd9d') metadata=None tenant='default_tenant' database='default_database'
Number of requested results 100 is greater than number of elements in index 3, updating n_results = 3
['a', 'b', 'c']
name='langchain' id=UUID('0402a4f6-b90d-4f87-850c-36e500bb7aa1') metadata=None tenant='default_tenant' database='default_database'
Number of requested results 100 is greater than number of elements in index 3, updating n_results = 3
['f', 'e', 'd']

where we get a new collection UUID.
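If an explicit delete_collection() is the practical answer for now, it can at least be packaged so callers cannot forget it (a sketch only; fresh_chroma is a hypothetical helper, not part of LangChain, and it reuses the embeddings object from above):

import contextlib

import langchain_community.vectorstores

@contextlib.contextmanager
def fresh_chroma(texts):
    # Build a store from `texts` and make sure its collection is deleted
    # when the caller is done with it.
    vs = langchain_community.vectorstores.Chroma.from_texts(
        texts=texts,
        embedding=embeddings)
    try:
        yield vs
    finally:
        vs.delete_collection()

with fresh_chroma(['a', 'b', 'c']) as vs:
    print([d.page_content for d in vs.similarity_search('z', k=100)])
# the collection is deleted here, so the next store starts empty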

qingdengyue commented 2 months ago

> So, I think it would be good to hear from the Langchain devs on whether they define the reference behavior of the following methods to return a vector store with only the provided documents:
>
> <any VectorStore class>.from_texts(texts=...)
> <any VectorStore class>.from_documents(documents=...)

According to the base class docstrings, I think it should recreate the documents; from_texts() should behave the same way.

"Return VectorStore initialized from documents and embeddings."

https://github.com/langchain-ai/langchain/blob/4c437ebb9c2fb532ce655ac1e0c354c82a715df7/libs/core/langchain_core/vectorstores.py#L541

but the Chroma from_documents()/from_texts() docstring only says "texts (List[str]): List of texts to add to the collection": https://github.com/langchain-ai/langchain/blob/6ccecf23639ef5cbebcbc4eaeda99eb1f7b84deb/libs/community/langchain_community/vectorstores/chroma.py#L682
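For reference, the base class's from_documents() essentially just unpacks the Documents and delegates to from_texts() (a paraphrase of the linked source, not a verbatim copy), which is why both methods need the same clarified behavior:

from typing import Any, List

from langchain_core.documents import Document

def from_documents_sketch(vectorstore_cls: Any, documents: List[Document],
                          embedding: Any, **kwargs: Any):
    # Roughly what VectorStore.from_documents() does: unpack page_content
    # and metadata, then hand everything to from_texts().
    texts = [d.page_content for d in documents]
    metadatas = [d.metadata for d in documents]
    return vectorstore_cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)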

I looked at how other classes in this package implement these base class methods, and the behavior is inconsistent: some create a new store while others just add texts. It is confusing. If the LangChain devs confirm the intended behavior, reimplementing it may be a breaking change.