langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

When using Chroma, vector stores newly created with `from_texts()` do not delete previous documents #20866

Open chrispy-snps opened 2 months ago

chrispy-snps commented 2 months ago

Checked other resources

Example Code

When using FAISS, each vector store newly created with from_texts() is initialized with zero documents:

import langchain_community.embeddings
embeddings = langchain_community.embeddings.FakeEmbeddings(size=4)

import langchain_community.vectorstores
def make_vs(texts):
    return langchain_community.vectorstores.FAISS.from_texts(
        texts=texts,
        embedding=embeddings)

vs = make_vs(['a', 'b', 'c'])
print([d.page_content for d in vs.similarity_search('z', k=100)])
# >>> returns ['a', 'b', 'c']

vs = make_vs(['d', 'e', 'f'])
print([d.page_content for d in vs.similarity_search('z', k=100)])
# >>> returns ['d', 'e', 'f']

But when using Chroma, subsequent vector stores newly created with from_texts() still have documents from previous vector stores:

import langchain_community.embeddings
embeddings = langchain_community.embeddings.FakeEmbeddings(size=4)

import langchain_community.vectorstores
def make_vs(texts):
    return langchain_community.vectorstores.Chroma.from_texts(
        texts=texts,
        embedding=embeddings)

vs = make_vs(['a', 'b', 'c'])
print([d.page_content for d in vs.similarity_search('z', k=100)])
# >>> returns ['a', 'b', 'c']

vs = make_vs(['d', 'e', 'f'])
print([d.page_content for d in vs.similarity_search('z', k=100)])
# >>> returns ['a', 'b', 'c', 'd', 'e', 'f'] - INCORRECT

Error Message and Stack Trace (if applicable)

No response

Description

When I create a new Chroma vector store object with from_texts(), documents from previous vector stores are not deleted. The code above shows an example.

This occurs regardless of whether I assign to the same variable:

vs = make_vs(['a', 'b', 'c'])
vs = make_vs(['d', 'e', 'f'])

or to different variables:

vs1 = make_vs(['a', 'b', 'c'])
vs2 = make_vs(['d', 'e', 'f'])

or if I try to manually force object destruction:

vs = make_vs(['a', 'b', 'c'])
del vs
vs = make_vs(['d', 'e', 'f'])

Only an explicit delete_collection() will delete the documents:

vs = make_vs(['a', 'b', 'c'])
vs.delete_collection()
vs = make_vs(['d', 'e', 'f'])

but this is a workaround - the vector store is still incorrectly "sticky"; we're just deleting the documents. In addition, this is not an easy workaround for complex real-world cases where vector store operations are decentralized and called from different scopes.

If I want to add content to a vector store, I would use add_texts(). If I want to create a new vector store, then I would use from_texts(), and any previous vector store content should be disregarded by construction.
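If the current behavior stands, one interim way to avoid the collision (a minimal sketch, assuming the collection_name parameter of Chroma.from_texts() acts as a per-store namespace; the naming scheme here is made up) would be to give each new store its own collection name:

import uuid

import langchain_community.embeddings
import langchain_community.vectorstores

embeddings = langchain_community.embeddings.FakeEmbeddings(size=4)

def make_vs(texts):
    # Give every store its own Chroma collection so from_texts() cannot
    # "get" a collection left over from an earlier call.
    return langchain_community.vectorstores.Chroma.from_texts(
        texts=texts,
        embedding=embeddings,
        collection_name=f"vs-{uuid.uuid4().hex}")  # unique per store (naming scheme is mine)

vs = make_vs(['a', 'b', 'c'])
vs = make_vs(['d', 'e', 'f'])
print([d.page_content for d in vs.similarity_search('z', k=100)])
# expected: only ['d', 'e', 'f'], because the second store has its own collection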

System Info

System Information

OS: Linux
OS Version: #1 SMP Wed Aug 10 16:21:17 UTC 2022
Python Version: 3.11.0 (main, Nov 10 2022, 08:24:18) [GCC 8.2.0]

Package Information

langchain_core: 0.1.45
langchain: 0.1.16
langchain_community: 0.0.34
langsmith: 0.1.50
langchain_openai: 0.0.6
langchain_text_splitters: 0.0.1

Packages not installed (Not Necessarily a Problem)

The following packages were not found:

langgraph
langserve

AndresAlgaba commented 2 months ago

I believe the issue is that in Chroma the collections have the same default name and therefore you "get" the existing collection: https://github.com/langchain-ai/langchain/blob/6ccecf23639ef5cbebcbc4eaeda99eb1f7b84deb/libs/community/langchain_community/vectorstores/chroma.py#L126

whereas in FAISS, you always start from scratch (as far as I understand, no expert here): https://github.com/langchain-ai/langchain/blob/6ccecf23639ef5cbebcbc4eaeda99eb1f7b84deb/libs/community/langchain_community/vectorstores/faiss.py#L889

A potential fix would be to always remove the collection with the same name first?

chroma_collection.delete_collection()

If this is desired behaviour and an appropriate solution, I will try to implement it :)
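A rough sketch of what that could look like, going through the chromadb client directly (illustrative only; it hardcodes LangChain's default "langchain" collection name and is not the actual LangChain code):

import chromadb

client = chromadb.Client()
try:
    # Drop any collection left over from a previous from_texts() call
    # so the newly created store starts out empty.
    client.delete_collection(name="langchain")  # LangChain's default collection name
except Exception:
    pass  # nothing to delete on the first run
collection = client.get_or_create_collection(name="langchain")
print(collection.count())  # 0 - the collection is fresh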

chrispy-snps commented 2 months ago

@AndresAlgaba - I think you are right.

In the __init__() method of Langchain's class Chroma(VectorStore), the Chroma _client value is always set to the value returned by chromadb.Client(_client_settings). And this value does seem to be unique across calls:

>>> vs1 = make_vs(['a', 'b', 'c'])
>>> print([d.page_content for d in vs1.similarity_search('z', k=100)])
['a', 'b', 'c']

>>> vs2 = make_vs(['d', 'e', 'f'])
>>> print([d.page_content for d in vs2.similarity_search('z', k=100)])
['a', 'b', 'c', 'd', 'e', 'f']

>>> print(vs1._client is vs2._client)
False
>>> print(vs1._client == vs2._client)
False

so I guess this stickiness is inside the chromadb package itself somewhere?
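One way to check that guess (a diagnostic sketch; _client and _collection are private attributes of the LangChain wrapper, so this is not a stable API) is to compare the underlying collections instead of the wrapper clients:

# Even though the wrapper objects differ, both clients appear to be backed
# by the same in-process Chroma system, so they resolve the same default
# "langchain" collection.
print(vs1._client.list_collections())
print(vs2._client.list_collections())
print(vs1._collection.id == vs2._collection.id)  # True would mean one shared collection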

Langchain's purpose is to provide a common API with consistent behavior across the underlying components. We have a vector store manipulation utility that can process multiple stores in a run. When we switched from FAISS to Chroma, this utility stopped working because of this "stickiness" problem.

So, I think it would be good to hear from the Langchain devs on whether they define the reference behavior of the following methods to return a vector store with only the provided documents:

<any VectorStore class>.from_texts(texts=...)
<any VectorStore class>.from_documents(documents=...)

chrispy-snps commented 2 months ago

I tried simulating your delete_collection() fix:

import langchain_community.vectorstores
def make_vs(texts):
    vs = langchain_community.vectorstores.Chroma(embedding_function=embeddings)
    vs.delete_collection()  # <-----
    vs.from_texts(texts, embedding=embeddings)
    return vs

but that throws an exception:

InvalidCollectionException: Collection 3b310aa8-9b60-4222-8639-7dac1d3a29e8 does not exist.

So again, I think we're at the point where the Langchain devs need to answer whether the from_texts() and from_documents() methods should create a fresh vector store with only the provided documents.
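For what it's worth, the exception is probably because from_texts() is a classmethod that builds and returns a new Chroma instance; the value returned by vs.from_texts(...) is discarded, and vs keeps pointing at the collection that was just deleted. A corrected version of the simulated workaround might look like this (still just a sketch built around the explicit delete):

import langchain_community.vectorstores

def make_vs(texts):
    # Clear out the default collection first, then return the new store
    # that from_texts() actually creates.
    stale = langchain_community.vectorstores.Chroma(embedding_function=embeddings)
    stale.delete_collection()
    return langchain_community.vectorstores.Chroma.from_texts(
        texts=texts,
        embedding=embeddings)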

AndresAlgaba commented 2 months ago

I am very new to both chromadb and langchain, but it seems to me that the problem is that the collection UUID is kept across calls:

import langchain_community.embeddings
embeddings = langchain_community.embeddings.FakeEmbeddings(size=4)

import langchain_community.vectorstores
def make_vs(texts):
    return langchain_community.vectorstores.Chroma.from_texts(
        texts=texts,
        embedding=embeddings,
    )

vs1 = make_vs(['a', 'b', 'c'])
print(vs1._collection)

vs2 = make_vs(['d', 'e', 'f'])
print(vs2._collection)

results in

name='langchain' id=UUID('ef558cb7-d310-4807-8e53-c8c5076c8cb0') metadata=None tenant='default_tenant' database='default_database'
name='langchain' id=UUID('ef558cb7-d310-4807-8e53-c8c5076c8cb0') metadata=None tenant='default_tenant' database='default_database'

This seems to solve the issue:

import langchain_community.embeddings
embeddings = langchain_community.embeddings.FakeEmbeddings(size=4)

import langchain_community.vectorstores
def make_vs(texts):
    return langchain_community.vectorstores.Chroma.from_texts(
        texts=texts,
        embedding=embeddings,
    )

vs = make_vs(['a', 'b', 'c'])
print(vs._collection)
print([d.page_content for d in vs.similarity_search('z', k=100)])
vs.delete_collection()

vs = make_vs(['d', 'e', 'f'])
print(vs._collection)
print([d.page_content for d in vs.similarity_search('z', k=100)])
vs.delete_collection()

results in

name='langchain' id=UUID('c5fe2af1-aa5b-4f13-b59a-c1ffc9e0bd9d') metadata=None tenant='default_tenant' database='default_database'
Number of requested results 100 is greater than number of elements in index 3, updating n_results = 3
['a', 'b', 'c']
name='langchain' id=UUID('0402a4f6-b90d-4f87-850c-36e500bb7aa1') metadata=None tenant='default_tenant' database='default_database'
Number of requested results 100 is greater than number of elements in index 3, updating n_results = 3
['f', 'e', 'd']

where we get a new collection UUID.
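If an explicit delete_collection() is the practical answer for now, it can at least be packaged so callers cannot forget it (a sketch only; fresh_chroma is a hypothetical helper, not part of LangChain, and it reuses the embeddings object from above):

import contextlib

import langchain_community.vectorstores

@contextlib.contextmanager
def fresh_chroma(texts):
    # Build a store from `texts` and make sure its collection is deleted
    # when the caller is done with it.
    vs = langchain_community.vectorstores.Chroma.from_texts(
        texts=texts,
        embedding=embeddings)
    try:
        yield vs
    finally:
        vs.delete_collection()

with fresh_chroma(['a', 'b', 'c']) as vs:
    print([d.page_content for d in vs.similarity_search('z', k=100)])
# the collection is deleted here, so the next store starts empty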

qingdengyue commented 2 months ago

> So, I think it would be good to hear from the Langchain devs on whether they define the reference behavior of the following methods to return a vector store with only the provided documents:
>
> <any VectorStore class>.from_texts(texts=...)
> <any VectorStore class>.from_documents(documents=...)

According to the base class docstrings, I think it should recreate the documents; from_texts() should behave the same way.

"Return VectorStore initialized from documents and embeddings."

https://github.com/langchain-ai/langchain/blob/4c437ebb9c2fb532ce655ac1e0c354c82a715df7/libs/core/langchain_core/vectorstores.py#L541

but the Chroma from_documents()/from_texts() docstring only says "texts (List[str]): List of texts to add to the collection": https://github.com/langchain-ai/langchain/blob/6ccecf23639ef5cbebcbc4eaeda99eb1f7b84deb/libs/community/langchain_community/vectorstores/chroma.py#L682
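For reference, the base class's from_documents() essentially just unpacks the Documents and delegates to from_texts() (a paraphrase of the linked source, not a verbatim copy), which is why both methods need the same clarified behavior:

from typing import Any, List

from langchain_core.documents import Document

def from_documents_sketch(vectorstore_cls: Any, documents: List[Document],
                          embedding: Any, **kwargs: Any):
    # Roughly what VectorStore.from_documents() does: unpack page_content
    # and metadata, then hand everything to from_texts().
    texts = [d.page_content for d in documents]
    metadatas = [d.metadata for d in documents]
    return vectorstore_cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)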

I looked at how other classes in this package implement these base class methods, and the behavior is inconsistent: some create a new store while others just add texts. It is confusing. If the LangChain devs confirm the intended behavior, reimplementing it may be a breaking change.