Open chrispy-snps opened 2 months ago
I believe the issue is that in Chroma the collections have the same default name and therefore you "get" the existing collection: https://github.com/langchain-ai/langchain/blob/6ccecf23639ef5cbebcbc4eaeda99eb1f7b84deb/libs/community/langchain_community/vectorstores/chroma.py#L126
whereas in FAISS, you always start from scratch (as far as I understand, no expert here): https://github.com/langchain-ai/langchain/blob/6ccecf23639ef5cbebcbc4eaeda99eb1f7b84deb/libs/community/langchain_community/vectorstores/faiss.py#L889
A potential fix would be to always delete the collection with the same name first:

```python
chroma_collection.delete_collection()
```
If this is desired behaviour and an appropriate solution, I will try to implement it :)
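To illustrate the name collision, here is a self-contained toy model (my own sketch, not the real chromadb API): get-or-create by a shared default name returns the existing collection, so a second `from_texts()`-style call sees the first store's documents, while deleting the collection first yields a fresh one:

```python
# Toy model of Chroma's get-or-create-by-name behavior (NOT the real chromadb API).
DEFAULT_NAME = "langchain"
_collections = {}  # shared registry keyed by collection name

def get_or_create_collection(name=DEFAULT_NAME):
    # Same name -> same underlying collection, just like Chroma's default.
    return _collections.setdefault(name, [])

def delete_collection(name=DEFAULT_NAME):
    _collections.pop(name, None)

def from_texts(texts, name=DEFAULT_NAME):
    collection = get_or_create_collection(name)
    collection.extend(texts)
    return collection

vs1 = from_texts(["a", "b", "c"])
vs2 = from_texts(["d", "e", "f"])  # reuses the existing "langchain" collection
print(vs2)                         # ['a', 'b', 'c', 'd', 'e', 'f'] - sticky!

delete_collection()                # the proposed fix: drop the old collection first
vs3 = from_texts(["d", "e", "f"])
print(vs3)                         # ['d', 'e', 'f'] - fresh store
```

In this model the fix is exactly the one proposed above: a delete before the get-or-create guarantees the collection starts empty.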
@AndresAlgaba - I think you are right.
In the `__init__()` method of Langchain's `class Chroma(VectorStore)`, the Chroma `_client` value is always set to the value returned by `chromadb.Client(_client_settings)`. And this value does seem to be unique across calls:
```python
>>> vs1 = make_vs(['a', 'b', 'c'])
>>> print([d.page_content for d in vs1.similarity_search('z', k=100)])
['a', 'b', 'c']
>>> vs2 = make_vs(['d', 'e', 'f'])
>>> print([d.page_content for d in vs2.similarity_search('z', k=100)])
['a', 'b', 'c', 'd', 'e', 'f']
>>> print(vs1._client is vs2._client)
False
>>> print(vs1._client == vs2._client)
False
```
so I guess this stickiness is inside the `chromadb` package itself somewhere?
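The `is` / `==` checks above show the client wrapper objects really are distinct, yet the data persists. A toy sketch (an assumption about how an in-process default could behave, not chromadb's actual implementation) of how distinct client objects can still share one backing store:

```python
# Toy sketch (assumed behavior, NOT chromadb's actual implementation):
# each Client() call returns a distinct wrapper object, but all wrappers
# read and write one module-level store, so data "sticks" across clients.
_shared_store = {}  # module-level state shared by every client instance

class Client:
    def __init__(self):
        self.store = _shared_store  # new wrapper object, same backing dict

c1 = Client()
c1.store["langchain"] = ["a", "b", "c"]
c2 = Client()

print(c1 is c2)               # False - distinct wrapper objects...
print(c2.store["langchain"])  # ['a', 'b', 'c'] - ...same underlying data
```

If chromadb does something like this internally, identity checks on the wrappers would be `False` while the documents still accumulate, which matches the observations above.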
Langchain's purpose is to provide a common API with consistent behavior across the underlying components. We have a vector store manipulation utility that can process multiple stores in a run. When we switched from FAISS to Chroma, this utility stopped working because of this "stickiness" problem.
So, I think it would be good to hear from the Langchain devs on whether they define the reference behavior of the following methods to return a vector store with only the provided documents:
```python
<any VectorStore class>.from_texts(texts=...)
<any VectorStore class>.from_documents(documents=...)
```
I tried simulating your `delete_collection()` fix:
```python
import langchain_community.vectorstores

def make_vs(texts):
    vs = langchain_community.vectorstores.Chroma(embedding_function=embeddings)
    vs.delete_collection()  # <-----
    vs.from_texts(texts, embedding=embeddings)
    return vs
```
but that throws an exception:
```
InvalidCollectionException: Collection 3b310aa8-9b60-4222-8639-7dac1d3a29e8 does not exist.
```
So again, I think we're at the point where the Langchain devs need to answer whether the `from_texts()` and `from_documents()` methods should create a fresh vector store with only the provided documents.
I am very new to both chromadb and langchain, but it seems to me that the problem is that the collection UUID is kept:
```python
import langchain_community.embeddings
embeddings = langchain_community.embeddings.FakeEmbeddings(size=4)

import langchain_community.vectorstores

def make_vs(texts):
    return langchain_community.vectorstores.Chroma.from_texts(
        texts=texts,
        embedding=embeddings,
    )

vs1 = make_vs(['a', 'b', 'c'])
print(vs1._collection)
vs2 = make_vs(['d', 'e', 'f'])
print(vs2._collection)
```
results in
```
name='langchain' id=UUID('ef558cb7-d310-4807-8e53-c8c5076c8cb0') metadata=None tenant='default_tenant' database='default_database'
name='langchain' id=UUID('ef558cb7-d310-4807-8e53-c8c5076c8cb0') metadata=None tenant='default_tenant' database='default_database'
```
Calling `delete_collection()` explicitly after each use:

```python
import langchain_community.embeddings
embeddings = langchain_community.embeddings.FakeEmbeddings(size=4)

import langchain_community.vectorstores

def make_vs(texts):
    return langchain_community.vectorstores.Chroma.from_texts(
        texts=texts,
        embedding=embeddings,
    )

vs = make_vs(['a', 'b', 'c'])
print(vs._collection)
print([d.page_content for d in vs.similarity_search('z', k=100)])
vs.delete_collection()

vs = make_vs(['d', 'e', 'f'])
print(vs._collection)
print([d.page_content for d in vs.similarity_search('z', k=100)])
vs.delete_collection()
```
results in
```
name='langchain' id=UUID('c5fe2af1-aa5b-4f13-b59a-c1ffc9e0bd9d') metadata=None tenant='default_tenant' database='default_database'
Number of requested results 100 is greater than number of elements in index 3, updating n_results = 3
['a', 'b', 'c']
name='langchain' id=UUID('0402a4f6-b90d-4f87-850c-36e500bb7aa1') metadata=None tenant='default_tenant' database='default_database'
Number of requested results 100 is greater than number of elements in index 3, updating n_results = 3
['f', 'e', 'd']
```
where we get a new collection UUID.
According to the base class's docstring, I think `from_documents()` should recreate the documents, and `from_texts()` is the same:

`Return VectorStore initialized from documents and embeddings.`

But the Chroma `from_documents()` / `from_texts()` docstring says:

`texts (List[str]): List of texts to add to the collection`

https://github.com/langchain-ai/langchain/blob/6ccecf23639ef5cbebcbc4eaeda99eb1f7b84deb/libs/community/langchain_community/vectorstores/chroma.py#L682

I looked at the implementations of the other classes in the current package, and the behaviors of the base class methods they implement are inconsistent: some create a new store while others add texts. Confusing. Even if confirmed by the Langchain devs, reimplementing this may be a breaking change.
Example Code
When using FAISS, each vector store newly created with `from_texts()` is initialized with zero documents. But when using Chroma, subsequent vector stores newly created with `from_texts()` still have documents from previous vector stores.

Error Message and Stack Trace (if applicable)
No response
Description
When I create a new Chroma vector store object with `from_texts()`, documents from previous vector stores are not deleted. The code above shows an example.

This occurs regardless of whether I assign to the same variable, or to different variables, or if I try to manually force object destruction.

Only an explicit `delete_collection()` will delete the documents, but this is a workaround - the vector store is still incorrectly "sticky"; we're just deleting the documents. In addition, this is not an easy workaround for complex real-world cases where vector store operations are decentralized and called in different scopes.

If I want to add content to a vector store, I would use `add_texts()`. If I want to create a new vector store, then I would use `from_texts()`, and any previous vector store content should be disregarded by construction.