langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Chroma returns the same document more than once when used as a retriever #22361

Open amirhagai opened 4 months ago

amirhagai commented 4 months ago

Example Code

import bs4
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
blog_docs = loader.load()

from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=300, 
    chunk_overlap=50)

splits = text_splitter.split_documents(blog_docs)

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_documents(documents=splits, 
                                    embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever()

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_documents(documents=splits, 
                                    embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
docs = retriever.get_relevant_documents("What is Task Decomposition?")

print(f"number of documents - {len(docs)}")
for doc in docs:
    print(f"document content - {doc.__dict__}")

the printed values are

document content - {'page_content': 'Fig. 1. Overview of a LLM-powered autonomous agent system.\nComponent One: Planning#\nA complicated task usually involves many steps. An agent needs to know what they are and plan ahead.\nTask Decomposition#\nChain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process.\nTree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.\nTask decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.', 'metadata': {'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}, 'type': 'Document'}

document content - [identical to the Document above, repeated]

document content - [identical to the Document above, repeated]

document content - {'page_content': 'Resources:\n1. Internet access for searches and information gathering.\n2. Long Term memory management.\n3. GPT-3.5 powered Agents for delegation of simple tasks.\n4. File output.\n\nPerformance Evaluation:\n1. Continuously review and analyze your actions to ensure you are performing to the best of your abilities.\n2. Constructively self-criticize your big-picture behavior constantly.\n3. Reflect on past decisions and strategies to refine your approach.\n4. Every command has a cost, so be smart and efficient. Aim to complete tasks in the least number of steps.', 'metadata': {'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}, 'type': 'Document'}

As you can see, three of the returned documents are identical.

I checked, and splits contains 52 documents, but the value of

res = vectorstore.get()
res.keys()

len(res['documents'])

is 156, so I think each document is stored 3 times instead of once.
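A quick way to confirm this kind of duplication is to count occurrences of each stored text. A minimal sketch in plain Python; the `stored` list here is hypothetical stand-in data for `vectorstore.get()["documents"]`, not the real blog chunks:

```python
from collections import Counter

# Stand-in for vectorstore.get()["documents"]: two distinct chunks,
# each stored three times (hypothetical data for illustration).
stored = ["chunk one text", "chunk two text"] * 3

counts = Counter(stored)
duplicated = {text: n for text, n in counts.items() if n > 1}

print(f"total stored: {len(stored)}, unique chunks: {len(counts)}")
```

If every chunk appears exactly 3 times, that is consistent with 156 = 52 * 3.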

Error Message and Stack Trace (if applicable)

No response

Description

I'm trying to use Chroma as a retriever in a toy example and expect to get different documents when get_relevant_documents is applied. Instead, I'm getting the same document 3 times.

System Info

langchain==0.2.1
langchain-community==0.2.1
langchain-core==0.2.3
langchain-openai==0.1.8
langchain-text-splitters==0.2.0
langchainhub==0.1.17

Linux, Python 3.10.12; I'm running on Colab.

klaudialemiec commented 4 months ago

Hello @amirhagai, in the code above you run the following fragment 3 times: vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings()). That means you add the documents to the vectorstore 3 times.

I don't know if there is any mechanism that prevents saving duplicates, but if you want to reset the state of your vectorstore, please run: vectorstore.delete_collection()
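One common way to make re-runs idempotent (not something the wrapper does automatically, as far as I know) is to derive deterministic IDs from each chunk's content, so re-adding the same chunks maps to the same IDs instead of appending fresh copies. Computing such IDs is plain Python; the `doc_id` helper and the sample strings below are illustrative assumptions, mirroring the page_content/source fields from the Documents above:

```python
import hashlib

def doc_id(page_content: str, source: str) -> str:
    """Derive a stable ID from a chunk's content and source URL,
    so the same chunk maps to the same ID on every run."""
    raw = f"{source}\n{page_content}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

url = "https://lilianweng.github.io/posts/2023-06-23-agent/"
a = doc_id("Task Decomposition# ...", url)
b = doc_id("Task Decomposition# ...", url)  # same chunk -> same ID
c = doc_id("a different chunk", url)        # different chunk -> different ID
```

Such IDs could then be passed via the ids parameter of from_documents; whether re-adding with the same IDs overwrites or errors depends on the Chroma version, so treat this as a sketch.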

amirhagai commented 4 months ago

Thanks for the clarification! It actually appears twice because I mistakenly copy-pasted the same cell twice :)

But I do see that when I run the cell more than once, the DB adds items to itself instead of creating a new instance. Is this the expected behavior? Just to make things clear, each time I run this cell -

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_documents(documents=splits, 
                                    embedding=OpenAIEmbeddings())

Chroma adds items, and the retrieval results change even though the retriever is not redefined.

The docs say that the function will "Create a Chroma vectorstore from a list of documents." https://api.python.langchain.com/en/latest/vectorstores/langchain_community.vectorstores.chroma.Chroma.html Thanks again :)

mariano22 commented 3 months ago

I had the same weird behavior while learning about LangChain.

I don't know if it's expected behavior, but I think it's weird and should not be like that. It's confusing because, as @amirhagai mentioned, they are two different instances of "Chroma" wrapper. But internally, they refer to the same chroma collection, which is the default collection: _LANGCHAIN_DEFAULT_COLLECTION_NAME (defined as "lanchain"). Also (and also mentioned by @amirhagai) the docstring and documentation says "Create a Chroma vectorstore" which reinforces the idea of a "new and fresh collection"

I think that at least the classmethods from_texts and from_documents, which are expected to be instance factories, should call delete_collection automatically.

spike-spiegel-21 commented 3 months ago

@amirhagai @mariano22

  1. Use from langchain_chroma.vectorstores import Chroma instead of from langchain_community.vectorstores import Chroma, because langchain_chroma.vectorstores is maintained regularly.

  2. The behavior of vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings()) is as expected: it creates (or reuses) a collection named _LANGCHAIN_DEFAULT_COLLECTION_NAME.

If you want a separate collection, specify the collection name when creating the Chroma client. Example:

vectorstore_1 = Chroma.from_documents(documents=splits, embedding = embedding, collection_name="col")
vectorstore_2 = Chroma.from_documents(documents=splits, embedding = embedding, collection_name="sol")

You can also persist them to your local disk; they are currently in RAM and will be lost as soon as you stop the kernel. Example:

vectorstore = Chroma.from_documents(documents=splits, embedding = embeddings, collection_name="sol", persist_directory="my_dir")

However, I was not able to find a classmethod that returns a vectorstore from a persisted collection without inserting new documents. @eyurtsev Please correct me if I am wrong on this.

mariano22 commented 3 months ago

Thanks, @spike-spiegel-21, for the clarifications. The confusion came from the fact that the static method specifies a way of creating a vector store, but it's unclear that, if the collection already exists, it will load it rather than overwrite it.

I think it would be more intuitive if from_documents performed a delete_collection when the collection name already exists.

But if you don't think the same, I would at least suggest clarifying this in the tutorial and, especially, in the documentation: https://api.python.langchain.com/en/latest/vectorstores/langchain_chroma.vectorstores.Chroma.html#langchain_chroma.vectorstores.Chroma.from_documents (I had to read the code to understand the behavior, or rather guess it from what I was observing.)