SemanticSimilarityExampleSelector with Chroma returns duplicated examples

Checked other resources

[X] I added a very descriptive title to this issue.
[X] I searched the LangChain documentation with the integrated search.
[X] I used the GitHub search to find a similar question and didn't find it.
[X] I am sure that this is a bug in LangChain rather than my code.

Example Code

The following code returns the sunny example twice. The same code produces correct output when FAISS is used instead of Chroma.

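from langchain_community.vectorstores import Chroma
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_google_vertexai import VertexAIEmbeddings

# example_prompt was not included in the original snippet; this definition is
# inferred from the "Input: ... / Output: ..." layout of the output below.
example_prompt = PromptTemplate(
    input_variables=["input", "output"],
    template="Input: {input}\nOutput: {output}",
)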

examples = [
    {"input": "happy", "output": "sad"},
    {"input": "tall", "output": "short"},
    {"input": "energetic", "output": "lethargic"},
    {"input": "sunny", "output": "gloomy"},
    {"input": "slow", "output": "fast"},    
    {"input": "windy", "output": "calm"},
]

example_selector3 = SemanticSimilarityExampleSelector.from_examples(
    examples,
    VertexAIEmbeddings("textembedding-gecko@001"),
    Chroma,
    k=2,
)

similar_prompt = FewShotPromptTemplate(
    example_selector=example_selector3,
    example_prompt=example_prompt,
    prefix="Give the antonym of every input",
    suffix="Input: {adjective}\nOutput:",
    input_variables=["adjective"],
)

print(similar_prompt.format(adjective="rainny"))

The output looks like this:

Give the antonym of every input

Input: sunny
Output: gloomy

Input: sunny
Output: gloomy

Input: rainny
Output:

Error Message and Stack Trace (if applicable)

No response

Description

Below is the output using FAISS. I do not expect the two outputs to be identical, but the output should at least not contain duplicates. With k=10, Chroma repeats the same example and returns more examples than the original list contains.

Give the antonym of every input

Input: sunny
Output: gloomy

Input: windy
Output: calm

Input: rainny
Output:

Below is the output with Chroma when k=10:

Give the antonym of every input

Input: sunny
Output: gloomy

Input: sunny
Output: gloomy

Input: sunny
Output: gloomy

Input: sunny
Output: gloomy

Input: sunny
Output: gloomy

Input: sunny
Output: gloomy

Input: sunny
Output: gloomy

Input: sunny
Output: gloomy

Input: sunny
Output: gloomy

Input: windy
Output: calm

Input: rainny
Output:

System Info

System Information

OS: Linux
OS Version: #1 SMP PREEMPT_DYNAMIC Sat Nov 18 15:31:17 UTC 2023
Python Version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]

Package Information

langchain_core: 0.1.18
langchain: 0.1.5
langchain_community: 0.0.17
langchain_google_vertexai: 0.0.3

Packages not installed (Not Necessarily a Problem)

The following packages were not found:

langgraph
langserve

🤖

Hey @zlnguyen, great to see you diving into some more LangChain intricacies! How's everything going on your end?

Thank you for providing such a detailed report. The issue appears to involve the SemanticSimilarityExampleSelector class when it is used with the Chroma vector store: with k=2 the sunny example is returned twice, and with k=10 the selector returns ten examples even though only six were supplied.

This behavior is likely related to what the Chroma vector store returns from a similarity search. When the similarity_search method is called, it retrieves the k most similar documents in the store. If the store holds several copies of the same example, that example can be returned multiple times.
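To make that mechanism concrete, here is a toy illustration in plain Python. It is not Chroma's implementation: `ToyStore`, its `add_texts` and `similarity_search` methods, and the letter-count "embedding" are all made up for this sketch. It only shows that a top-k search ranks whatever entries are stored, so in this toy model the same text comes back twice once it has been added twice.

```python
import math
from collections import Counter


def embed(text):
    """Hypothetical stand-in for an embedding model: letter counts."""
    return Counter(text.lower())


def cosine(a, b):
    """Cosine similarity between two letter-count vectors."""
    dot = sum(a[ch] * b[ch] for ch in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyStore:
    """Minimal in-memory store; adding a text twice stores it twice."""

    def __init__(self):
        self.entries = []

    def add_texts(self, texts):
        self.entries.extend((text, embed(text)) for text in texts)

    def similarity_search(self, query, k):
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


words = ["happy", "tall", "energetic", "sunny", "slow", "windy"]

store = ToyStore()
store.add_texts(words)
once = store.similarity_search("rainny", k=2)

store.add_texts(words)  # the same six texts are added a second time
twice = store.similarity_search("rainny", k=2)

print(once)   # ['sunny', 'windy']
print(twice)  # ['sunny', 'sunny']
```

Whether something similar is happening in the reported setup is an open question; the toy only shows that duplicated results are consistent with duplicated stored entries.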

One way to confirm this is to call the Chroma vector store's similarity_search method directly: create a Chroma vector store from your examples, then run a similarity search with k set to a value greater than the number of unique examples. If the same example is returned multiple times, the duplication originates in the vector store rather than in the example selector.

Here's an example of how you could do this:

from langchain_google_vertexai import VertexAIEmbeddings
from langchain_community.vectorstores import Chroma

examples = [
    {"input": "happy", "output": "sad"},
    {"input": "tall", "output": "short"},
    {"input": "energetic", "output": "lethargic"},
    {"input": "sunny", "output": "gloomy"},
    {"input": "slow", "output": "fast"},    
    {"input": "windy", "output": "calm"},
]

# Convert each example to a string by joining its values in sorted-key order
string_examples = [" ".join(eg[key] for key in sorted(eg)) for eg in examples]

# Create a Chroma vector store with the examples
vectorstore = Chroma.from_texts(
    string_examples, 
    VertexAIEmbeddings("textembedding-gecko@001"), 
    metadatas=examples
)

# Perform a similarity search with k set to a value greater than the number of unique examples
query = "rainny"
k = 10
example_docs = vectorstore.similarity_search(query, k=k)

# Print the returned examples
for doc in example_docs:
    print(doc.metadata)
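If the direct search also returns duplicates, one thing worth ruling out is that the store holds more documents than were just passed in. This is a hypothesis, not something established in this thread: Chroma.from_texts writes to a default collection name unless `collection_name` is given, so rebuilding the selector several times in one session (for example, by re-running a notebook cell) could keep adding the same six examples to one collection. The sketch below gives each selector its own collection. `fresh_collection_name` and `build_selector` are hypothetical helper names, and the snippet assumes that `from_examples` forwards extra keyword arguments to `Chroma.from_texts`.

```python
import uuid


def fresh_collection_name(prefix="examples"):
    """Hypothetical helper: build a collection name that is unique per call."""
    return f"{prefix}-{uuid.uuid4().hex}"


def build_selector(examples, embeddings, k=2):
    """Hypothetical helper: create a selector backed by its own Chroma collection.

    Imports are local so that this module loads even without LangChain installed.
    Assumption: extra keyword arguments reach Chroma.from_texts.
    """
    from langchain_community.vectorstores import Chroma
    from langchain_core.example_selectors import SemanticSimilarityExampleSelector

    return SemanticSimilarityExampleSelector.from_examples(
        examples,
        embeddings,
        Chroma,
        k=k,
        collection_name=fresh_collection_name(),
    )
```

If the duplicates disappear with a fresh collection per selector, that would point at accumulated documents rather than at the search itself.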

If the issue is indeed with the Chroma vector store's handling of similarity searches, then a potential solution could be to modify the SemanticSimilarityExampleSelector class to ensure that it only returns unique examples. However, this would likely require changes to the LangChain codebase.
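As a stopgap that does not touch the LangChain codebase, duplicates could also be filtered out after selection. The helper below is a minimal sketch: `dedupe_examples` is a hypothetical name, and it assumes each example is a flat dict of hashable values, as in the report. It treats the symptom only, so fewer than k distinct examples may remain after filtering.

```python
def dedupe_examples(examples):
    """Return examples with exact duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for example in examples:
        # A dict is not hashable, so use its sorted items as the identity key.
        key = tuple(sorted(example.items()))
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique


# Stand-in for what a selector returned in the k=2 case from the report.
selected = [
    {"input": "sunny", "output": "gloomy"},
    {"input": "sunny", "output": "gloomy"},
]
print(dedupe_examples(selected))  # [{'input': 'sunny', 'output': 'gloomy'}]
```

With a real selector, the input list would come from its select_examples method, and the filtered list could then be passed to the prompt as fixed examples.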

I hope this helps! If you have any further questions or need additional clarification, please don't hesitate to ask.
