Closed zlnguyen closed 4 months ago
🤖
Hey @zlnguyen, great to see you diving into some more LangChain intricacies! How's everything going on your end?
Thank you for providing such a detailed report. It seems like the issue you're experiencing is related to the SemanticSimilarityExampleSelector class when used with the Chroma vector store. Specifically, it's returning duplicate examples when the number of examples (k) is set to a value greater than the number of unique examples in the vector store.
This behavior is likely due to how the Chroma vector store handles similarity searches. When the similarity_search method is called, it retrieves the k most similar examples from the vector store. If there are fewer unique examples than k, it's possible that the same example could be returned multiple times.
One way to confirm this would be to check the behavior of the Chroma vector store's similarity_search method directly. You could do this by creating a Chroma vector store from your examples and then performing a similarity search with k set to a value greater than the number of unique examples. If the same example is returned multiple times, then this would confirm that the issue is with the Chroma vector store's handling of similarity searches.
Here's an example of how you could do this:
# sorted_values returns a dict's values ordered by key (helper used by the example selector)
from langchain_core.example_selectors.semantic_similarity import sorted_values
from langchain_google_vertexai import VertexAIEmbeddings
from langchain_community.vectorstores import Chroma

examples = [
    {"input": "happy", "output": "sad"},
    {"input": "tall", "output": "short"},
    {"input": "energetic", "output": "lethargic"},
    {"input": "sunny", "output": "gloomy"},
    {"input": "slow", "output": "fast"},
    {"input": "windy", "output": "calm"},
]

# Convert each example to a single string, e.g. "happy sad"
string_examples = [" ".join(sorted_values(eg)) for eg in examples]

# Create a Chroma vector store with the examples
vectorstore = Chroma.from_texts(
    string_examples,
    VertexAIEmbeddings(model_name="textembedding-gecko@001"),
    metadatas=examples,
)

# Perform a similarity search with k set to a value greater than the number of unique examples
query = "rainny"
k = 10
example_docs = vectorstore.similarity_search(query, k=k)

# Print the returned examples
for doc in example_docs:
    print(doc.metadata)
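To make repeats easy to spot in that output, the returned metadata can be tallied. This is a minimal sketch, assuming each result's metadata is a flat dict of strings like the examples above. count_duplicates is a hypothetical helper, not part of LangChain or Chroma, and the returned list below is a stand-in for [doc.metadata for doc in example_docs].

```python
from collections import Counter


def count_duplicates(metadatas):
    """Return {example_as_sorted_tuple: count} for examples seen more than once."""
    # Dicts are not hashable, so each one is converted to a sorted tuple of items.
    counts = Counter(tuple(sorted(m.items())) for m in metadatas)
    return {key: n for key, n in counts.items() if n > 1}


# Stand-in data (assumption): replace with [doc.metadata for doc in example_docs]
returned = [
    {"input": "sunny", "output": "gloomy"},
    {"input": "sunny", "output": "gloomy"},
    {"input": "windy", "output": "calm"},
]
print(count_duplicates(returned))
```

If the tally is non-empty and the search returns more results than the six original examples, the collection itself probably holds repeated entries, since a top-k search cannot return more records than the store contains.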
If the issue is indeed with the Chroma vector store's handling of similarity searches, then a potential solution could be to modify the SemanticSimilarityExampleSelector class to ensure that it only returns unique examples. However, this would likely require changes to the LangChain codebase.
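In the meantime, duplicates can be filtered on the caller's side without touching the library. This is a minimal sketch, assuming the examples are flat dicts with hashable values. unique_examples is a hypothetical helper, not an existing LangChain function.

```python
def unique_examples(examples):
    """Drop repeated example dicts, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for example in examples:
        # Dicts are not hashable, so a sorted tuple of items serves as the identity key.
        key = tuple(sorted(example.items()))
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique


# Stand-in data (assumption): in practice this would be the selector's output
selected = [
    {"input": "sunny", "output": "gloomy"},
    {"input": "windy", "output": "calm"},
    {"input": "sunny", "output": "gloomy"},
]
print(unique_examples(selected))
```

The helper would be applied to whatever the selector's select_examples call returns. Note that filtering afterwards can leave fewer than k examples, so it hides the symptom and does not explain why the store returned repeats.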
I hope this helps! If you have any further questions or need additional clarification, please don't hesitate to ask.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Checked other resources
Example Code
The following code produces duplicated examples of sunny. The output is correct when using FAISS. The output looks like the below:
Error Message and Stack Trace (if applicable)
No response
Description
Below is the output using FAISS. I do not expect the output to be the same, but at least it should not contain duplicates. If we put k=10, it duplicates the examples and gives more examples than the original list.
Below is the output when k=10
System Info
System Information
Package Information
Packages not installed (Not Necessarily a Problem)
The following packages were not found: