langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License
92.22k stars 14.72k forks

Chroma search with vector and search with text get different result using the same embedding function #25517

Open alphrc opened 3 weeks ago

alphrc commented 3 weeks ago

Example Code

import os

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

embedding_function = OpenAIEmbeddings(model="text-embedding-3-large", api_key=os.getenv("OPENAI_API_KEY"))
db = Chroma(persist_directory=directory, embedding_function=embedding_function, collection_name='default')

query = "Some text for query"

# Search with text
results_text = db.similarity_search(query)

# Search with vector
vector = embedding_function.embed_documents([query])[0]
results_vector = db.similarity_search_with_vector(vector)

Error Message and Stack Trace (if applicable)

No response

Description

With the same embedding function, searching by text and searching by the embedding vector of that same text return different results.

System Info

langchain==0.2.11
langchain-chroma==0.1.2
langchain-community==0.2.10
langchain-core==0.2.23
langchain-openai==0.1.17
langchain-text-splitters==0.2.2

max

python 3.9.6

alphrc commented 3 weeks ago

The issue remains even if you use embed_query instead of embed_documents.
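
One way to narrow this down is to check whether the two embedding calls actually return the same vector for the same text. The sketch below uses only the standard library; `cosine_similarity` and `max_abs_diff` are hypothetical helper names, and the two sample vectors are made-up stand-ins for the outputs of embed_query(query) and embed_documents([query])[0], not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def max_abs_diff(a, b):
    """Return the largest element-wise absolute difference."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


# Made-up stand-ins (assumption): in the real setup these would come from
# embedding_function.embed_query(query) and
# embedding_function.embed_documents([query])[0].
vec_query = [0.12, -0.48, 0.31, 0.77]
vec_docs = [0.12, -0.48, 0.31, 0.77]

print("cosine similarity:", cosine_similarity(vec_query, vec_docs))
print("max abs difference:", max_abs_diff(vec_query, vec_docs))
```

If the real vectors differ noticeably, the discrepancy is on the embedding side; if they match, the difference would have to come from how the two searches are issued.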

jhaayush2004 commented 3 weeks ago

This does not look like a bug in LangChain. I tried both similarity_search and similarity_search_by_vector with HuggingFaceEmbeddings in LangChain and got exactly the same results from both.

# `texts` (the documents to index) and `huggingface_embeddings` are defined elsewhere.
db = Chroma.from_documents(texts, huggingface_embeddings)
query = "what happened to harry's parents ?"

# Search with text
results_text = db.similarity_search(query)

# Search with vector
vector = huggingface_embeddings.embed_documents([query])[0]
results_vector = db.similarity_search_by_vector(vector)

# Compare the two result lists
if results_text == results_vector:
    print("True")
else:
    print("False")

The output was "True".

Possible causes of the problem:

  1. There is no method named similarity_search_with_vector in the langchain_chroma integration, so replace it with similarity_search_by_vector.
  2. There might be an issue in the langchain_openai integration, but that is unlikely. I can't test it because I don't have access to the paid OpenAI API.

Fixing the first point will most likely solve your problem. Have a good day!
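
The reasoning behind the comparison above can be illustrated with a toy in-memory store: if a text search just embeds the query and then delegates to the vector search, both calls must return the same results whenever the embedding function is deterministic. This is a minimal sketch under that assumption, not Chroma's or LangChain's actual implementation; `toy_embed` and `ToyVectorStore` are hypothetical names introduced only for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List


def toy_embed(text: str, dim: int = 8) -> List[float]:
    """Hypothetical deterministic embedding: a normalized character histogram."""
    vec = [0.0] * dim
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class ToyVectorStore:
    """Toy stand-in for a vector store (assumption, not a real library class)."""

    embed: Callable[[str], List[float]]
    texts: List[str] = field(default_factory=list)
    vectors: List[List[float]] = field(default_factory=list)

    def add_texts(self, texts: List[str]) -> None:
        for text in texts:
            self.texts.append(text)
            self.vectors.append(self.embed(text))

    def similarity_search_by_vector(self, vector: List[float], k: int = 2) -> List[str]:
        # Score by dot product; vectors are unit length, so this is cosine similarity.
        scored = [
            (sum(a * b for a, b in zip(vector, stored)), index)
            for index, stored in enumerate(self.vectors)
        ]
        # Highest score first; ties broken by insertion order for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [self.texts[index] for _, index in scored[:k]]

    def similarity_search(self, query: str, k: int = 2) -> List[str]:
        # Text search embeds the query, then reuses the vector search.
        return self.similarity_search_by_vector(self.embed(query), k=k)


store = ToyVectorStore(embed=toy_embed)
store.add_texts([
    "Harry's parents were killed by Voldemort.",
    "Hogwarts is a school of witchcraft and wizardry.",
    "Quidditch is played on broomsticks.",
])

query = "what happened to harry's parents ?"
results_text = store.similarity_search(query)
results_vector = store.similarity_search_by_vector(toy_embed(query))
print(results_text == results_vector)
```

With a remote embedding API, the same text may not map to bit-identical vectors on every call, so comparing the vectors themselves is a useful first check before suspecting the store.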