langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

mxbai-embed-large embedding not consistent with original paper #24357

Open jeugregg opened 2 months ago

jeugregg commented 2 months ago

Example Code

from langchain_community.embeddings import OllamaEmbeddings
from sentence_transformers.util import cos_sim
import numpy as np
from numpy.testing import assert_almost_equal
# definitions
ollama_emb = OllamaEmbeddings(model='mxbai-embed-large')

# test on ollama
query = 'Represent this sentence for searching relevant passages: A man is eating a piece of bread'

docs = [
    query,
    "A man is eating food.",
    "A man is eating pasta.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
]

r_1 = ollama_emb.embed_documents(docs)

# Calculate cosine similarity
similarities = cos_sim(r_1[0], r_1[1:])
print(similarities.numpy()[0])
print("to be compared to :\n [0.7920, 0.6369, 0.1651, 0.3621]")
try:
    assert_almost_equal(similarities.numpy()[0], np.array([0.7920, 0.6369, 0.1651, 0.3621]), decimal=2)
    print("TEST 1 : OLLAMA PASSED.")
except AssertionError:
    print("TEST 1 : OLLAMA FAILED.")

Error Message and Stack Trace (if applicable)

No response

Description

The test fails. It passes with Ollama directly, but not with Ollama through LangChain. It also passes with Llamafile through LangChain. The issue seems to be the same as the one reported here: https://github.com/ollama/ollama/issues/4207. Why is it not fixed in LangChain?

System Info

System Information

OS: Darwin
OS Version: Darwin Kernel Version 23.5.0: Wed May 1 20:13:18 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6030
Python Version: 3.10.4 (main, Mar 31 2022, 03:37:37) [Clang 12.0.0 ]

Package Information

langchain_core: 0.2.20
langchain: 0.2.8
langchain_community: 0.2.7
langsmith: 0.1.88
langchain_chroma: 0.1.1
langchain_text_splitters: 0.2.2

ollama : 0.2.1

Packages not installed (Not Necessarily a Problem)

The following packages were not found:

langgraph
langserve

jeugregg commented 2 months ago

Actually, the issue is that LangChain sets an option by default: embed_instruction: str = "passage: ". This prefix is prepended to every document, and it breaks the results with 'mxbai-embed-large'. So, to pass the test, we need to set it to an empty string when declaring the model:

ollama_emb = OllamaEmbeddings(model="mxbai-embed-large", embed_instruction="")
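To make the effect of the default visible, here is a minimal sketch of what the model ends up receiving. It assumes the instruction is applied by plain string concatenation in front of each document; build_document_prompts is a hypothetical helper written for illustration, not a LangChain function.

```python
# Default value of embed_instruction as reported in this thread.
DEFAULT_EMBED_INSTRUCTION = "passage: "


def build_document_prompts(texts, embed_instruction=DEFAULT_EMBED_INSTRUCTION):
    """Hypothetical helper: prefix each document with the instruction.

    Assumption: the wrapper simply concatenates instruction and text.
    """
    return [f"{embed_instruction}{text}" for text in texts]


docs = ["A man is eating food.", "A man is eating pasta."]

# With the default, every document is silently changed before embedding.
with_default = build_document_prompts(docs)

# With embed_instruction="", the documents reach the model unchanged.
with_fix = build_document_prompts(docs, embed_instruction="")

print(with_default[0])
print(with_fix[0])
```

If that assumption holds, the reference similarities can only be reproduced once the prefix is empty, because they were computed on the raw sentences.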

I don't know whether this is the right setting to get accurate results. What do you think we should use? I am going to try the following:

The mxbai-embed-large blog says to use:

- for embedding docs: embed_instruction = ""
- for the query: query_instruction = "Represent this sentence for searching relevant passages: "

It works well with that.
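With those two settings, the query no longer needs the prompt typed in by hand and can go through embed_query, while the passages go through embed_documents. A minimal sketch of that flow follows, assuming only that the embedder object exposes those two methods; score_passages and _StubEmbedder are hypothetical names used for illustration, and the stub returns made-up vectors so the example runs without an Ollama server.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def score_passages(embedder, query, passages):
    """Hypothetical helper: score each passage against the query.

    The query goes through embed_query (where query_instruction applies)
    and the passages through embed_documents (where embed_instruction
    applies), so no prompt has to be added to the query text by hand.
    """
    query_vec = embedder.embed_query(query)
    passage_vecs = embedder.embed_documents(passages)
    return [cosine(query_vec, vec) for vec in passage_vecs]


class _StubEmbedder:
    """Stand-in with fixed, made-up vectors; not a real model."""

    def embed_query(self, text):
        return [1.0, 0.0]

    def embed_documents(self, texts):
        return [[1.0, 0.0], [0.0, 1.0]][: len(texts)]


scores = score_passages(
    _StubEmbedder(),
    "A man is eating a piece of bread",
    ["A man is eating food.", "A man is riding a horse."],
)
print(scores)
```

To try it against the real model, the stub would be replaced by OllamaEmbeddings(model="mxbai-embed-large", embed_instruction="", query_instruction="Represent this sentence for searching relevant passages: "), with the query passed as the bare sentence.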