qdrant / fastembed

Fast, Accurate, Lightweight Python library to make State of the Art Embedding
https://qdrant.github.io/fastembed/
Apache License 2.0

[Bug]: Changing fastembed version but with the same embedder I have different vector with the same text #373

Open giovannialbero1992 opened 4 weeks ago

giovannialbero1992 commented 4 weeks ago

What happened?

I have two environments, one with fastembed version 0.3.4 and another with version 0.1.3. With the same embedder and the same text, the two versions produce different vectors. The embedder used is: https://huggingface.co/intfloat/multilingual-e5-large

What Python version are you on? e.g. python --version

python 3.10.11

Version

0.2.7 (Latest)

What os are you seeing the problem on?

Linux

Relevant stack traces and/or logs

No response

giovannialbero1992 commented 4 weeks ago

Here is a step-by-step guide to reproduce what I'm observing.

Step 1

Run a Docker container with Python 3.10.11

docker run -d -i -t python:3.10 bash

Step 2

Enter the Docker container, using the container ID shown by docker ps

docker exec -ti <CONTAINER ID> bash 

Step 3

Install vim

apt update && apt install vim

Step 4

Create a Python file named embedder.py and insert this code

from langchain_community.embeddings import FastEmbedEmbeddings

embedder = FastEmbedEmbeddings(model_name="intfloat/multilingual-e5-large")

text = "Hello world"
embedding = embedder.embed_query(text)
print(embedding)
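Since the comparison depends on knowing which fastembed release produced which vector, it can help to print the installed versions next to the output. This is a minimal sketch using the standard library's importlib.metadata; the helper name installed_version is made up for this example, and it returns None instead of failing when a package is not installed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Print the versions of the packages involved in this reproduction.
for name in ("fastembed", "langchain", "langchain-core"):
    print(name, installed_version(name))
```

Running this before python embedder.py in each environment makes it easier to attribute each vector to a specific release.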

Step 5

Install dependencies

pip install langchain_core==0.1.22
pip install langchain==0.1.4
pip install fastembed==0.1.3

Step 6

Run the script and get the result

python embedder.py

First part of the vector

[0.024819795042276382, -0.023618297651410103, -0.006692419294267893, -0.04708532989025116, 0.0343518927693367, -0.026183584704995155, -0.029025807976722717, 0.041693683713674545, 0.060204412788152695, -0.015606507658958435, 0.02012583799660206, 0.03693017736077309, ...

Step 7

Upgrade the fastembed version

pip install fastembed==0.3.4

Step 8

Run the script and get the result

python embedder.py

First part of the vector

[-0.005152239464223385, 0.005240725819021463, 0.008123699575662613, -0.039657339453697205, 0.009418696165084839, -0.035511959344148636, -0.04110070690512657, 0.03789035230875015, 0.05153501033782959, -0.024316389113664627, 0.037706244736909866, 0.019727017730474472, ...

Step 9

Compare the result
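To put a number on this comparison, one option is the cosine similarity between the two outputs. The sketch below is an illustration, not part of the original reproduction: it uses only the first 12 components printed in steps 6 and 8 (the full vectors are truncated in this thread), so the printed value says nothing certain about the full vectors. The names v_old, v_new, cosine_similarity and max_abs_diff are made up for the example.

```python
import math

# First 12 components printed in steps 6 and 8; the remaining values are truncated.
v_old = [0.024819795042276382, -0.023618297651410103, -0.006692419294267893,
         -0.04708532989025116, 0.0343518927693367, -0.026183584704995155,
         -0.029025807976722717, 0.041693683713674545, 0.060204412788152695,
         -0.015606507658958435, 0.02012583799660206, 0.03693017736077309]
v_new = [-0.005152239464223385, 0.005240725819021463, 0.008123699575662613,
         -0.039657339453697205, 0.009418696165084839, -0.035511959344148636,
         -0.04110070690512657, 0.03789035230875015, 0.05153501033782959,
         -0.024316389113664627, 0.037706244736909866, 0.019727017730474472]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def max_abs_diff(a, b):
    """Largest component-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


print("cosine similarity (first 12 components):", cosine_similarity(v_old, v_new))
print("max absolute difference:", max_abs_diff(v_old, v_new))
```

For a real check, replace the two lists with the full vectors saved from each environment.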

I8dNLo commented 3 weeks ago

Reproduced for me, but my last output is: [-0.0010747660417109728, -0.0015742044197395444, 0.01378690730780363, -0.03357434272766113, 0.0050786384381353855 ...

The first output matches exactly.

I8dNLo commented 3 weeks ago

Yep, times change: you are looking at a very early release.

Since release 0.2.0 the behavior has stayed as it is now. Please use a current version of fastembed.

giovannialbero1992 commented 3 weeks ago

Thanks @I8dNLo for the test. I don't know why your last output differs from mine, but you see a difference between the two versions either way.

I checked the code and saw that the previous version prepended query: to the text before embedding the entire query.
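One way to test this explanation is to embed both the raw text and the text with the prefix prepended under the newer version, then see which one reproduces the old vector. The sketch below is hedged: the helper names are hypothetical, embed_fn stands for whatever embedding call you use, and toy_embed exists only so the snippet runs without downloading a model. The exact prefix string "query: " (with a trailing space) is an assumption to verify against the 0.1.3 source.

```python
import math
from typing import Callable, Dict, Sequence

QUERY_PREFIX = "query: "  # assumption: verify the exact string in the old source


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def compare_prefix_effect(
    embed_fn: Callable[[str], Sequence[float]],
    text: str,
    reference: Sequence[float],
) -> Dict[str, float]:
    """Compare a reference vector with embeddings of the raw and prefixed text."""
    raw = embed_fn(text)
    prefixed = embed_fn(QUERY_PREFIX + text)
    return {
        "raw": cosine_similarity(raw, reference),
        "prefixed": cosine_similarity(prefixed, reference),
    }


def toy_embed(text: str) -> Sequence[float]:
    """Stand-in embedder based on character counts. NOT a real model."""
    return [float(len(text)), float(text.count("q")), float(text.count(" ")) + 1.0]


# Simulate an "old" vector that was produced from the prefixed text.
reference = toy_embed(QUERY_PREFIX + "Hello world")
print(compare_prefix_effect(toy_embed, "Hello world", reference))
```

With the real model, pass embedder.embed_query from embedder.py as embed_fn under fastembed 0.3.4, and use the full vector saved from the 0.1.3 run as reference. If the "prefixed" score is close to 1.0 and the "raw" score is not, the prefix accounts for the change.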

giovannialbero1992 commented 3 weeks ago

Unfortunately, the update is disruptive for my RAG system because it produces different results. I will need to plan a migration.