huggingface / text-embeddings-inference

A blazing fast inference solution for text embeddings models
https://huggingface.co/docs/text-embeddings-inference/quick_tour
Apache License 2.0

the significant figures of embedding #234

Closed powerpistn closed 5 months ago

powerpistn commented 7 months ago

System Info

text-embeddings-inference


Reproduction

Method 1: I deploy the service as follows:

model=/workspace/bge-m3
revision=refs/pr/5
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.1 --model-id $model --revision $revision

The request is:

curl 127.0.0.1:8080/embed     -X POST     -d '{"inputs": "你好"}'     -H 'Content-Type: application/json'

The result (result_1) is:

[[-0.03707749,0.0060151797,-0.06545135,......]]

Method 2: use Python code

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('/workspace/bge-m3',use_fp16=True,device='cuda:0') # Setting use_fp16 to True speeds up computation with a slight performance degradation

sentences_1 = ["你好"]
embeddings_1 = model.encode(sentences_1,
                            batch_size=12,
                            max_length=8192, # If you don't need such a long length, you can set a smaller value to speed up the encoding process.
                            )['dense_vecs']
print(embeddings_1.tolist())

The result (result_2) is:

[[-0.03717041015625, 0.00618743896484375, -0.06524658203125,............]]

The significant figures of result_1 and result_2 are different.

result_1:[[-0.03707749,0.0060151797,-0.06545135,......]]

result_2:[[-0.03717041015625, 0.00618743896484375, -0.06524658203125,............]]

I want to know whether it is possible to reach the precision of result_2 using text-embeddings-inference.

Expected behavior

Use text-embeddings-inference so that result_1 ([[-0.03707749,0.0060151797,-0.06545135,......]]) reaches the accuracy of result_2 ([[-0.03717041015625, 0.00618743896484375, -0.06524658203125,............]]).
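A side note not from the original report: the longer decimal strings in result_2 do not by themselves indicate higher precision. BGEM3FlagModel was loaded with use_fp16=True, and the quoted result_2 values appear to be exactly representable in float16, so their long decimal expansions are just the exact printout of float16 values. A minimal sketch of that check, using only the three leading components quoted above and assuming numpy is installed:

import numpy as np

# First three components quoted above (the rest are elided in the issue).
result_1 = np.array([-0.03707749, 0.0060151797, -0.06545135], dtype=np.float32)
result_2 = np.array([-0.03717041015625, 0.00618743896484375, -0.06524658203125], dtype=np.float32)

# result_2 survives a round trip through float16 unchanged, i.e. its long decimal
# strings are the exact decimal expansion of float16 values.
print(np.array_equal(result_2.astype(np.float16).astype(np.float32), result_2))  # True

# On these components the two outputs agree to roughly 2e-4, consistent with the
# ~1e-3 tolerance mentioned later in the thread.
print(np.max(np.abs(result_1 - result_2)))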

vrdn-23 commented 6 months ago

@OlivierDehaene I'm having the same issue: embeddings generated locally and via the text-embeddings-router differ slightly in their significant digits. Is this the expected behavior?

I'm seeing the issue with the sentence-transformers/all-MiniLM-L6-v2 model, so I'm not sure whether this is a by-product of not using torch?
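For reference, a minimal sketch of the kind of local-vs-router comparison described above, assuming TEI is serving sentence-transformers/all-MiniLM-L6-v2 on 127.0.0.1:8080 and that the requests and sentence_transformers packages are installed (none of the values come from this thread):

import requests
import numpy as np
from sentence_transformers import SentenceTransformer

text = "hello world"

# Embedding computed locally with native torch via sentence-transformers.
local = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2").encode([text])[0]

# Embedding returned by the text-embeddings-router /embed endpoint.
remote = np.array(requests.post("http://127.0.0.1:8080/embed", json={"inputs": text}).json()[0])

# Element-wise gaps are expected to be small but non-zero; the cosine similarity
# between the two vectors is the more meaningful stability check.
print(np.max(np.abs(local - remote)))
print(float(local @ remote) / (np.linalg.norm(local) * np.linalg.norm(remote)))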

chiragjn commented 5 months ago

Bump on this. We have the same issue. We understand that TEI uses its own custom kernels to accelerate inference; however, is there a way to control the margin of difference? Or is this an unexpected bug?

OlivierDehaene commented 5 months ago

Differences on the order of 1e-3 are expected. What matters most for embeddings is whether the distance you want to use is stable between devices. For example, with the following GPU and CPU embeddings, using the cosine distance:

import torch

# CPU and GPU embeddings of the same input.
cpu = torch.tensor([-0.070025794, 0.021128502, -0.023149645, 0.0442686, 0.03126164, 0.0050532944, -0.0017524747, -0.003981021, -0.01252058, 0.0014706801])
gpu = torch.tensor([-0.069965556, 0.021150552, -0.02317143, 0.044223882, 0.031274565, 0.005071816, -0.0017547797, -0.0039755385, -0.012537294, 0.0014445356])

# The individual values differ, but the cosine similarity between them is stable.
torch.nn.functional.cosine_similarity(cpu, gpu, dim=0)
# tensor(1.0000)
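For completeness, the element-wise gap between those same two vectors can be checked directly; a small follow-up sketch using the cpu and gpu tensors defined above:

# Largest element-wise difference between the CPU and GPU embeddings above.
print((cpu - gpu).abs().max())
# roughly 6e-5, while the cosine similarity stays ~1.0000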
chiragjn commented 5 months ago

Unfortunately, for us the differences were quite significant when compared to naive inference with sentence-transformers.

We don't expect exact precision across devices, but the numbers should at least be close on the same device when compared to native torch. Maybe the models we use are so sensitive that small deviations throw off our retrieval rankings.


Furthermore, it is entirely plausible that a RAG pipeline uses GPUs for fast batch embedding during indexing but a CPU for infrequent online embedding. Major deviations across devices also become a problem in such cases.
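One way to quantify whether such cross-device deviations actually matter for retrieval is to check whether the top-k ranking survives a perturbation of the size observed above; a hypothetical sketch with synthetic vectors (nothing below comes from the thread):

import numpy as np

def top_k(query_vec, doc_vecs, k=10):
    # Rank documents by cosine similarity to the query (all vectors L2-normalized).
    return np.argsort(-(doc_vecs @ query_vec))[:k]

rng = np.random.default_rng(0)

# Stand-ins for GPU-indexed documents and a CPU-embedded query: the "CPU" query is the
# "GPU" query plus a small perturbation on the order of the differences discussed above.
docs = rng.normal(size=(1000, 384)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

query_gpu = rng.normal(size=384).astype(np.float32)
query_gpu /= np.linalg.norm(query_gpu)
query_cpu = query_gpu + rng.normal(scale=1e-3, size=384).astype(np.float32)
query_cpu /= np.linalg.norm(query_cpu)

# If the two top-k lists differ, the deviation is large enough to change retrieval results.
print(np.array_equal(top_k(query_gpu, docs), top_k(query_cpu, docs)))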