Closed · powerpistn closed 5 months ago
@OlivierDehaene I'm having the same issue: embeddings generated locally and via the text-embeddings-router differ slightly in their significant digits. Is this the expected behavior? I'm seeing it with the sentence-transformers/all-MiniLM-L6-v2 model, so I'm not sure whether this is a by-product of not using torch?
Bump on this, we have the same issue. We understand that TEI uses its own custom kernels to accelerate inference; however, is there a way to control the margin of difference? Or is this an unexpected bug?
1e-3 differences are expected. What matters most for embeddings is whether the distance you want to use is stable between devices. For example, with the following GPU and CPU embeddings, using the cosine distance:
import torch

# First 10 components of the same embedding, computed on CPU and on GPU.
cpu = torch.tensor([-0.070025794, 0.021128502, -0.023149645, 0.0442686, 0.03126164, 0.0050532944, -0.0017524747, -0.003981021, -0.01252058, 0.0014706801])
gpu = torch.tensor([-0.069965556, 0.021150552, -0.02317143, 0.044223882, 0.031274565, 0.005071816, -0.0017547797, -0.0039755385, -0.012537294, 0.0014445356])

# Individual components differ by ~1e-4, but the distance is stable:
torch.nn.functional.cosine_similarity(cpu, gpu, dim=0)
# tensor(1.0000)
Unfortunately for us the differences were quite significant when compared to native inference with sentence-transformers. We don't expect exact precision across devices, but the numbers should at least be close on the same device when compared to native torch. Maybe the models we use are so sensitive that small deviations throw off our retrieval rankings.
Furthermore, it is entirely plausible for a RAG pipeline to use GPUs for fast batch embedding during indexing but a CPU for infrequent online embedding. Major deviations across devices also become a problem in such cases.
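A quick way to check whether the deviations actually change retrieval order is to compare top-k rankings from the two pipelines. A minimal sketch, using synthetic placeholder embeddings (the 1e-3 noise stands in for the cross-device difference; in practice the tensors would come from TEI on GPU and from local CPU inference on the same texts):

import torch
import torch.nn.functional as F

def top_k_indices(query: torch.Tensor, docs: torch.Tensor, k: int = 3) -> torch.Tensor:
    # Rank documents by cosine similarity to the query.
    scores = F.cosine_similarity(docs, query.unsqueeze(0), dim=1)
    return scores.topk(k).indices

torch.manual_seed(0)
docs_cpu = torch.randn(100, 384)                    # all-MiniLM-L6-v2 embeddings are 384-dim
docs_gpu = docs_cpu + 1e-3 * torch.randn(100, 384)  # simulated cross-device deviation
query_cpu = torch.randn(384)
query_gpu = query_cpu + 1e-3 * torch.randn(384)

# If this prints False, the deviation is large enough to reorder results.
print(torch.equal(top_k_indices(query_cpu, docs_cpu),
                  top_k_indices(query_gpu, docs_gpu)))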
System Info
text-embeddings-inference
Reproduction
Method 1: I deploy the service using text-embeddings-inference and send an embedding request over HTTP; result_1 is [[-0.03707749, 0.0060151797, -0.06545135, ...]].
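A minimal sketch of this path, assuming TEI's default /embed route on port 8080 (the deployment flags and the input text here are placeholders, not the exact ones from my setup):

import requests

# Assumed deployment (placeholder), e.g. with the official container:
#   docker run --gpus all -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:latest \
#       --model-id sentence-transformers/all-MiniLM-L6-v2
resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": "What is Deep Learning?"},  # placeholder input text
)
result_1 = resp.json()  # a list of embedding vectors, one per input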
Method 2: I run the same model locally with Python code; result_2 is [[-0.03717041015625, 0.00618743896484375, -0.06524658203125, ...]].
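A minimal sketch of the local path, assuming sentence-transformers with default settings and the same placeholder input:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
result_2 = model.encode(["What is Deep Learning?"])  # placeholder input text
print(result_2)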
The significant figures of result_1 and result_2 differ. I want to know whether it is possible to reach the precision of result_2 using text-embeddings-inference.
Expected behavior
Use text-embeddings-inference so that result_1 ([[-0.03707749, 0.0060151797, -0.06545135, ...]]) reaches the precision of result_2 ([[-0.03717041015625, 0.00618743896484375, -0.06524658203125, ...]]).
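For reference, comparing just the three components quoted above shows the gap is on the order the maintainer described:

import torch

# First three components of result_1 and result_2 as quoted above.
r1 = torch.tensor([-0.03707749, 0.0060151797, -0.06545135])
r2 = torch.tensor([-0.03717041015625, 0.00618743896484375, -0.06524658203125])
print((r1 - r2).abs().max())  # ~2e-4, within the expected 1e-3 bound
print(torch.nn.functional.cosine_similarity(r1, r2, dim=0))  # ~1.0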