We used this benchmark for the PyTorch and ONNX numbers.
Be aware that the k6 load-testing script benchmarks the whole solution (tokenization + model forward + data H<->D transfers), while the fast-mteb benchmark only measures the model forward pass.
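For intuition, here is a minimal sketch of that difference (this is not the actual k6 or fast-mteb code; it assumes a CUDA GPU, the `torch` and `transformers` packages, and uses a naive mean pooling purely for illustration):

```python
import time

import torch
from transformers import AutoModel, AutoTokenizer

model_id = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).to("cuda").eval()

texts = ["some example sentence"] * 64

# Warm up CUDA kernels once so the timings below are comparable.
with torch.inference_mode():
    model(**tokenizer(texts, padding=True, return_tensors="pt").to("cuda"))

# End-to-end: tokenization + host-to-device copy + forward + device-to-host copy.
start = time.perf_counter()
batch = tokenizer(texts, padding=True, return_tensors="pt").to("cuda")
with torch.inference_mode():
    out = model(**batch).last_hidden_state.mean(dim=1).cpu()  # .cpu() also syncs
end_to_end = time.perf_counter() - start

# Forward pass only: inputs already on the GPU, output left on the GPU.
start = time.perf_counter()
with torch.inference_mode():
    _ = model(**batch)
torch.cuda.synchronize()
forward_only = time.perf_counter() - start

print(f"end-to-end: {end_to_end:.4f}s, forward only: {forward_only:.4f}s")
```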
Regarding numerical precision, it's always hard to evaluate the true impact on downstream tasks. I would advise users to run their own evals. We will also add support for bf16, which can help, but don't expect much from it.
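One reason not to expect much from bf16 on the precision side: bf16 keeps fp32's exponent range but only an 8-bit significand, i.e. roughly three significant decimal digits. A tiny illustration (values are approximate):

```python
import torch

x = torch.tensor([0.123456789, 1.23456789, 123.456789])
# Round-trip through bfloat16 to show how many digits survive.
print(x.to(torch.bfloat16).to(torch.float32))
# ≈ tensor([  0.1235,   1.2344, 123.5000])
```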
@OlivierDehaene I compared TEI against SentenceTransformers on batch inference.
Takeaways: np.testing.assert_almost_equal(local_resp, remote_resp, 4) fails, i.e. the TEI embeddings and the SentenceTransformers embeddings do not agree to 4 decimal places.
To reproduce, start up TEI:
docker run --gpus all -p 8080:80 --pull always ghcr.io/huggingface/text-embeddings-inference:0.2.2 --model-id sentence-transformers/all-MiniLM-L6-v2 --max-client-batch-size 2048 --max-concurrent-requests 64000
Small benchmark script
import timeit

import numpy as np
import requests
from sentence_transformers import SentenceTransformer

LIVE_URL = "http://localhost:8080"


def embedding_live_performance():
    sample = [f"Test count {i} {(list(range(i % (384))))} " for i in range(2048)]
    json_d = {"input": sample, "model": "model"}
    session = requests.Session()
    # req = session.get(f"{LIVE_URL}/models")
    # assert req.status_code == 200
    batch_size = 64  # req.json()["data"]["stats"]["batch_size"]
    model_name = "sentence-transformers/all-MiniLM-L6-v2"  # req.json()["data"]["id"]
    print(f"batch_size is {batch_size}")
    model = SentenceTransformer(model_name_or_path=model_name)

    def local(data: list):
        # Embed locally with SentenceTransformers.
        enc = model.encode(data, batch_size=batch_size)
        assert len(enc) == len(data)
        return enc

    def remote(json_data: dict):
        # Embed remotely via TEI's OpenAI-compatible route.
        req = session.post(f"{LIVE_URL}/openai", json=json_data)
        assert req.status_code == 200
        return req

    local_resp = local(sample)
    remote_resp = [d["embedding"] for d in remote(json_d).json()["data"]]
    np.testing.assert_almost_equal(local_resp, remote_resp, 4)  # TEI BREAKS HERE unless
    print("Both methods provide the identical output.")

    print("Measuring latency via SentenceTransformers")
    latency_st = timeit.timeit("local(sample)", number=10, globals=locals())
    print("SentenceTransformers latency: ", latency_st)
    # Drop the local model before timing TEI.
    model = None

    print("Measuring latency via requests to TEI")
    latency_request = timeit.timeit("remote(json_d)", number=10, globals=locals())
    print(f"Request latency: {latency_request}")

    assert latency_st * 1.1 > latency_request


if __name__ == "__main__":
    embedding_live_performance()
What matters in retrieval is not the absolute precision of the embedding vectors; it is the ranking of the distances between the vectors. Losing a few decimals of precision does not necessarily change that ranking.
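A toy sketch of that point, using random vectors as a stand-in for real embeddings: perturbing the vectors at roughly the 4th decimal place usually leaves the nearest-neighbour ranking untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
query = rng.normal(size=384)
docs = rng.normal(size=(1000, 384))


def cosine_rank(q, d):
    # Rank documents by cosine similarity to the query (best first).
    q = q / np.linalg.norm(q)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    return np.argsort(-(d @ q))


baseline = cosine_rank(query, docs)
perturbed = cosine_rank(query, docs + rng.normal(scale=1e-4, size=docs.shape))

# The top-10 ranking is typically identical despite the loss in absolute precision.
print((baseline[:10] == perturbed[:10]).all())
```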
Also, this issue is not specific to TEI. You will face the same issue with any graph-fusing engine; ONNX, Nvidia TensorRT, or Inferentia Neuron, to name a few, all make approximations in favour of speed.
In the case of TEI, it comes from the GELU fusing: cuBLASLt does not use exactly the same GELU approximation as the one originally used in Transformers. It would be possible to add an "exact" argument to disable the fusing.
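For concreteness, a sketch of the two GELU variants involved, assuming the reference is the erf-based GELU used by BERT-style models in Transformers and the fused kernel uses the common tanh approximation:

```python
import math

import torch


def gelu_exact(x):
    # erf-based GELU: x * Phi(x)
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    # tanh approximation commonly used by fused kernels
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))


x = torch.linspace(-4, 4, steps=10001)
# Maximum pointwise difference is small but nonzero (on the order of 1e-4),
# which is enough to break an assert_almost_equal at 4 decimals.
print((gelu_exact(x) - gelu_tanh(x)).abs().max())
```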
c.f. "Its always hard to evaluate the true impact on downstream tasks. I would advise users to run their own evals." and continue to run them in production!
Thanks for the details. This precision difference is probably acceptable, but I would suggest adding a PR that validates that, e.g., the cosine similarity between TEI embeddings and sentence-transformers embeddings is > 0.999 (compare to https://github.com/qdrant/fastembed/pull/54/files); a sketch of such a test follows below.
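A sketch of what such a test could look like (the endpoint and payload are taken from the benchmark script above; the threshold and test name are illustrative, not an existing TEI test):

```python
import numpy as np
import requests
from sentence_transformers import SentenceTransformer


def test_tei_matches_sentence_transformers():
    texts = ["This is a test sentence.", "Another, quite different sentence."]
    reference = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2").encode(texts)

    # Query the running TEI instance via its OpenAI-compatible route.
    resp = requests.post(
        "http://localhost:8080/openai",
        json={"input": texts, "model": "model"},
    )
    resp.raise_for_status()
    candidate = np.array([d["embedding"] for d in resp.json()["data"]])

    # Row-wise cosine similarity between reference and TEI embeddings.
    ref = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    cand = candidate / np.linalg.norm(candidate, axis=1, keepdims=True)
    cos_sim = (ref * cand).sum(axis=1)
    assert (cos_sim > 0.999).all(), cos_sim
```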
Feature request
@OlivierDehaene First - congrats on the product launch!
You mentioned a benchmark in the top line of the readme.
Motivation
I created a similar project, https://github.com/michaelfeil/infinity, and would like to reproduce the results and benchmark other libraries.