huggingface / text-embeddings-inference

A blazing fast inference solution for text embeddings models
https://huggingface.co/docs/text-embeddings-inference/quick_tour
Apache License 2.0

Script for reproducing the Benchmark #2

Closed michaelfeil closed 4 months ago

michaelfeil commented 1 year ago

Feature request

@OlivierDehaene First - congrats on the product launch!

You mentioned a benchmark in the top line of the readme.

Motivation

I created a similar project - https://github.com/michaelfeil/infinity - and would like to reproduce the results and benchmark other libraries.

OlivierDehaene commented 1 year ago

We used this benchmark for the PyTorch and ONNX numbers.

Be aware that the k6 load testing script is benchmarking the whole solution (tokenization + model forward + data H<->D) while the fast-mteb benchmark only benchmarks the model forward.

Regarding the numerical precision, it's always hard to evaluate the true impact on downstream tasks. I would advise users to run their own evals. We will also add support for bf16, which can help, but don't expect much from it.

michaelfeil commented 1 year ago

@OlivierDehaene I compared TEI against sentence-transformers on batch inference.

Takeaways:

To reproduce start up TEI:

docker run --gpus all -p 8080:80 --pull always ghcr.io/huggingface/text-embeddings-inference:0.2.2 --model-id sentence-transformers/all-MiniLM-L6-v2 --max-client-batch-size 2048 --max-concurrent-requests 64000
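Before running the benchmark, a quick smoke test that the container answers can help (a minimal sketch, assuming TEI's /health and /embed routes on the port mapped above):

import requests

# Sanity check against the container started above (hypothetical quick test, not part of the benchmark).
assert requests.get("http://localhost:8080/health").status_code == 200
resp = requests.post("http://localhost:8080/embed", json={"inputs": "warm-up sentence"})
assert resp.status_code == 200
print(f"embedding dimension: {len(resp.json()[0])}")  # 384 for all-MiniLM-L6-v2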

Small benchmark script

import json
import timeit

import numpy as np
import requests
from sentence_transformers import SentenceTransformer

LIVE_URL = "http://localhost:8080"

def embedding_live_performance():
    sample = [f"Test count {i} {(list(range(i % (384))))} " for i in range(2048)]

    json_d = {"input": sample, "model": "model"}
    session = requests.Session()
    # req = session.get(f"{LIVE_URL}/models")
    # assert req.status_code == 200

    batch_size = 64 # req.json()["data"]["stats"]["batch_size"]
    model_name = "sentence-transformers/all-MiniLM-L6-v2" # req.json()["data"]["id"]
    print(f"batch_size is {batch_size}")
    model = SentenceTransformer(model_name_or_path=model_name)

    def local(data: list[str]):
        enc = model.encode(data, batch_size=batch_size)
        assert len(enc) == len(data)
        return enc

    def remote(json_data: dict):
        req = session.post(f"{LIVE_URL}/openai", json=json_data)
        assert req.status_code == 200
        return req

    local_resp = local(sample)
    remote_resp = [d["embedding"] for d in remote(json_d).json()["data"]]
    np.testing.assert_almost_equal(local_resp, remote_resp, 4) # TEI BREAKS HERE unless
    print("Both methods provide the identical output.")

    print("Measuring latency via SentenceTransformers")
    latency_st = timeit.timeit("local(sample)", number=10, globals=locals())
    print("SentenceTransformers latency: ", latency_st)
    model = None  # drop the local model before timing the remote path

    print("Measuring latency via requests to TEI")
    latency_request = timeit.timeit("remote(json_d)", number=10, globals=locals())
    print(f"Request latency: {latency_request}")

    assert latency_st * 1.1 > latency_request

if __name__ == "__main__":
    embedding_live_performance()
OlivierDehaene commented 1 year ago

What matters in retrieval is not the precision of the embedding vectors; it is the ranking of the distances between the vectors. If you lose some decimals of precision, it doesn't always show up in the ranking.
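For example, one way to check whether lost decimals actually change retrieval behaviour (a sketch, not from this thread; it assumes two embedding matrices such as local_resp and remote_resp from the script above) is to compare the ranking each embedding set induces for a query rather than the raw values:

import numpy as np

def ranking_agreement(a: np.ndarray, b: np.ndarray, query_idx: int = 0) -> float:
    """Fraction of rank positions on which both embedding sets agree,
    using one row as the query and all rows as documents."""
    def ranks(emb: np.ndarray) -> np.ndarray:
        q = emb[query_idx]
        sims = (emb @ q) / (np.linalg.norm(emb, axis=1) * np.linalg.norm(q))
        return np.argsort(-sims)
    return float(np.mean(ranks(a) == ranks(b)))

# Hypothetical usage with the arrays from the benchmark script:
# print(ranking_agreement(np.asarray(local_resp), np.asarray(remote_resp)))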

Also, this issue is not specific to TEI. You will face the same issue with any graph-fusing engine: ONNX, NVIDIA TensorRT, or Inferentia Neuron, to name a few, all make approximations in favour of speed.

In the case of TEI, it comes from the GELU fusing as cuBLASLt does not use exactly the same GELU approximation as the one that is originally used in Transformers. It could be possible to add an "exact" argument to disable the fusing.
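The size of that approximation gap can be inspected directly in PyTorch, which exposes both the exact erf-based GELU and the tanh approximation (an illustration only; these are not necessarily the exact kernels TEI/cuBLASLt execute):

import torch

x = torch.linspace(-6, 6, 10001)
exact = torch.nn.functional.gelu(x)                       # erf-based "exact" GELU
approx = torch.nn.functional.gelu(x, approximate="tanh")  # tanh approximation
# The pointwise difference is small but non-zero; a fused kernel propagates it through every layer.
print((exact - approx).abs().max())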

cf. "It's always hard to evaluate the true impact on downstream tasks. I would advise users to run their own evals." And continue to run them in production!

michaelfeil commented 1 year ago

Thanks for the details. The precision difference probably comes from that fused GELU, then.

I would suggest adding a PR that validates that, e.g., the cosine similarity between TEI and sentence-transformers embeddings is > 0.999 - compare to https://github.com/qdrant/fastembed/pull/54/files
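A minimal sketch of what such a check could look like (hypothetical, in the spirit of the fastembed test linked above; it reuses the local_resp / remote_resp arrays from the benchmark script):

import numpy as np

def min_row_cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Lowest row-wise cosine similarity between two embedding matrices."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.min(np.sum(a * b, axis=1)))

# Hypothetical usage after running the benchmark script above:
# assert min_row_cosine_similarity(np.asarray(local_resp), np.asarray(remote_resp)) > 0.999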