qdrant / fastembed

Fast, Accurate, Lightweight Python library to make State of the Art Embedding
https://qdrant.github.io/fastembed/
Apache License 2.0

[Bug/Model Request]: Is slower than sentence transformer for all-minilm-l6-v2 #292

Open 0110G opened 2 months ago

0110G commented 2 months ago

What happened?

I benchmarked synchronous computation times for generating embeddings with two approaches:

  1. Using sentence transformers: ~1300 msgs per sec

    from sentence_transformers import SentenceTransformer
    model_standard = SentenceTransformer("all-MiniLM-L6-v2")
    
    start_time = time.time()
    for i in range(iter_count):
        model_standard.encode(random.sample(sentences, 1)[0])
    time_standard = time.time() - start_time
    print("Standard requires: {}s".format(time_standard))
    print("{} processed per sec".format(batch_size*iter_count/time_standard))

VS

  2. Using FastEmbed (synchronously): 800 msgs per sec

    from fastembed import TextEmbedding
    fast_model = TextEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
    start_time = time.time()
    for i in range(iter_count):
        list(fast_model.embed(random.sample(sentences, 1)[0]))
    time_standard = time.time() - start_time
    print("Fast requires: {}s".format(time_standard))
    print("{} processed per sec".format(batch_size*iter_count/time_standard))

I am using fastembed 0.3.3

pip show fastembed
Name: fastembed
Version: 0.3.3
Summary: Fast, light, accurate library built for retrieval embedding generation
Home-page: https://github.com/qdrant/fastembed
Author: Qdrant Team
Author-email: info@qdrant.tech
License: Apache License
Location: /Users/<>/PycharmProjects/Voyager/venv/lib/python3.9/site-packages
Requires: tqdm, PyStemmer, numpy, mmh3, onnxruntime, pillow, onnx, loguru, tokenizers, huggingface-hub, snowballstemmer, requests
Required-by: 

Why is this so much slower than the original implementation? What can I do to improve performance?

What Python version are you on? e.g. python --version

3.9.16

Version

0.2.7 (Latest)

What os are you seeing the problem on?

MacOS

Relevant stack traces and/or logs

No response

generall commented 2 months ago

For reference, our benchmark of fastembed is here - https://colab.research.google.com/github/qdrant/fastembed/blob/main/experiments/Throughput_Across_Models.ipynb

I would have to try your version to tell for sure what the difference is, but at first glance you are encoding one sentence at a time, while our benchmarks run in batches.

0110G commented 2 months ago

I am also computing batch-wise (batch size = 512):

sentences = [["Some arbitrary sentence 1"]*512, ["Some arbitrary sentence 2"]*512] 

0110G commented 2 months ago

Complete Python benchmarking code:

import random
import time

from sentence_transformers import SentenceTransformer
from fastembed import TextEmbedding

if __name__ == '__main__':
    iter_count = 50
    batch_size = 512
    sentences = [["biblestudytools kjv romans 6"]*512, ["MS Dhoni is one of the best wicket keeper in the world"]*512] #Standard requires: 39.150851249694824s

    # Sentence Transformers model
    model_standard = SentenceTransformer("all-MiniLM-L6-v2")
    # FastEmbed (ONNX) model
    fast_model = TextEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

    # Time Sentence Transformers: one encode() call per 512-sentence batch
    start_time = time.time()
    for i in range(iter_count):
        model_standard.encode(random.sample(sentences, 1)[0])
    time_standard = time.time() - start_time
    print("Standard requires: {}s".format(time_standard))
    print("{} processed per sec".format(batch_size*iter_count/time_standard))

    # Time FastEmbed: embed() returns a generator, so list() forces computation
    start_time = time.time()
    for i in range(iter_count):
        list(fast_model.embed(random.sample(sentences, 1)[0]))
    time_standard = time.time() - start_time
    print("Fast requires: {}s".format(time_standard))
    print("{} processed per sec".format(batch_size*iter_count/time_standard))

Output:


Standard requires: 21.204905033111572s
1207.267844870112 processed per sec
Fast requires: 25.721112966537476s
995.2913014808091 processed per sec

generall commented 2 months ago

Thanks for sharing, we will look into it!

generall commented 2 months ago

@0110G

Refactored the testing script a bit, here are my results: https://colab.research.google.com/drive/1SroKOUZ0iYN1vo2mRXdhIQeVyy0RWQTG?usp=sharing

It uses internal batching instead of an external loop, since both libraries provide interfaces capable of creating batches internally. If your use case requires different batching, it apparently might not work as well with fastembed.

Additionally, I tried different scenarios: inferencing individual queries, a data-parallel approach, and running on a machine with more CPUs (the default Colab has 2 CPUs, but the higher tier has 8).
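
A minimal sketch of what internal batching looks like with both libraries (batch_size=256 and the document counts are illustrative assumptions, not the settings used in the Colab):

    import time

    from sentence_transformers import SentenceTransformer
    from fastembed import TextEmbedding

    documents = ["MS Dhoni is one of the best wicket keeper in the world"] * 512 * 50

    # Sentence Transformers: pass the whole list once and let encode() batch internally
    st_model = SentenceTransformer("all-MiniLM-L6-v2")
    start = time.time()
    st_model.encode(documents, batch_size=256)
    print("sentence-transformers: {:.1f} docs/sec".format(len(documents) / (time.time() - start)))

    # FastEmbed: embed() is lazy, so exhaust the generator; batch_size controls internal batching
    fe_model = TextEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
    start = time.time()
    list(fe_model.embed(documents, batch_size=256))
    print("fastembed: {:.1f} docs/sec".format(len(documents) / (time.time() - start)))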

0110G commented 2 months ago

My use case involves constantly consuming messages from a stream in a (configurable) batch size, computing embeddings, doing some computation, and writing the results to a DB. Therefore your approach does not fit my use case.

Seems like fastembed is not so fast after all.

generall commented 2 months ago

@0110G I think I've found the problem: when you call the embed function in fastembed, it spawns workers each time, which creates overhead.

I tried to convert the fastembed version into streaming with Python generators, so the embed function is only called once: https://colab.research.google.com/drive/1X03qTpBVNGDYs82CztfpqF2JOq_-75hK?usp=sharing

Please let me know if this option is closer to your use-case.
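
A minimal sketch of that streaming idea, assuming a hypothetical consume_messages() stream consumer and write_to_db() sink in place of the actual pipeline:

    from fastembed import TextEmbedding

    fast_model = TextEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

    def message_stream():
        # Hypothetical source: pull one configurable-size batch per call
        while True:
            batch = consume_messages(max_messages=512)  # placeholder, not a real API
            if not batch:
                break
            yield from batch

    # embed() is called exactly once on a generator, so the model and any workers
    # are set up once and embeddings are produced lazily as messages arrive
    for embedding in fast_model.embed(message_stream(), batch_size=512):
        write_to_db(embedding)  # placeholder sink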

0110G commented 2 months ago

This works, but I am not getting results similar to what you showed on Colab. Sentence Transformers is still faster for me. I find it absurd: how can an ONNX model be slower than the original implementation?

joein commented 2 months ago

hi @0110G

Actually, I've encountered several cases where the ONNX model was slower on macOS; the issue might be in onnxruntime.
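
One way to inspect what onnxruntime is doing locally (a diagnostic sketch; whether the providers argument is accepted depends on the installed fastembed version):

    import onnxruntime as ort

    # List the execution providers available in this onnxruntime build;
    # on macOS this is often just CPUExecutionProvider
    print(ort.get_available_providers())

    # If the installed fastembed release supports it, an explicit provider list
    # can be passed through to onnxruntime when constructing the model
    from fastembed import TextEmbedding
    model = TextEmbedding(
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        providers=["CPUExecutionProvider"],
    )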

generall commented 2 months ago

I was running Colab on a higher-tier machine with 8 CPUs; that might be the reason.
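
For reference, fastembed's embed() also exposes a parallel parameter for data-parallel encoding, which is where extra CPUs help most. A sketch with illustrative inputs (per the fastembed docstring, parallel=0 uses all available cores, parallel > 1 uses that many workers):

    from fastembed import TextEmbedding

    model = TextEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
    documents = ["some arbitrary sentence"] * 512 * 50

    # Data-parallel encoding across all available cores
    embeddings = list(model.embed(documents, batch_size=256, parallel=0))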