0110G opened this issue 2 months ago
For reference, our benchmark of fastembed is here - https://colab.research.google.com/github/qdrant/fastembed/blob/main/experiments/Throughput_Across_Models.ipynb
I would have to try your version to tell for sure what the difference is, but at first glance you are encoding one sentence at a time, while our benchmarks run in batches
I am also computing batch-wise (batch size = 512):
sentences = [["Some arbitrary sentence 1"]*512, ["Some arbitrary sentence 2"]*512]
Complete Python benchmarking code:

import random
import time

from fastembed import TextEmbedding
from sentence_transformers import SentenceTransformer

if __name__ == '__main__':
    iter_count = 50
    batch_size = 512
    sentences = [
        ["biblestudytools kjv romans 6"] * batch_size,
        ["MS Dhoni is one of the best wicket keeper in the world"] * batch_size,
    ]  # Standard requires: 39.150851249694824s

    # Sentence transformers
    model_standard = SentenceTransformer("all-MiniLM-L6-v2")
    fast_model = TextEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

    start_time = time.time()
    for i in range(iter_count):
        model_standard.encode(random.sample(sentences, 1)[0])
    time_standard = time.time() - start_time
    print("Standard requires: {}s".format(time_standard))
    print("{} processed per sec".format(batch_size * iter_count / time_standard))

    # fastembed
    start_time = time.time()
    for i in range(iter_count):
        list(fast_model.embed(random.sample(sentences, 1)[0]))
    time_fast = time.time() - start_time
    print("Fast requires: {}s".format(time_fast))
    print("{} processed per sec".format(batch_size * iter_count / time_fast))
Output:
Standard requires: 21.204905033111572s
1207.267844870112 processed per sec
Fast requires: 25.721112966537476s
995.2913014808091 processed per sec
Thanks for sharing, we will look into it!
@0110G
Refactored the testing script a bit, here are my results: https://colab.research.google.com/drive/1SroKOUZ0iYN1vo2mRXdhIQeVyy0RWQTG?usp=sharing
It uses internal batching instead of an external loop, as both libraries provide interfaces capable of creating batches internally (see the sketch below). If your use case requires different batching, it apparently might not work so well with fastembed.
Additionally, I tried a different scenario of inferencing individual queries, a data-parallel approach, and running on a higher-CPU machine (the default Colab has 2 CPUs, but the higher tier has 8).
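For reference, a minimal sketch of what internal batching means here (assuming the public APIs of both libraries, not the exact notebook code): the whole list of documents is passed in a single call together with a batch_size, and each library splits it into batches itself; fastembed's parallel argument additionally enables data-parallel workers.

import time

from fastembed import TextEmbedding
from sentence_transformers import SentenceTransformer

documents = ["MS Dhoni is one of the best wicket keeper in the world"] * 25600  # 512 * 50

# Sentence Transformers: encode() batches internally via batch_size
st_model = SentenceTransformer("all-MiniLM-L6-v2")
start = time.time()
st_model.encode(documents, batch_size=512)
print("sentence-transformers: {:.2f}s".format(time.time() - start))

# fastembed: embed() also batches internally; parallel > 1 would use data-parallel workers
fast_model = TextEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
start = time.time()
list(fast_model.embed(documents, batch_size=512, parallel=None))
print("fastembed: {:.2f}s".format(time.time() - start))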
My use case involves constantly consuming messages from a stream in a (configurable) batch size, computing embeddings, doing some computation, and writing the results to a DB. Therefore your approach does not fit my use case.
Seems like fastembed is not so fast after all.
@0110G
I think I understood the problem: when you call the embed function in fastembed, it spawns workers each time, which creates overhead.
I tried to convert the fastembed version to streaming with Python generators, so the embed function is only called once: https://colab.research.google.com/drive/1X03qTpBVNGDYs82CztfpqF2JOq_-75hK?usp=sharing
Please let me know if this option is closer to your use-case.
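To make the generator idea concrete, here is a minimal sketch (not the notebook's exact code) where the hypothetical message_stream() stands in for the real message source; embed() is called exactly once on the generator and consumed lazily, so workers are only spawned once:

from fastembed import TextEmbedding

def message_stream(n=10000):
    # hypothetical stand-in for messages consumed from a real stream
    for i in range(n):
        yield "message number {}".format(i)

fast_model = TextEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

# a single embed() call over the whole stream; batches are formed internally
embeddings = fast_model.embed(message_stream(), batch_size=512)
for emb in embeddings:
    pass  # compute on the embedding and write it to the DB here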
This works, but I am not getting results similar to what you showed on Colab. Sentence Transformers is still faster for me. I find this absurd: how can the ONNX model be slower than the original implementation?
hi @0110G
Actually, I've encountered several cases where the ONNX model was slower on macOS; the issue might be in onnxruntime.
Also, I was running Colab on a higher-tier machine with 8 CPUs, which might be the reason.
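If you want to rule onnxruntime in or out on macOS, one quick sanity check (just a suggestion, not something from the notebook) is to print which onnxruntime build and execution providers your environment actually has:

import onnxruntime as ort

print(ort.__version__)                # onnxruntime build installed in this environment
print(ort.get_available_providers())  # e.g. ['CPUExecutionProvider'] on a plain CPU build
print(ort.get_device())               # the device onnxruntime reports it will use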
What happened?
On benchmarking synchronous computation times for generating embeddings for
Using sentence transformers: ~1300 msgs per sec
VS
I am using fastembed 0.3.3
Why is this working so slowly compared to the original implementation? What can I do to improve performance?
What Python version are you on? e.g. python --version
3.9.16
Version
0.2.7 (Latest)
What os are you seeing the problem on?
MacOS
Relevant stack traces and/or logs
No response