Hi Nelson, before we created this plugin we used a script to run cosine similarity. When we switched to the plugin we gained more than an order of magnitude in performance! A script runs at the Elasticsearch level, while this plugin runs inside the internal, highly optimized Lucene level.
The 80ms was achieved using 4 m4.10xlarge machines in a cluster with about 50 shards. As a rule of thumb, the more shards you have, the better the latency, but note that throughput will decline. Why? Elasticsearch allocates a CPU core per shard, so more shards means more active cores when processing a single search query, while fewer shards means fewer cores per query, which leaves room for more concurrent queries.
I see, very interesting, thank you for the information!
What level of throughput can you serve with this setup? My estimate would be: with 4 m4.10xlarges you have 160 virtual cores (80 physical cores); with 50 shards, roughly 2-3 queries can be served simultaneously, and if each query takes 80ms, that works out to about 25-50 queries per second?
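A quick back-of-the-envelope check of that estimate, using the numbers from this thread and the one-core-per-shard rule of thumb given above:

```python
# Rough sanity check of the throughput estimate (numbers from this thread).
vcpus = 4 * 40                 # four m4.10xlarge instances, 40 vCPUs each
shards = 50                    # one busy core per shard during a query
concurrent = vcpus / shards    # ~3.2 queries in flight at a time
latency_s = 0.080              # per-query latency
print(concurrent / latency_s)  # ~40 QPS, within the 25-50 range above
```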
Also, do you find that the EBS storage on m4 instances creates any I/O bottleneck? I am wondering if reading from local SSDs could improve performance slightly as well.
As an aside, there may be a small performance optimization opportunity in the plugin. For cosine similarity, the plugin could pre-compute the magnitude of each vector, for both the query and indexed vectors (at search time and indexing time respectively), and store it as an extra (N+1)th array element, so a 50-dimensional vector would have 51 values, where the 51st value is the magnitude. That way the magnitude of each vector need not be recomputed for each distance calculation.
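A minimal numpy sketch of that idea (illustrative only, not the plugin's actual storage format):

```python
import numpy as np

def with_magnitude(vec):
    """Append the precomputed magnitude as the (N+1)th element,
    so it is computed once at indexing time."""
    vec = np.asarray(vec, dtype=np.float32)
    return np.append(vec, np.linalg.norm(vec))

def cosine(stored_a, stored_b):
    """Cosine similarity that reuses the stored magnitudes instead of
    recomputing them for every distance calculation."""
    a, norm_a = stored_a[:-1], stored_a[-1]
    b, norm_b = stored_b[:-1], stored_b[-1]
    return float(np.dot(a, b) / (norm_a * norm_b))
```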
Your calculations are correct. We actually reduced the number of shards to 10, gaining throughput. We were able to reduce latency by using k-means to divide the corpus into clusters; at query time we search only the X nearest clusters, where X is tuned beforehand to provide a KNN accuracy of 98%.
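A sketch of that cluster-routing idea in Python (scikit-learn for k-means; the corpus size, cluster count, and names here are illustrative, not the actual setup):

```python
import numpy as np
from sklearn.cluster import KMeans

# Offline: partition the corpus into clusters (sizes are placeholders).
corpus = np.random.rand(100_000, 64).astype(np.float32)
kmeans = KMeans(n_clusters=100, n_init=10).fit(corpus)

def candidate_indices(query, x):
    """Return corpus indices belonging to the X clusters whose centroids
    are nearest to the query; only these candidates are scored exactly.
    X is tuned offline against exact KNN until recall reaches ~98%."""
    dists = np.linalg.norm(kmeans.cluster_centers_ - query, axis=1)
    nearest_clusters = np.argsort(dists)[:x]
    return np.where(np.isin(kmeans.labels_, nearest_clusters))[0]
```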
Regarding the magnitude - you're right. Internally we preferred to normalize all our vectors and use dot product instead of cosine similarity, thus bypassing the magnitude calculation entirely.
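For unit-length vectors the dot product and cosine similarity coincide, so normalizing once at indexing time removes the magnitude math from the query path; a small sketch:

```python
import numpy as np

def normalize(vec):
    """L2-normalize once at indexing time."""
    vec = np.asarray(vec, dtype=np.float32)
    return vec / np.linalg.norm(vec)

a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])
# For unit vectors, cos(a, b) equals a . b, so no magnitudes at query time.
print(np.dot(a, b))
```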
Ah! K-means is a great idea for reducing the number of calculations further. Thanks for the tip and for answering my questions, very helpful!
Hello, I am quite impressed by your 80ms latency for 64-dimensional floats and ~4 million items. What does your infrastructure look like? Does this include parallelization via sharding? What hardware type are you using? Is 80ms on a single machine?
I have a similarly sized corpus: 5 million documents, 50-dimensional floats. I wrote a KNN function using a script in Elasticsearch’s Painless language, and it takes about 13 seconds to score the corpus by nearest neighbors on a single AWS i3.4xlarge EC2 instance.
I am curious whether using a plugin rather than Painless will give me significantly better performance... but I wanted to understand how you achieved such good numbers before I invest in the plugin approach.