lior-k / fast-elasticsearch-vector-scoring

Score documents using embedding-vectors dot-product or cosine-similarity with ES Lucene engine
Apache License 2.0
395 stars · 112 forks

Algorithm behind this plugin #30

Closed snakeztc closed 5 years ago

snakeztc commented 5 years ago

I am curious about the algorithm behind this plugin and how it achieves fast speed. Does it implement some sort of approximate-nearest-neighbour algorithm, or does it actually compute scores over the entire corpus using brute force, sped up by a low-level implementation? Thanks!

lior-k commented 5 years ago

Good question. The answer: it's the latter. Brute force over the entire corpus, using a low-level Lucene API. This is in order to keep the plugin minimal, so it can serve as a base for other algorithms, like LSH, K-means, or others, to be built on top.
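The brute-force approach can be sketched in a few lines: score every document against the query vector, then keep the top-k. This is a minimal illustrative sketch in plain Python (the plugin itself does this in Java against Lucene's binary doc values); the function and document names are made up for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def brute_force_knn(query, corpus, k=2):
    """Score every document in the corpus -- no pruning, no index tricks."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

corpus = {
    "doc1": [1.0, 0.0, 0.0],
    "doc2": [0.9, 0.1, 0.0],
    "doc3": [0.0, 1.0, 0.0],
}
print(brute_force_knn([1.0, 0.0, 0.0], corpus))
```

The cost is linear in corpus size and vector dimension, which is why the 2M × 600-dim case discussed below takes seconds rather than milliseconds.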

Internally we use K-means to divide the corpus into 200 clusters and store that information inside the ES document, so each document has a vector and a cluster number. Then, when querying for k-nearest-neighbour vectors, we search only the 20 most relevant clusters (using an ES filtered query). So we effectively search only 10% of the corpus.


snakeztc commented 5 years ago

Thanks for the quick answer! Very useful. I tried the plugin with 2 million vectors (vector size 600) on a single-node ES and it took about 4-5 seconds to return. Any ideas on how to speed it up without implementing an ANN?

lior-k commented 5 years ago

Split the Elasticsearch index into more shards. Rule of thumb: more shards = lower latency (the work is divided among more workers) but lower throughput (each request now occupies more cores).

I would also normalize the vectors before indexing and, at search time, call the plugin with `"cosine": false`. For unit-length vectors the dot product equals the cosine similarity, so the per-document norm computation can be skipped.
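The normalization trick rests on a simple identity: cos(a, b) = (a / ‖a‖) · (b / ‖b‖), so once vectors are stored at unit length, a plain dot product returns the cosine score. A small sketch (illustrative only, not plugin code):

```python
import math

def normalize(v):
    """Scale v to unit length so a dot product doubles as cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

a = normalize([3.0, 4.0])   # unit vector [0.6, 0.8]
b = normalize([4.0, 3.0])   # unit vector [0.8, 0.6]

dot = sum(x * y for x, y in zip(a, b))
print(dot)  # 0.96 -- identical to the cosine of the original vectors
```

Doing the normalization once at indexing time moves the square-root and division work out of the query hot path.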


snakeztc commented 5 years ago

Thanks!