lior-k / fast-elasticsearch-vector-scoring

Score documents using embedding-vectors dot-product or cosine-similarity with ES Lucene engine
Apache License 2.0

Document retrieval is slow #11

Closed c-chaitanya closed 6 years ago

c-chaitanya commented 6 years ago

The plugin works great for similarity search on small data. Recently, for my use case, I indexed about 150,000 documents into Elasticsearch and tried running searches. One small change: instead of Google's averaged word2vec vectors, I used Facebook's InferSent (a sentence-embedding model) to get vectors for my sentences. Searches take between 7 and 9 seconds to return results. The mapping I use is the same as the one in the README. My search query in Python is as follows:

search = self.es.search(index=Config.faq_index, body={
    "query": {"function_score": {
        # Only documents belonging to this account are scored.
        "query": {"bool": {"filter": {"term": {"account_id": account_id}}}},
        # Use the script score as the final score, ignoring the query score.
        "boost_mode": "replace",
        "script_score": {"script": {
            "inline": "binary_vector_score",
            "lang": "knn",
            "params": {
                "cosine": True,
                "field": "embedding_vector",
                "vector": vector_array,
            },
        }},
    }},
    "size": 10,
})

Can you suggest an approach to get the speed close to yours? You were able to search through 40 million documents in 0.8 sec.
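For context on why the query time grows with the number of documents: a `script_score` function is evaluated for every document that passes the filter, so the work is essentially a brute-force scan. The following is a minimal pure-Python sketch of that kind of scan, not the plugin's actual implementation. The names `cosine_similarity` and `top_k` and the in-memory `docs` dictionary are hypothetical and used only for illustration.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    docs: Dict[str, Sequence[float]],  # hypothetical: doc id -> stored vector
    k: int = 10,
    cosine: bool = True,
) -> List[Tuple[str, float]]:
    """Score every document against the query and keep the k best."""

    def score(vec: Sequence[float]) -> float:
        if cosine:
            return cosine_similarity(query, vec)
        # Plain dot product when cosine scoring is turned off.
        return sum(x * y for x, y in zip(query, vec))

    # One score computation per document: cost is linear in len(docs).
    scored = ((score(vec), doc_id) for doc_id, vec in docs.items())
    return [(doc_id, s) for s, doc_id in heapq.nlargest(k, scored)]
```

Because every candidate vector is read and scored, the cost per query is roughly proportional to (number of filtered documents) x (vector dimension), which is why spreading the documents over more machines or keeping the vectors in memory matters.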

lior-k commented 6 years ago

Yes, with multiple nodes. As far as I recall, it was 10 ES data nodes. What is the size of your index and the number of documents in it?

c-chaitanya commented 6 years ago

Hi, thanks for the quick reply. In my case each document has about 5 fields (one of them being the vector embedding), and the total number of documents is around 150,000. I just installed Elasticsearch on my machine (just one node, I guess) and started testing. Would this have an impact on retrieval speed? After indexing all the documents, the total size on disk is around 5 to 10 GB.

lior-k commented 6 years ago

Yes, the number of ES nodes affects the retrieval speed in a linear fashion: 2 nodes instead of 1 will give you twice the speed.

Also, the number of shards matters: it trades off latency against throughput. This is something you have to experiment with and benchmark in your environment. A rule of thumb: more shards means better latency but worse throughput.
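One way to run that kind of benchmark is a small harness that fires the same query repeatedly and reports latency and throughput. This is only a sketch under assumptions: `benchmark` and `run_query` are hypothetical names, and `run_query` stands for any zero-argument callable that issues one search (for example a wrapper around the `es.search(...)` call from the first comment).

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def benchmark(
    run_query: Callable[[], object],  # hypothetical: issues one search request
    n_requests: int = 50,
    concurrency: int = 1,
) -> Dict[str, float]:
    """Run the query n_requests times with `concurrency` parallel clients."""

    def timed_call(_: int) -> float:
        t0 = time.perf_counter()
        run_query()
        return time.perf_counter() - t0

    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, range(n_requests)))
    elapsed = time.perf_counter() - started

    # Nearest-rank style index for the 95th percentile, clamped to the list.
    p95_index = min(len(latencies) - 1, int(0.95 * len(latencies)))
    return {
        "mean_latency_s": statistics.mean(latencies),
        "p95_latency_s": latencies[p95_index],
        "throughput_qps": n_requests / elapsed,
    }
```

Running it once with `concurrency=1` and again with a higher value, for each shard count you try, gives a rough picture of the latency versus throughput trade-off on your own hardware.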

c-chaitanya commented 6 years ago

Thank you for the quick and prompt answer. I guess having multiple nodes solves the issue. The answer was also educative. Thanks again!

clark010 commented 5 years ago

@lior-k How many CPU vcores and how much memory did you configure for one ES node? I guess all the data should be in the memory cache.

lior-k commented 5 years ago

The plugin reads only the vector field from the document. For performance you should have enough memory to keep all the vectors in memory. For example, say you have 10GB of vectors in your entire corpus and 2 nodes: each node must hold ~5GB in memory. Note that this is not only ES cache memory but also the OS file system cache (since ES relies heavily on the OS cache). I recommend keeping at least 50% of your OS memory free for cache.
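The arithmetic above can be sketched in a few lines. The helper names are hypothetical, and the estimate covers raw float32 values only (4 bytes each), ignoring any storage or encoding overhead. The 4096-dimension figure in the usage lines is an assumption about the usual InferSent output size, not something stated in this thread.

```python
def vector_bytes(n_docs: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw size of all vectors, assuming one fixed-size vector per document."""
    return n_docs * dims * bytes_per_value


def per_node_gb(total_vector_gb: float, n_nodes: int) -> float:
    """Vector data each node must hold if it is spread evenly over the nodes."""
    return total_vector_gb / n_nodes


# The example from the comment above: 10 GB of vectors on 2 nodes.
print(per_node_gb(10, 2))  # 5.0

# Assumption: 150,000 documents with 4096-dimensional float32 vectors.
print(vector_bytes(150_000, 4096) / 1e9)  # 2.4576 (GB, decimal)
```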

In my case I now have 10 m4.10xl nodes, for a corpus of more than 500GB (of which about 100GB are vectors).

I recommend you benchmark with different settings and record the latency and throughput for each. Also check the "minor page faults" parameter and see whether adding more memory reduces it.
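On Linux, one way to watch the page-fault counters of a process is to read `/proc/<pid>/stat`, where field 10 is the cumulative minor-fault count and field 12 the major-fault count. The sketch below assumes a Linux host and that you already know the PID of the Elasticsearch process; the function names are hypothetical.

```python
from typing import Tuple


def parse_fault_counters(stat_line: str) -> Tuple[int, int]:
    """Return (minor_faults, major_faults) from one /proc/<pid>/stat line."""
    # The process name (field 2) is wrapped in parentheses and may contain
    # spaces, so split on the last ')' before splitting on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 10 is rest[7] and field 12 is rest[9].
    return int(rest[7]), int(rest[9])


def read_fault_counters(pid: int) -> Tuple[int, int]:
    """Read the counters for a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as fh:
        return parse_fault_counters(fh.read())
```

Sampling `read_fault_counters(pid)` before and after a benchmark run and taking the difference shows how many faults the run caused, which can then be compared across memory settings.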