jobergum / dense-vector-ranking-performance

Performance evaluation of nearest neighbor search using Vespa, Elasticsearch and Open Distro for Elasticsearch K-NN
Apache License 2.0
116 stars 14 forks

OpenDistro Elasticsearch K-NN #4

mnagaya opened 4 years ago

mnagaya commented 4 years ago

Is there any plan to compare with OpenDistro Elasticsearch K-NN?

https://opendistro.github.io/for-elasticsearch/features/knn.html

jobergum commented 4 years ago

Hello @mnagaya,

I'm open to contributions using the two mentioned datasets, if you have time?

mnagaya commented 4 years ago

I have a lot of work to do this month, but I will check it out later.

mnagaya commented 4 years ago

Hi @jobergum,

Open Distro for Elasticsearch K-NN is 10x faster than Vespa in my environment, but that is not a comparison with Vespa ANN. Do you have any plan to add the Vespa ANN result?

The Vespa ANN implementation looks very nice.

jobergum commented 4 years ago

Yes, I'm planning to add the approximate version soon, too many things to do! When using an approximate ANN version it's important to also compare recall at a given QPS.


mnagaya commented 4 years ago

Currently, ef_search is 512 (its default value) and recall is 0.997600:

$ python3 bin/check-recall.py gist-960-euclidean.hdf5 
Average recall Opendistro for Elastic = 0.997600

Ref: https://opendistro.github.io/for-elasticsearch-docs/docs/knn/settings/#index-settings
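
For context, these index-time k-NN parameters in Open Distro are set when the index is created. A minimal sketch of such an index definition, following the setting names in the linked docs (the host, index name, and field name are assumptions for illustration):

import requests

# Hypothetical index "gist" with a 960-dimensional k-NN vector field "v".
settings = {
    "settings": {
        "index.knn": True,
        "index.knn.algo_param.m": 16,                 # max links per node
        "index.knn.algo_param.ef_construction": 512,  # build-time beam width
        "index.knn.algo_param.ef_search": 512,        # search-time beam width (index-level)
    },
    "mappings": {
        "properties": {
            "v": {"type": "knn_vector", "dimension": 960}
        }
    },
}
requests.put("http://localhost:9200/gist", json=settings).raise_for_status()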

jobergum commented 4 years ago

So I have done a run with HNSW enabled for the gist dataset with Vespa, using the mentioned 1 x Intel(R) Xeon E5-2680 v3 2.50GHz (Haswell).

https://opendistro.github.io/for-elasticsearch-docs/docs/knn/settings/#index-settings

lists the following default values: m: 16, ef_construction: 512, ef_search: 512.

Configuring Vespa to match:

field vector type tensor<float>(x[960]) {
    indexing: attribute | index
    index {
        hnsw {
            max-links-per-node: 16
            neighbors-to-explore-at-insert: 512
        }
    }
}

Still, I needed to adjust ef_search for Vespa upwards from 512 to 3*512 to get comparable recall, so there might be some differences in the settings between the two engines. With Vespa, ef_search is a query parameter, so it can be tuned on a per-request basis. Since I wanted to compare performance at the same recall level, I had to adjust this parameter upwards:

'yql': 'select * from sources * where [{"targetNumHits":%i, "hnsw.exploreAdditionalHits":%i}]nearestNeighbor(vector,vector);' % (10,3*512)
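
For completeness, a sketch of how such a query might be sent to Vespa's search API (the endpoint, field names, and tensor-passing convention are assumptions based on the snippet above):

import requests

def query_vespa(vector, hits=10, explore_additional=3 * 512):
    # hnsw.exploreAdditionalHits plays the role of ef_search and is a
    # per-request annotation, so recall/performance can be tuned per query.
    yql = ('select * from sources * where '
           '[{"targetNumHits":%i, "hnsw.exploreAdditionalHits":%i}]'
           'nearestNeighbor(vector,vector);' % (hits, explore_additional))
    body = {
        "yql": yql,
        "ranking.features.query(vector)": str(vector),  # query tensor literal
        "hits": hits,
    }
    return requests.post("http://localhost:8080/search/", json=body).json()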

Recall

Average recall Vespa = 0.994600
Average recall Opendistro for Elastic = 0.995500

Performance:

Vespa actual query rate:          53.32 Q/s
Open Distro ES actual query rate: 13.13 Q/s

The Open Distro memory usage is very high: 35GB, compared to less than 10GB for Vespa's total, including the configuration services, the search core process, and the serving container.

jobergum commented 4 years ago

Is there a way to control ef_search in the query with Open Distro ES, @mnagaya?

mnagaya commented 4 years ago

Is there a way to control ef_search in the query with Open Distro ES, @mnagaya?

Probably not.

jobergum commented 4 years ago

Thanks for confirming @mnagaya. It's odd that this parameter cannot be set at run time, as it's quite important for exploring the recall versus performance tradeoff. I'll capture a few more rounds and update the README with the results. Again, thank you for your contribution.

vamshin commented 4 years ago

@jobergum a couple of suggestions for benchmarking with Open Distro Elasticsearch's k-NN: warm up the index before measuring, merge segments down before searching, and avoid reading stored fields in the query (fetch only scores and doc ids via docvalue_fields).

You might find this link useful for indexing/search tuning: https://medium.com/@kumon/how-to-realize-similarity-search-with-elasticsearch-3dd5641b9adb

jobergum commented 4 years ago

@vamshin,

Yes, if you look at the README, the benchmarking parameters are explained. A 20 second warmup is included, and I also run multiple times before publishing results in the README. I also welcome discussion with developers, as in https://github.com/jobergum/dense-vector-ranking-performance/issues/2.

On merging of segments: yes, this seems to be a major factor with both ES variants and should be mentioned. We mention this in the README, but it also highlights a weakness of the ES/Lucene indexing architecture with immutable segments.

I also notice that there are very large off-heap allocations for OES: 1M 960-dimensional vectors touch 55G (the heap is 16G).

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                                      
    1 elastic+  20   0   54.8g  34.6g   4.9g S 100.7 13.8 326:12.45 java         
00007f93eaabb000  120180  119944       0 r--s- _bn.cfs
00007f924c000000  130272  130272  130272 rw---   [ anon ]
00007f8ecc000000  131072  130696  130696 rw---   [ anon ]
00007f8e5c000000  131072  131072  131072 rw---   [ anon ]
00007f8e64000000  131072  131072  131072 rw---   [ anon ]
00007f8e6c000000  131072  131072  131072 rw---   [ anon ]
00007f8e74000000  131072  131072  131072 rw---   [ anon ]
00007f8e7c000000  131072  131072  131072 rw---   [ anon ]
00007f8e84000000  131072  131072  131072 rw---   [ anon ]
00007f8e8c000000  131072  131072  131072 rw---   [ anon ]
00007f8e94000000  131072  131072  131072 rw---   [ anon ]
00007f8e9c000000  131072  131072  131072 rw---   [ anon ]
00007f8ea4000000  131072  131072  131072 rw---   [ anon ]
00007f8eac000000  131072  131072  131072 rw---   [ anon ]
00007f8eb4000000  131072  131072  131072 rw---   [ anon ]
00007f8ebc000000  131072  131072  131072 rw---   [ anon ]
00007f8ec4000000  131072  131072  131072 rw---   [ anon ]
00007f8ed4000000  131072  131072  131072 rw---   [ anon ]
00007f8edc000000  131072  131072  131072 rw---   [ anon ]
00007f8ee4000000  131072  131072  131072 rw---   [ anon ]
00007f8eec000000  131072  131072  131072 rw---   [ anon ]
00007f8ef4000000  131072  131072  131072 rw---   [ anon ]
00007f8efc000000  131072  131072  131072 rw---   [ anon ]
00007f8f04000000  131072  131072  131072 rw---   [ anon ]
00007f8f0c000000  131072  131072  131072 rw---   [ anon ]
00007f8f14000000  131072  131072  131072 rw---   [ anon ]
00007f8f1c000000  131072  131072  131072 rw---   [ anon ]
00007f8f24000000  131072  131072  131072 rw---   [ anon ]
00007f8f2c000000  131072  131072  131072 rw---   [ anon ]
00007f8f34000000  131072  131072  131072 rw---   [ anon ]
00007f8f3c000000  131072  131072  131072 rw---   [ anon ]
00007f9244000000  131072  131072  131072 rw---   [ anon ]
00007f92e0000000  131072  131072  131072 rw---   [ anon ]
00007f92e8000000  131072  131072  131072 rw---   [ anon ]
00007f92f0000000  131072  131072  131072 rw---   [ anon ]
00007f92f8000000  131072  131072  131072 rw---   [ anon ]
00007f9300000000  131072  131072  131072 rw---   [ anon ]
00007f9308000000  131072  131072  131072 rw---   [ anon ]
00007f9310000000  131072  131072  131072 rw---   [ anon ]
00007f9338000000  131072  131072  131072 rw---   [ anon ]
00007f9394000000  131072  131072  131072 rw---   [ anon ]
00007f93b8000000  131072  131072  131072 rw---   [ anon ]
00007f92bde5e000  142268  141992       0 r--s- _78.cfs
00007f921ec1e000  142596  142596  142596 rw---   [ anon ]
00007f929edb8000  149792  149648       0 r--s- _7i.cfs
00007f922775f000  162000  161688       0 r--s- _a0.cfs
00007f92d14d2000  175288  174932       0 r--s- _ey.cfs
00007f97e805b000  261780  260620  260620 rw---   [ anon ]
00007f9353a4a000  267992  267656       0 r--s- _da.cfs
00007f92732a0000  338440  338080       0 r--s- _7u.cfs
00007f920573d000  369420  368952       0 r--s- _9g.cfs
00007f91902c1000  397932  397448       0 r--s- _b1.cfs
00007f9156a1b000  470948  470500       0 r--s- _dz.cfs
00007f9173604000  471796  471336       0 r--s- _cf.cfs
00007f91a875c000  475012  475012       0 r--s- _6c_Lucene80_0.dvd
00007f91c573d000 1048576 1048576       0 r--s- _6c_Lucene80_0.dvd
00007f908a562000 3347172 3347172 3347172 rw---   [ anon ]
0000000400000000 16777216 16777216 16777216 rw---   [ anon ]
total kB         57421916 36263388 3117405
vamshin commented 4 years ago

Thanks @jobergum.

Graphs are loaded outside ES heap, so we could take advantage of bigger instances with more RAM.

For 1M vectors with 960 dimensions, k-NN graph memory usage would be on the order of (4d + 8m) bytes/vector, where d is the dimension and m is the max links per node.

For 1M vectors, graph usage should be (4*960 + 8*16)*1M bytes = ~4GB.

Total memory usage should be ES heap + graph memory usage + Lucene.
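
As a quick sanity check of that estimate (a back-of-the-envelope sketch, not the engine's exact accounting):

def knn_graph_bytes(num_vectors, dimensions, max_links):
    # 4 bytes per float component plus 8 bytes per graph link.
    return num_vectors * (4 * dimensions + 8 * max_links)

print(knn_graph_bytes(1_000_000, 960, 16) / 1024 ** 3)  # ~3.7 GiB, i.e. ~4 GB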

mnagaya commented 4 years ago

@jobergum

Have you increased the value of the "-n" parameter in vespa-fbench?

I got the following results in my environment.

Server 2 x Intel(R) Xeon Silver 4114 2.20 GHz (Skylake).

Results:

Engine   QPS      Average Latency (ms)   95P Latency (ms)   Recall@10
OES      592.09   84.28                  126.84             0.989400
Vespa    492.95   101.31                 160.80             0.996300

Thanks,

jobergum commented 4 years ago

Before merging segments (which took more than 2 hours):

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                                      
    1 elastic+  20   0   54.8g  34.6g   4.9g S 100.7 13.8 326:12.45 java     

After:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                                      
    1 elastic+  20   0   53.9g  29.9g  31944 S   7.0 11.9 439:55.03 java         
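
For anyone reproducing this, segments can be merged down with Elasticsearch's force merge API; a minimal sketch (host and index name are assumptions):

import requests

# Merge all segments of the hypothetical "gist" index into a single segment.
# Expensive (more than 2 hours here), but leaves one HNSW graph to search.
requests.post(
    "http://localhost:9200/gist/_forcemerge",
    params={"max_num_segments": 1},
).raise_for_status()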

16G is from the heap; the remainder is off-heap. After merging segments, the recall also changes for the given index-time ef_search setting: from 0.995500 with more segments (one HNSW graph per segment?) to 0.979600 with a single segment, which also makes the recall more comparable with Vespa at the same ef_search setting (Vespa's query-time setting "hnsw.exploreAdditionalHits"):

Average recall Vespa = 0.980400
Average recall Opendistro for Elastic = 0.979600

With lower recall and fewer segments to search (ES query execution is single threaded over the N segments), the performance improves significantly:

Vespa actual query rate:   95.26 Q/s
Open ES actual query rate: 84.79 Q/s

All details https://gist.github.com/jobergum/ff46385c44bbb1683be16d98a8eed6ba

@mnagaya here, as in several ANN benchmarks (e.g. http://ann-benchmarks.com/), one reports QPS as a function of the average latency with 1 client, e.g. a 10 ms average latency gives 100 QPS. This gives a baseline, but I agree that it's also nice to show that the technology scales with increased concurrency, as I'm pretty sure many of the libraries benchmarked in ann-benchmarks will have issues with contention at high concurrency.
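
In code form, the single-client relationship, and its idealized extension to concurrent clients (ignoring contention), looks like this sketch:

def qps(avg_latency_ms, clients=1):
    # Closed-loop clients each issue 1000/latency queries per second;
    # contention makes the multi-client number an upper bound in practice.
    return clients * 1000.0 / avg_latency_ms

print(qps(10))             # 100.0 QPS with a single client
print(qps(10, clients=8))  # 800.0 QPS ideal upper bound with 8 clients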

vamshin commented 4 years ago

@jobergum thank you.

Did you run with the following suggestion in the ES query (avoid reading stored fields and just read scores and doc ids)? It looks like all we need is the neighbors for recall.

Example as mentioned in the above comments:

{
  "size": 5,
  "stored_fields": "_none_",
  "docvalue_fields": ["_id"],
  "query": {
    "knn": {
      "v": {
        "vector": [-0.16490704, -0.047262248, -0.078923926],
        "k": 50
      }
    }
  }
}

This improved performance in our tests. It would be great if you could run with the above suggestion.

jobergum commented 4 years ago

@vamshin Yes, the last quoted run above with 85 QPS used the suggested stored_fields and docvalue_fields. I have not had time to format and re-run the results to update the README yet. When OES has merged the segments into one, the performance of the two engines is almost equal. Still, the impact the number of segments has on search performance in OES is mind-blowing, as is the time it takes to merge segments, which will have a big impact in a real-time feed scenario.

vamshin commented 4 years ago

Still, the impact the number of segments has on search performance in OES is mind-blowing, as is the time it takes to merge segments, which will have a big impact in a real-time feed scenario.

@jobergum We can control segment creation right at indexing time by disabling the refresh interval or using a large refresh interval. Once all the indexing is done, we can call refresh to create the segments. This way we end up with minimal segments.
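
A minimal sketch of that approach (host and index name are assumptions):

import requests

base = "http://localhost:9200/gist"

# Disable refresh while bulk indexing so intermediate segments are not published.
requests.put(base + "/_settings",
             json={"index": {"refresh_interval": "-1"}}).raise_for_status()

# ... bulk index all vectors here ...

# Restore the refresh interval and make the documents searchable.
requests.put(base + "/_settings",
             json={"index": {"refresh_interval": "1s"}}).raise_for_status()
requests.post(base + "/_refresh").raise_for_status()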

Would love to see the results in the README. Thank you.