mnagaya opened this issue 4 years ago:

Is there any plan to compare with OpenDistro Elasticsearch K-NN?
https://opendistro.github.io/for-elasticsearch/features/knn.html
Hello @mnagaya,
I'm open to contributions using the two mentioned datasets, if you have time.
I have a lot of work to do this month, but I will check it out later.
Hi @jobergum,
Opendistro for Elasticsearch k-NN is 10x faster than Vespa in my environment, but that is not a comparison with Vespa ANN. Do you have any plan to add the Vespa ANN result?
The Vespa ANN implementation looks very nice.
Yes, I'm planning to add the approximate version soon, too many things! When using an approximate ANN version it's important to also compare the recall at a given QPS.
Currently, ef_search is 512 (its default value) and recall is 0.997600.
```
$ python3 bin/check-recall.py gist-960-euclidean.hdf5
Average recall Opendistro for Elastic = 0.997600
```
Ref: https://opendistro.github.io/for-elasticsearch-docs/docs/knn/settings/#index-settings
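For reference, ef_search is an index-level setting in Open Distro rather than a query parameter. A minimal sketch of updating it, assuming a local cluster at http://localhost:9200 and a hypothetical index named `myindex` (setting name per the linked docs):

```python
import requests

# Update the k-NN ef_search index setting (name per the Open Distro docs);
# "myindex" and the local endpoint are placeholder assumptions.
requests.put(
    "http://localhost:9200/myindex/_settings",
    json={"index": {"knn.algo_param.ef_search": 512}},
)
```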
So I have done a run with hnsw enabled for the gist dataset with Vespa using the mentioned 1 x Intel(R) Xeon E5-2680 v3 2.50GHz (Haswell).
https://opendistro.github.io/for-elasticsearch-docs/docs/knn/settings/#index-settings lists the following default values: ef_search = 512, ef_construction = 512, and m = 16.
Configuring Vespa with:
```
field vector type tensor<float>(x[960]) {
    indexing: attribute | index
    index {
        hnsw {
            max-links-per-node: 16
            neighbors-to-explore-at-insert: 512
        }
    }
}
```
Still, I needed to adjust the ef_search for Vespa upwards from 512 to 3*512 to get comparable recall, so there might be some differences in the settings of the two engines. With Vespa, ef_search is a query parameter, so it can be tuned on a per-request basis. Since I wanted to compare performance at the same recall level, I adjusted this parameter upwards:
```python
'yql': 'select * from sources * where [{"targetNumHits":%i, "hnsw.exploreAdditionalHits":%i}]nearestNeighbor(vector,vector);' % (10, 3*512)
```
Recall:
Average recall Vespa = 0.994600
Average recall Opendistro for Elastic = 0.995500

Performance: Vespa actual query rate 53.32 Q/s versus Open Distro ES actual query rate 13.13 Q/s.
The open distro memory usage is very high, 35GB compared to Vespa's total including configuration services, search core process and serving container less than 10GB.
Is there a way to control ef_search in the query with Open Distro ES, @mnagaya?
> Is there a way to control ef_search in the query with Open Distro ES, @mnagaya?
Probably not.
Great, thanks for confirming @mnagaya. Odd that this parameter cannot be set at run time, as it's quite important for exploring a decent recall versus performance tradeoff. I'll capture a few more runs and update the README with the results. Again, thank you for your contribution.
@jobergum a couple of suggestions for benchmarking with Open Distro Elasticsearch's k-NN:
Graphs are loaded into memory only when you search and are cached afterwards, so the initial queries could take a hit. You might want to run warm-up queries, maybe for a couple of minutes. We have a plan to expose a warmup API to make this easy.
Force merge to a single segment so there is one single graph; a sketch of the call follows.
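A minimal sketch, assuming a local cluster at http://localhost:9200 and a hypothetical index named `myindex`; the `_forcemerge` endpoint and `max_num_segments` parameter are standard Elasticsearch:

```python
import requests

# Force merge the index down to one segment, giving a single HNSW graph
# to search; "myindex" and the endpoint are placeholder assumptions.
requests.post(
    "http://localhost:9200/myindex/_forcemerge",
    params={"max_num_segments": 1},
)
```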
Avoid loading all the stored fields. If you are just trying out vector search (all you need is the nearest doc ids for the query vector), you can improve performance by asking Elasticsearch not to read the stored fields. Example:
```json
{
  "size": 5,
  "stored_fields": "_none_",
  "docvalue_fields": ["_id"],
  "query": {
    "knn": {
      "v": {
        "vector": [-0.16490704, -0.047262248, -0.078923926],
        "k": 50
      }
    }
  }
}
```
You might find this link useful for indexing/search tuning https://medium.com/@kumon/how-to-realize-similarity-search-with-elasticsearch-3dd5641b9adb
@vamshin,
Yes, if you look at the README the benchmarking parameters are explained. A 20 second warmup is included, and I also run multiple times before publishing results in the README. I also welcome discussion with developers, like in https://github.com/jobergum/dense-vector-ranking-performance/issues/2.
On merging of segments: yes, this seems to be a major factor with both ES variants and should be mentioned. We mention this in the README, but it also highlights a weakness of the ES/Lucene indexing architecture with immutable segments.
I also notice that there are very large off-heap allocations for OES; 1M 960-dimensional vectors touches 55G virtual (the heap is 16G). top and pmap output:
```
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 elastic+ 20 0 54.8g 34.6g 4.9g S 100.7 13.8 326:12.45 java
```

```
00007f93eaabb000 120180 119944 0 r--s- _bn.cfs
00007f924c000000 130272 130272 130272 rw--- [ anon ]
00007f8ecc000000 131072 130696 130696 rw--- [ anon ]
00007f8e5c000000 131072 131072 131072 rw--- [ anon ]
00007f8e64000000 131072 131072 131072 rw--- [ anon ]
00007f8e6c000000 131072 131072 131072 rw--- [ anon ]
00007f8e74000000 131072 131072 131072 rw--- [ anon ]
00007f8e7c000000 131072 131072 131072 rw--- [ anon ]
00007f8e84000000 131072 131072 131072 rw--- [ anon ]
00007f8e8c000000 131072 131072 131072 rw--- [ anon ]
00007f8e94000000 131072 131072 131072 rw--- [ anon ]
00007f8e9c000000 131072 131072 131072 rw--- [ anon ]
00007f8ea4000000 131072 131072 131072 rw--- [ anon ]
00007f8eac000000 131072 131072 131072 rw--- [ anon ]
00007f8eb4000000 131072 131072 131072 rw--- [ anon ]
00007f8ebc000000 131072 131072 131072 rw--- [ anon ]
00007f8ec4000000 131072 131072 131072 rw--- [ anon ]
00007f8ed4000000 131072 131072 131072 rw--- [ anon ]
00007f8edc000000 131072 131072 131072 rw--- [ anon ]
00007f8ee4000000 131072 131072 131072 rw--- [ anon ]
00007f8eec000000 131072 131072 131072 rw--- [ anon ]
00007f8ef4000000 131072 131072 131072 rw--- [ anon ]
00007f8efc000000 131072 131072 131072 rw--- [ anon ]
00007f8f04000000 131072 131072 131072 rw--- [ anon ]
00007f8f0c000000 131072 131072 131072 rw--- [ anon ]
00007f8f14000000 131072 131072 131072 rw--- [ anon ]
00007f8f1c000000 131072 131072 131072 rw--- [ anon ]
00007f8f24000000 131072 131072 131072 rw--- [ anon ]
00007f8f2c000000 131072 131072 131072 rw--- [ anon ]
00007f8f34000000 131072 131072 131072 rw--- [ anon ]
00007f8f3c000000 131072 131072 131072 rw--- [ anon ]
00007f9244000000 131072 131072 131072 rw--- [ anon ]
00007f92e0000000 131072 131072 131072 rw--- [ anon ]
00007f92e8000000 131072 131072 131072 rw--- [ anon ]
00007f92f0000000 131072 131072 131072 rw--- [ anon ]
00007f92f8000000 131072 131072 131072 rw--- [ anon ]
00007f9300000000 131072 131072 131072 rw--- [ anon ]
00007f9308000000 131072 131072 131072 rw--- [ anon ]
00007f9310000000 131072 131072 131072 rw--- [ anon ]
00007f9338000000 131072 131072 131072 rw--- [ anon ]
00007f9394000000 131072 131072 131072 rw--- [ anon ]
00007f93b8000000 131072 131072 131072 rw--- [ anon ]
00007f92bde5e000 142268 141992 0 r--s- _78.cfs
00007f921ec1e000 142596 142596 142596 rw--- [ anon ]
00007f929edb8000 149792 149648 0 r--s- _7i.cfs
00007f922775f000 162000 161688 0 r--s- _a0.cfs
00007f92d14d2000 175288 174932 0 r--s- _ey.cfs
00007f97e805b000 261780 260620 260620 rw--- [ anon ]
00007f9353a4a000 267992 267656 0 r--s- _da.cfs
00007f92732a0000 338440 338080 0 r--s- _7u.cfs
00007f920573d000 369420 368952 0 r--s- _9g.cfs
00007f91902c1000 397932 397448 0 r--s- _b1.cfs
00007f9156a1b000 470948 470500 0 r--s- _dz.cfs
00007f9173604000 471796 471336 0 r--s- _cf.cfs
00007f91a875c000 475012 475012 0 r--s- _6c_Lucene80_0.dvd
00007f91c573d000 1048576 1048576 0 r--s- _6c_Lucene80_0.dvd
00007f908a562000 3347172 3347172 3347172 rw--- [ anon ]
0000000400000000 16777216 16777216 16777216 rw--- [ anon ]
total kB 57421916 36263388 3117405
```
Thanks @jobergum.
Graphs are loaded outside the ES heap, so we could take advantage of bigger instances with more RAM.
For vectors with d dimensions and HNSW max links M, k-NN graph memory usage would be on the order of (4d + 8M) bytes per vector.
For 1M vectors with 960 dimensions and M = 16, graph usage should be (4*960 + 8*16) * 1M bytes = ~4GB.
Total memory usage should be ES heap + graph memory usage + Lucene.
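As a quick sanity check of that estimate, a sketch in Python using the numbers above:

```python
# Back-of-the-envelope check of the (4d + 8M) bytes/vector estimate.
d = 960          # vector dimensions (gist-960-euclidean)
M = 16           # HNSW max links per node
n = 1_000_000    # number of vectors

bytes_per_vector = 4 * d + 8 * M      # 3968 bytes per vector
print(bytes_per_vector * n / 1e9)     # ~3.97, i.e. ~4GB as stated
```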
@jobergum
Have you increased the value of the "-n" parameter in vespa-fbench?
I got the following results in my environment.
Server: 2 x Intel(R) Xeon Silver 4114 2.20 GHz (Skylake).

Results:

| Engine | QPS | Average Latency (ms) | 95P Latency (ms) | Recall@10 |
|---|---|---|---|---|
| OES | 592.09 | 84.28 | 126.84 | 0.989400 |
| Vespa | 492.95 | 101.31 | 160.80 | 0.996300 |
Thanks,
Before merging segments (which took more than 2 hours):
```
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 elastic+ 20 0 54.8g 34.6g 4.9g S 100.7 13.8 326:12.45 java
```
After:
```
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 elastic+ 20 0 53.9g 29.9g 31944 S 7.0 11.9 439:55.03 java
```
16G is from the heap; the rest is off-heap. After merging segments, the recall also changes with the given index-time setting for ef_search: from 0.995500 with more segments (one HNSW graph per segment?) to 0.979600 with a single segment. That also makes the recall more comparable with Vespa at the same ef_search setting (Vespa's query-time setting "hnsw.exploreAdditionalHits"):
Average recall Vespa = 0.980400
Average recall Opendistro for Elastic = 0.979600
With lower recall and fewer segments to search, given ES's single-threaded query execution over N segments, the performance improves significantly:
Vespa: actual query rate: 95.26 Q/s
Open Distro ES: actual query rate: 84.79 Q/s
All details: https://gist.github.com/jobergum/ff46385c44bbb1683be16d98a8eed6ba
@mnagaya here, as in several ANN benchmarks (e.g. http://ann-benchmarks.com/), one reports QPS as a function of the average latency with 1 client, e.g. a 10 ms average latency gives 100 QPS. This gives a baseline, but I agree that it's also nice to prove that the technology scales with increased concurrency, as I'm pretty sure many of the libraries benchmarked in ann-benchmarks will have issues with contention at high concurrency.
@jobergum thank you.
Did you run with the following suggestion in the ES query (avoid reading stored fields and just read scores and doc ids)? It looks like all we need is the neighbors for recall. Example as mentioned in the above comments:
```json
{
  "size": 5,
  "stored_fields": "_none_",
  "docvalue_fields": ["_id"],
  "query": {
    "knn": {
      "v": {
        "vector": [-0.16490704, -0.047262248, -0.078923926],
        "k": 50
      }
    }
  }
}
```
This improved performance in our performance tests. It would be great if you could run with the above suggestion.
@vamshin The last quoted run above with 85 QPS was with the suggested stored_fields and docvalue_fields, yes. I have not had time to format and re-run the results to update the README yet. When OES has merged the segments into one, the performance of the two engines is almost equal. Though, the impact the number of segments has on search performance in OES is mind-blowing, as is the time it takes to merge segments, which will make a big impact in a real-time feed scenario.
> Though, the impact the number of segments has on search performance in OES is mind-blowing, as is the time it takes to merge segments, which will make a big impact in a real-time feed scenario.
@jobergum We can control segment creation at indexing time by disabling the refresh interval or setting a large one. Once all the indexing is done, we can call refresh to create the segments; see the sketch below. This way we end up with a minimal number of segments.
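A minimal sketch of that flow, assuming a local cluster at http://localhost:9200 and a hypothetical index named `myindex`; the `_settings` and `_refresh` endpoints are standard Elasticsearch:

```python
import requests

ES, INDEX = "http://localhost:9200", "myindex"  # assumed endpoint and index

# Disable automatic refresh while bulk indexing so segments are not
# created continuously during the feed.
requests.put(f"{ES}/{INDEX}/_settings",
             json={"index": {"refresh_interval": "-1"}})

# ... bulk index all documents here ...

# Restore a normal refresh interval and trigger one refresh,
# creating the segments in one go.
requests.put(f"{ES}/{INDEX}/_settings",
             json={"index": {"refresh_interval": "1s"}})
requests.post(f"{ES}/{INDEX}/_refresh")
```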
Would love to see the results in the README. Thank you.