opendistro-for-elasticsearch / k-NN

🆕 A machine learning plugin which supports an approximate k-NN search algorithm for Open Distro.
https://opendistro.github.io/
Apache License 2.0

How can I reduce the time a segment force merge takes? #275

Closed wrkaiser closed 3 years ago

wrkaiser commented 3 years ago

vector_dimension=256, docs_count=1M, index.refresh_interval="5m" (five minutes)

When I call the "xxx/_forcemerge?max_num_segments=1&flush=true" HTTP endpoint, the segment merge succeeds but takes 1.5 hours.

jmazanec15 commented 3 years ago

Hi @wrkaiser,

Merge is an operation that we are looking to improve in the future. Currently, how it works is that we take the raw vectors from 2 segments and then build a new graph that contains all of them. This process is expensive and can take a long time.

At the moment, there are a few strategies you can follow to reduce the time:

  1. If you only want 1 segment, you can disable refresh completely during indexing by setting index.refresh_interval = -1, and then re-enable it after indexing finishes. I see that you already increased refresh_interval to 5 minutes, but going further and disabling refresh may help.
  2. Set replicas to 0 until after indexing and merging finish (if you have any). If you index with replicas, the graph building process will be duplicated for the primary and the replica. If you enable replicas after indexing, the graph will just be copied to the replica.
  3. Increase knn.algo_param.index_thread_qty. By default, it is set to 1. Increasing it will speed up graph building.
  4. If you are able to, try lowering the ef_construction parameter. This will impact recall, but if you can lower it while still meeting your requirements, it will improve graph building speed.
  5. Lastly, don't merge to 1 segment unless you need to. Because the HNSW algorithm scales with O(log(N)) complexity, searching over 1 graph with 1M documents will be faster than searching sequentially over 10 graphs with 100K documents each (i.e. log(1M) < 10log(100K)). However, if you can still meet your latency requirement with 10 segments, stopping at 10 segments instead of merging down to 1 will decrease merge time.
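
The settings from items 1 to 3 and the comparison in item 5 can be sketched as below. This is only an illustration: the function names are hypothetical, the "restore" values (refresh_interval of "5m", 1 replica) and the thread count of 4 are assumptions, and I am assuming knn.algo_param.index_thread_qty is applied through the cluster settings API while the other two go through the index settings API. The script only builds the JSON bodies; it does not contact a cluster.

```python
import json
import math


def bulk_load_settings():
    """Index settings to apply before indexing (items 1 and 2)."""
    return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}


def post_load_settings(refresh_interval="5m", replicas=1):
    """Index settings to restore after indexing and merging (values are assumptions)."""
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "number_of_replicas": replicas,
        }
    }


def thread_qty_settings(threads):
    """Cluster settings body raising the graph-building thread count (item 3)."""
    return {"persistent": {"knn.algo_param.index_thread_qty": threads}}


def relative_search_cost(total_docs, segments):
    """Rough model from item 5: each segment's graph costs about log(docs per segment)."""
    return segments * math.log(total_docs / segments)


if __name__ == "__main__":
    # "my-index" is a placeholder index name.
    print("PUT /my-index/_settings", json.dumps(bulk_load_settings()))
    print("PUT /_cluster/settings", json.dumps(thread_qty_settings(4)))
    print("PUT /my-index/_settings", json.dumps(post_load_settings()))
    # Compare one 1M-document graph against ten 100K-document graphs.
    print("1 segment:  ", relative_search_cost(1_000_000, 1))
    print("10 segments:", relative_search_cost(1_000_000, 10))
```

The cost function ignores constants and per-segment overhead, so it only shows the direction of the trade-off, not real latencies.
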

wrkaiser commented 3 years ago

Got it, thank you very much.