nmslib / hnswlib

Header-only C++/python library for fast approximate nearest neighbors
https://github.com/nmslib/hnswlib
Apache License 2.0
4.3k stars 633 forks source link

Correlation of `M`, `k` and `number of items` in the index #563

Open alexzaf7 opened 4 months ago

alexzaf7 commented 4 months ago

Hello,

I am creating various indexes with the following options:

The number of items in each index is the only thing that can vary, mostly 100 < max_elements < 20000. Notice that the items are all added at once, without performing additions or resizing.

The issue I face is that when I perform a search on an index with a smaller size (~1000), I get back the "Cannot return the results in a contigious 2D array. Probably ef or M is too small" error. I set k to min(max_elements, 2000) so I never request more items than the index size).

The same knn query on an index with more items (~5000) works fine.

After some experimentation, setting M=50 seems to work and regardless of the index size this error is not occuring anymore.

Why the above fixed the issue is still unclear to me and it brings me back to my initial question, if there is any correlation between M, k and the number of items, or any correlation between these parameters I should be aware of.

Thanks

yurymalkov commented 4 months ago

Hi @alexzaf7,

For indices that small and k that big you are probably better of with a brute-force approach. The problem is that the graph is sometimes loses some of the elements, leading to some elements being not reachable.

If the intention to get all items, you'd be better off with a brute-force index, especially if the sizes are small. HNSW's benefit is due to sparsity (k<<N)