nmslib / hnswlib

Header-only C++/python library for fast approximate nearest neighbors
https://github.com/nmslib/hnswlib
Apache License 2.0
4.27k stars 629 forks source link

The HNSW graph search returns less than the world size (so recall can never be 1.0) ? #352

Open sameraamar opened 2 years ago

sameraamar commented 2 years ago

Hi

Thanks for the great job and implementation.

Let's assume my world has N vectors. I have noticed that:

  1. if I run search where k=N (get me all vectors), I get only part of it.
  2. More than that, there are vectors that are never returned by hnsw::search() method

Note: I am doing #1 as a sanity test but the more interesting is #2.

Let's take this scenario as an example: M=32 efConstruction=800 My world has 101 vectors

When running hnsw indexer, I get a graph which has entry point vector #25 layer 1 with 4 vectors layer 0 with 101 nodes

The issue is that the vectors from #1 to #100 , each one has exactly 64 neighbors in layer 0, but non of them is vector #101 !

Item #101 is too far from everybody else so in fact it was isolated. The search() will never return vector #101 because we run using BFS , and our starting point is always the entry point (which is vector #25)

In fact, even if I run search(q) when my query is vector #101 itself, I get 100 vectors and non of them is #101 (as stated above, this is because we do BFS when our entry point #25)

How HNSW should deal with such scenario ? What am I doing wrong here?

Thanks

yurymalkov commented 2 years ago

Hi @sameraamar,

Yeah, that is known problem, though it rarely manifests in practice.

In theory a relative neighborhood graph which HNSW tries to approximate has only a single connected component, but sometime the candidates that are supplied to the RNG heuristic lack the connections needed to to make a link between two clusters.

That can be solved by extending the neighborhood of the candidates and that helps a lot in case you have such isolated points or clusters (that has been discussed in the original HNSW paper), but such policy makes index construction much more expensive. It also never did not help on any real datasets that I've tested, so it is turned off in nmslib by default via compile flag and not even implemented in hnswlib. I guess the ideal solution would be marking some connections as essential (like the ones that are selected by RNG despite being very far from the base element) and sharing only those.

There also several other solutions which I personally like less: