The HNSW graph search returns less than the world size (so recall can never be 1.0) ?

Thanks for the great job and implementation.

Let's assume my world has N vectors. I have noticed that:

if I run search where k=N (get me all vectors), I get only part of it.
More than that, there are vectors that are never returned by hnsw::search() method

Note: I am doing #1 as a sanity test but the more interesting is #2.

Let's take this scenario as an example: M=32 efConstruction=800 My world has 101 vectors

When running hnsw indexer, I get a graph which has entry point vector #25 layer 1 with 4 vectors layer 0 with 101 nodes

The issue is that the vectors from #1 to #100 , each one has exactly 64 neighbors in layer 0, but non of them is vector #101 !

Item #101 is too far from everybody else so in fact it was isolated. The search() will never return vector #101 because we run using BFS , and our starting point is always the entry point (which is vector #25)

In fact, even if I run search(q) when my query is vector #101 itself, I get 100 vectors and non of them is #101 (as stated above, this is because we do BFS when our entry point #25)

How HNSW should deal with such scenario ? What am I doing wrong here?

Thanks

Hi @sameraamar,

Yeah, that is known problem, though it rarely manifests in practice.

In theory a relative neighborhood graph which HNSW tries to approximate has only a single connected component, but sometime the candidates that are supplied to the RNG heuristic lack the connections needed to to make a link between two clusters.

That can be solved by extending the neighborhood of the candidates and that helps a lot in case you have such isolated points or clusters (that has been discussed in the original HNSW paper), but such policy makes index construction much more expensive. It also never did not help on any real datasets that I've tested, so it is turned off in nmslib by default via compile flag and not even implemented in hnswlib. I guess the ideal solution would be marking some connections as essential (like the ones that are selected by RNG despite being very far from the base element) and sharing only those.

There also several other solutions which I personally like less:

never prune. This would help, there will be some connections as we don't delete them, but even those that will be there would be hard to discover, so it does not really solve the problem
one can assign the high level elements by doing a hierarchical clustering, that would mean there are links to the isolated elements from the higher levels. But the clustering also can miss the element and you cannot get k more than the number of elements in the cluster and it does not play well with updates.
One can make sure that every element is discoverable explicitly. That has been done in NSG, but that does not play well with insertions and updates as they screw this up.
One can look for the elements with small in-degree and find good candidates for those elements which happen to have room. This indeed increases a recall-speed a bit, but does not solve the problem with clustered data, as it can happen even if all elements have high in-degrees but fall into several separated groups.

nmslib / hnswlib

The HNSW graph search returns less than the world size (so recall can never be 1.0) ? #352