nmslib / hnswlib

Header-only C++/python library for fast approximate nearest neighbors
https://github.com/nmslib/hnswlib
Apache License 2.0
4.35k stars 644 forks source link

Applicability of HNSW for approximate sorting #232

Open jwimberley opened 4 years ago

jwimberley commented 4 years ago

A general question rather than an issue, really ---

I've been experimenting with the use of HNSW for a (very slightly) different problem than k-nearest neighbors, and wanted to ask whether it's really within HNSW's comfort zone, so to speak. The task is, given a dataset of N points, to be able to traverse the dataset in approximate nearest-order from many arbitrary starting points in the space. Thus, "k" is not really predetermined, and could be anywhere between 1 and N, depending on whether the search finds what it wants before it exhausts the dataset -- which I expect (in my use case) to be quite frequent.

As a first step, I've adapted the Hierarchical::searchKnn method to return an approximately sorted vector of all points in the dataset, in a method approxSortKnn in this forked branch:

https://github.com/jwimberley/hnswlib/tree/approx_graph_sorting

Essentially the brute force modification overloads searchBaseLayerST with a similar method that bounds the size of the priority_queue of top candidates by Ef, and pops out the excess elements (ordered by negative distance) into the quasi-sorted return vector. Based on some tests this appears to work, but certainly might be flawed due to misunderstandings of HNSW on my part. A more complete implementation would be a visitor method that would apply a user supplied functor to each candidate in the search and send a termination signal.

Is this a valid task that HNSW could be used for? Or do you think other approaches would more suited to traversing nearly the full dataset from nearest-to-farthest neighbors?

The forked branch I linked to above includes some trivial tests running and timing this approximate sort. I note in the commit logs that this method is currently slower than fully sorting the dataset using std::sort, but I don't think that's very fair comparison just yet since

searchivarius commented 4 years ago

What's exactly sorting here?

jwimberley commented 4 years ago

Sorry, I was imprecise -- I used "sorting" above to mean "traversing a fixed set of N points in d-dimensional space in approximately increasing order of Euclidean distance from an arbitrary point in d-dimensional space." In my current experimental branch, the "sort" method just returns an std::vector of length N containing the indices of the dataset points, for the client to traverse through, but this is obviously wasteful if the client finds a matching result early in the traversal order.

yurymalkov commented 4 years ago

@jwimberley Well, it seem to make sense to me. Can you give the context? What is the application?

jwimberley commented 4 years ago

@yurymalkov I can give a general context. The problem is to search the dataset for a point x that is both close to a test point T (the closer the better, without having to be exact) and satisfies some arbitrary boolean condition S(x,t). For example, this could be just a generic search routine to find a nearby neighbor satisfying some extra property; or a heuristic for a min-cost assignment problem, where there's a bipartite graph with left nodes T and right nodes X, and an edge from t to x either does not exist or has a cost related to the distance between t and x, with the nodes x perhaps being capacitized.

@searchivarius Thanks for the suggestion! If I'm reading it correctly, this would amount to running HierarchicalNSW::searchKnn with k=N, or are you suggesting something different? The main reason why didn't think this was workable, originally, is that I thought that the final step of inserting all N elements into a priority queue (and thus sorting them completely) would make the method no better than just sorting them from the get-go. (That was the motivation behind my implementation I linked to, where I change the return type from a priority queue to a vector). However, I think now I might have been mistaken about that. My understanding of the implementation is that elements are inserted into top_candidates here

https://github.com/nmslib/hnswlib/blob/3c6a84fb54a2a92dc4b7dc3e13dbae3d99892444/hnswlib/hnswalg.h#L318-L319

in some close-to-sorted order, because the candidates are coming from the underlying graph traversal, and contrary to what I thought originally this is more efficient that inserting elements into std::priority_queue in a random order.

yurymalkov commented 4 years ago

@jwimberley If you need to do two things - do knn search and statisfy a boolean condition, why not to apply the condition to the existing element deletion logic? E.g. you can treat the elements that do not satisfy the condition as deleted. I think this can be checked by overriding the isMarkedDeleted method.

jwimberley commented 4 years ago

The boolean condition varies on each query, so the overhead of deleting and undeleting elements before each query wouldn't be worth it, I think.

Based on some of the dialogue here, I'm going to try two directions:

Thank you both for your suggestions!

yurymalkov commented 4 years ago

@jwimberley Sure. Concerning the deletions - you do not need to "delete" the element, you can just change the method isMarkedDeleted to your boolean method.