Open jwimberley opened 4 years ago
What's exactly sorting here?
Sorry, I was imprecise -- I used "sorting" above to mean "traversing a fixed set of N points in d-dimensional space in approximately increasing order of Euclidean distance from an arbitrary point in d-dimensional space." In my current experimental branch, the "sort" method just returns an std::vector of length N containing the indices of the dataset points, for the client to traverse through, but this is obviously wasteful if the client finds a matching result early in the traversal order.
@jwimberley Well, it seem to make sense to me. Can you give the context? What is the application?
@yurymalkov I can give a general context. The problem is to search the dataset for a point x that is both close to a test point T (the closer the better, without having to be exact) and satisfies some arbitrary boolean condition S(x,t). For example, this could be just a generic search routine to find a nearby neighbor satisfying some extra property; or a heuristic for a min-cost assignment problem, where there's a bipartite graph with left nodes T and right nodes X, and an edge from t to x either does not exist or has a cost related to the distance between t and x, with the nodes x perhaps being capacitized.
@searchivarius Thanks for the suggestion! If I'm reading it correctly, this would amount to running HierarchicalNSW::searchKnn with k=N, or are you suggesting something different? The main reason why didn't think this was workable, originally, is that I thought that the final step of inserting all N elements into a priority queue (and thus sorting them completely) would make the method no better than just sorting them from the get-go. (That was the motivation behind my implementation I linked to, where I change the return type from a priority queue to a vector). However, I think now I might have been mistaken about that. My understanding of the implementation is that elements are inserted into top_candidates here
in some close-to-sorted order, because the candidates are coming from the underlying graph traversal, and contrary to what I thought originally this is more efficient that inserting elements into std::priority_queue in a random order.
@jwimberley If you need to do two things - do knn search and statisfy a boolean condition, why not to apply the condition to the existing element deletion logic? E.g. you can treat the elements that do not satisfy the condition as deleted. I think this can be checked by overriding the isMarkedDeleted method.
The boolean condition varies on each query, so the overhead of deleting and undeleting elements before each query wouldn't be worth it, I think.
Based on some of the dialogue here, I'm going to try two directions:
Thank you both for your suggestions!
@jwimberley Sure.
Concerning the deletions - you do not need to "delete" the element, you can just change the method isMarkedDeleted
to your boolean method.
A general question rather than an issue, really ---
I've been experimenting with the use of HNSW for a (very slightly) different problem than k-nearest neighbors, and wanted to ask whether it's really within HNSW's comfort zone, so to speak. The task is, given a dataset of N points, to be able to traverse the dataset in approximate nearest-order from many arbitrary starting points in the space. Thus, "k" is not really predetermined, and could be anywhere between 1 and N, depending on whether the search finds what it wants before it exhausts the dataset -- which I expect (in my use case) to be quite frequent.
As a first step, I've adapted the Hierarchical::searchKnn method to return an approximately sorted vector of all points in the dataset, in a method approxSortKnn in this forked branch:
https://github.com/jwimberley/hnswlib/tree/approx_graph_sorting
Essentially the brute force modification overloads searchBaseLayerST with a similar method that bounds the size of the priority_queue of top candidates by Ef, and pops out the excess elements (ordered by negative distance) into the quasi-sorted return vector. Based on some tests this appears to work, but certainly might be flawed due to misunderstandings of HNSW on my part. A more complete implementation would be a visitor method that would apply a user supplied functor to each candidate in the search and send a termination signal.
Is this a valid task that HNSW could be used for? Or do you think other approaches would more suited to traversing nearly the full dataset from nearest-to-farthest neighbors?
The forked branch I linked to above includes some trivial tests running and timing this approximate sort. I note in the commit logs that this method is currently slower than fully sorting the dataset using std::sort, but I don't think that's very fair comparison just yet since