nmslib / hnswlib

Header-only C++/Python library for fast approximate nearest neighbors
https://github.com/nmslib/hnswlib
Apache License 2.0

multi-node parallel hnsw #377

Closed jianshu93 closed 9 months ago

jianshu93 commented 2 years ago

Dear team,

This is a question about whether we can parallelize HNSW across different compute nodes (i.e., distributed memory), not just on one node. If not, I can split my database into several pieces, build a graph for each piece, and then query each sub-graph for its best results. In the end, I collect the results from all sub-graphs and sort them by distance to get an overall result for the entire database. Does this sound problematic to you in theory? Or do you have any recommendations for multi-node solutions?
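The split/query/merge workflow described above can be sketched in a few lines. This is a minimal, illustrative example that uses brute-force exact search in plain Python as a stand-in for per-shard HNSW indices; the function and variable names (`knn`, `shards`, `merged`) are my own, not part of hnswlib's API:

```python
import heapq
import random

def knn(database, query, k):
    # Exact k-nearest-neighbor search by squared distance (1-D toy data).
    # In practice each call would be a knn_query against a per-shard ANN index.
    dists = [((p - query) ** 2, p) for p in database]
    return heapq.nsmallest(k, dists)

random.seed(0)
data = [random.uniform(0, 100) for _ in range(1000)]  # toy 1-D database
query, k, num_shards = 50.0, 10, 10

# Split the database into pieces ("build a graph for each piece").
shards = [data[i::num_shards] for i in range(num_shards)]

# Query every shard for its local top-k, then pool and re-sort by distance.
partial = [hit for shard in shards for hit in knn(shard, query, k)]
merged = heapq.nsmallest(k, partial)

# With exact per-shard search, the merged result equals the global top-k.
assert merged == knn(data, query, k)
print("merged top-k matches global top-k")
```

With approximate per-shard indices the equality becomes approximate, which is exactly the question discussed below.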

Thanks,

Jianshu

yurymalkov commented 2 years ago

Hi @jianshu93,

I think that is reasonable, and that is how it generally works in nearest neighbor search services. IMO, the main limitation of the approach is that, without additional replication, the throughput would scale sub-linearly with the number of compute nodes (= compute budget), as there is less freedom for divide-and-conquer approaches.

jianshu93 commented 2 years ago

Hello @yurymalkov,

Thanks for the quick response. I did some testing and it actually works. For example, the top 100 best hits are the same after collecting results from 10 pieces of the original big database and sorting by distance. Is there any evidence I can cite that this works in NNS services, or any other theoretical/experimental analysis?

Thanks,

Jianshu

searchivarius commented 2 years ago

Hi @jianshu93, it's not clear what you mean by distributed computation. In the most common scenario, called sharding, the database is split into K chunks that are queried in parallel (as an option, aggregation can be organized hierarchically if the number of chunks is large). Modulo some network delays this certainly works (though it is challenging for a really large number of chunks, as packets will get lost or delays will be too large).

If you split the neighborhood graph itself, it's unlikely to work efficiently unless you have an extremely fast network and all chunks are sitting next to each other. Even then, it is still likely to be slower than regular sharding: sending packets from one machine to another and back is quite costly, and you may need to do it multiple times during a single search. So even if you can make the throughput good, the latency is going to suffer a lot.

jianshu93 commented 2 years ago

@searchivarius Hello, yes I was talking about sharding (splitting the database before building the graphs), NOT splitting the graph. Empirically, after pooling results from 10 pieces of the database and sorting by distance, I always get the same top 10 compared to not sharding. But I cannot find a proof that this works in theory.

Thanks,

Jianshu

searchivarius commented 2 years ago

@jianshu93 if the search is exact, then taking the top-k results from each of the K shards provably retrieves the top-k of the complete collection. The reason is simple: imagine some number k1 <= k of the global top-k land in the i-th shard. Because they are best globally, they are also best in this shard, and thus among the top-k retrieved there.

Thus, by retrieving at least k records from each shard you won't lose anything. However, if you retrieve fewer than k, you may lose some. You can also lose results because of the approximate nature of the search. Assuming the probability of losing a given neighbor within a shard is no greater than the probability of losing it in a global search, you can likewise argue that your "sharded" recall will be at least as high.

Thus, with approximate retrieval you won't get exactly the same results, but they should be about the same, and on average the recall should match. It can actually be a bit higher, because it's easier to get a high recall on a small collection.
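The argument above is easy to check numerically. Here is a small, illustrative property test (exact search only, all names are mine): since each global winner is also the best within its own shard, the pooled per-shard top-k candidates must always contain the full global top-k.

```python
import random

def topk_ids(points, ids, q, k):
    # Return the ids of the k points closest to q (exact search, 1-D toy data).
    return set(i for _, i in sorted(((p - q) ** 2, i) for p, i in zip(points, ids))[:k])

random.seed(1)
for trial in range(20):
    n, k, num_shards = 500, 10, 5
    data = [random.random() for _ in range(n)]
    q = random.random()
    global_top = topk_ids(data, range(n), q, k)

    # Union of per-shard top-k; each shard keeps its original ids.
    pooled = set()
    for s in range(num_shards):
        ids = list(range(s, n, num_shards))
        pooled |= topk_ids([data[i] for i in ids], ids, q, k)

    # A global top-k member is best in its own shard, hence in that shard's
    # top-k, hence in the pooled candidate set.
    assert global_top <= pooled
print("global top-k recovered from shards in all trials")
```

With approximate per-shard search the containment holds only probabilistically, which is the recall argument above.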

jianshu93 commented 2 years ago

Thanks! This is super helpful!

Jianshu

sandszhouSZ commented 2 years ago

thanks, @searchivarius

zzzmm1 commented 1 year ago

Hello @searchivarius. If we search a query over the data shards in parallel, the number of distance computations required to search the graph built from each shard is roughly the same as searching the whole graph directly with the same ef setting. The search time per shard is then about the same as for the original graph, so does that mean searching the shards in parallel does not actually improve search speed? If I have something wrong, please point it out.