use Rcpphnsw instead of Rcppannoy to accelerate nearest neighbor search

jianshu93 commented 2 years ago

Dear uwot developer,

I think hnsw is about 10-100 time faster than annoy to find nearest neighbors especially for large scale dataset. The Rcpphnsw is a very good copy of original hnsw without SIMD acceleration. 3 distances are supported.

Thanks,

Jianshu

jlmelville commented 2 years ago

Hi @jianshu93,

HNSW is faster than Annoy for queries, but the benchmarks don't usually include the cost of building the index, which is usually slower than for Annoy. This makes sense for a lot of use cases where the index is built once and then queried many times, but for UMAP, there is one build and one query.

The last time I checked, the overall cost of building an HNSW index + querying was slower than for Annoy (for a given accuracy).

You may also be interested in the discussion at: https://github.com/lmcinnes/umap/issues/213

jianshu93 commented 2 years ago

Hi Thanks for the response,

I now understand it very well. I notice another hnsw implementation called granny (https://github.com/granne/granne), the index and query step are nearly done simultaneously and it is very fast (run the example to have a try), word indexing and embedding takes seconds (400,000 database vector, 300 dimension). Would that be relevant and worth some efforts to make a R wrapper? It was built using Rust, which is a memory efficient and high performance language. It was designed for a search engine.

Thanks,

Jianshu

jlmelville commented 2 years ago

Thank you for the pointer to granne, which I hadn't seen. Unfortunately, I don't program in Rust currently, and don't know what the state of interfacing such code with R is, but I will keep it in mind.

jianshu93 commented 2 years ago

Hello uwot developer,

There is one more discussion I want to start: for annoy, random projection forest was constructed first but you still need to run the query step for nearest neighbors for each element in the database right. For hnsw, as long as the small world graph is constructed, nearest neighbors for each datapoint in database is already there and just extract them (e.g., a small world graph to kgraph). So there is not need to query for hnsw. The time complexity for database building for hnsw is Nlog(N) while search is log(N), where N is the number of elements in database. For annoy, I think time complexity to build is Log(N) right (binary split of database elements) while you still need to search to find nearest neighbors for each database element. I am wondering whether Nlog(N) is better or bad than log(N) + time complexity of query for annoy, which is at least an order of magnitude slower than that of log(N) for hnsw.

Thanks,

Jianshu

jlmelville commented 2 years ago

For hnsw, as long as the small world graph is constructed, nearest neighbors for each datapoint in database is already there and just extract them (e.g., a small world graph to kgraph). So there is not need to query for hnsw.

Does hnswlib provide an interface to convert the build small world network to a knn graph?

jianshu93 commented 2 years ago

As far as I know，No. We may need to explore a little bit or implement ourselves. I feel like it is at least as fast as annoy because searching nearest neighbors for all elements is expensive.

I think it is easy because we only need the bottom layer and extract top k best neighbors for each element.

Thanks Jianshu

jianshu93 commented 2 years ago

graph structure is essentially a list recording links to other elements with distance.

jlmelville commented 2 years ago

Whoever attempts this (and it's not likely to be me), this seems like something that should be pursued at the C++ level in hnswlib itself. Probably not something I would want to maintain in RcppHNSW and would not be appropriate for inclusion in uwot directly. But good luck if you intend to pursue this as I would certainly be happy to make use of it.

jianshu93 commented 2 years ago

Hello James,

I noticed that for the search step, annoy always uses 4 threads though I assigned 24 to it on a server, there are no acceleration compare to running it on a laptop. Is the search well parallelized?

Thanks,

Jianshu

jlmelville commented 2 years ago

@jianshu93 can you open a new issue for this?

In that issue, please provide both the command you ran and the output (by adding verbose = TRUE) if necessary. In addition can you say how you know only 4 of the 24 threads were assigned (e.g. uwot output or from looking at top or something else?)

jianshu93 commented 2 years ago

Hello James,

I will start a new issue.

Another question to you: is annoy robust to high dimension dataset (or more precisely, high local intrinsic dimension dataset with LID>20, see the NN-descent paper, Dong et.al., 2010). From my perspective, random projection will also suffer from curse of dimensionality, especially for L2 distance. I am wondering the benefits of using annoy instead of N-descent (O(N^1.14), which is quite fast, even faster than hnsw, which is O(Nlog(N)), note that only larger than O (N^(1+1/e)) is slower than O(Nlog(N))). What is the time complexity for annoy both build+search, is there a precise analysis about the complexity?

A very interesting paper: The Role of Local Intrinsic Dimensionality in Benchmarking Nearest Neighbor Search

Check figure 2 in the paper, as LID increases (left to right), annoy performance decrease much faster than hnsw, this is very interesting. For glove (angular distance), the most right one, both annoy and hnsw have bad recall.

Thanks,

Jianshu

jlmelville commented 2 years ago

In my experience Annoy does not perform well with high dimensional datasets: even with a large value of n_trees and search_k it can run for a long time and still give poor accuracies. This remains a major outstanding problem for uwot from my perspective. However, Annoy is used in many, many other methods (in R or Python) so it's not a problem that uwot alone faces.
I am not a fan of the typical alternative strategy of carrying out a truncated SVD to ~100 dimensions and calculating Annoy on that (although it is implemented in uwot), as it may cause substantial distortion of the underlying data.
The Python UMAP package does not suffer as badly from the problem because it uses pynndescent which uses Nearest Neighbor Descent (initialized with a random projection tree). pynndescent seems to give better results than Annoy for high dimensional data.
I don't know of any complexity analysis for Annoy.

Nearest Neighbor Descent seems like a good option in principle. Note that you can use Annoy and nearest neighbor descent, by using NND to "optimize" the result returned by Annoy. This is what largevis does.

The problem is that the last time I looked, there is no package on CRAN for NND. If there was, to be a practical addition to uwot, it would be need to be sufficiently flexible to handle the metrics used by uwot and to deal with transforming new data. For the latter, you would need to either: store the original training data as part of the uwot "model" (which would obviously substantially increase the model size) or require the user to pass it separately in uwot_transform, which is a poor user experience.

jianshu93 commented 2 years ago

Hello James,

I want to revisit this topic. As I said before, after building the hnsw, k best neighbors can be extracted from hnsw graph bottom layer (layer 0) in consistent time. We have implemented it recently in Rust. I analyze the NN descent time complexity N^1.14, which will be much slower than hnsw, which N(log(N)) for N >> 10^5. For annoy, build the database is log(N), however to improve accuracy, we need to build the tree multiple times, like 50 trees. We then need to search each element in the database for those 50 trees, search a binary tree is Log(N), for N element will be N(log(N)) and then also multiple by 50, so it will be 50Nlog(N), which is 50 times slower than HNSW. NN descent is very bad and slow for hubness dataset, which has been shown in a number of papers. I do not understand why in your benchmark hnsw is much slower than annoy, because build is Nlog(N), search is log(N) N elements, it is only 2Nlog(N), even if you search each element after building, which is not necessary at all. Note that hnswlib header files is not parallelized, but the python binding is paralleled, do you compare single thread hnsw with multiple thread annoy?

Thanks,

Jianshu

jlmelville commented 2 years ago

Hi Jianshu,

I want to revisit this topic. As I said before, after building the hnsw, k best neighbors can be extracted from hnsw graph bottom layer (layer 0) in consistent time. We have implemented it recently in Rust.

Nice, I remember you mentioning this as an idea. I am glad you got it to work. I would be happy to make use of it in RcppHNSW if the original C++ library ever gets this functionality.

We then need to search each element in the database for those 50 trees, search a binary tree is Log(N), for N element will be N*(log(N)) and then also multiple by 50, so it will be 50_N_log(N), which is 50 times slower than HNSW.

I am no computer scientist, but I don't think big-O analysis works like that.

NN descent is very bad and slow for hubness dataset, which has been shown in a number of papers

My experience with pynndescent is that it is competitive with other methods even for high-dimensional datasets. The difference may be that NN Descent traditionally used a random initialization, and pynndescent uses a tree-based initialization. Have you compared pynndescent with HNSW?

Note that hnswlib header files is not parallelized, but the python binding is paralleled, do you compare single thread hnsw with multiple thread annoy?

I don't think so. But maybe I misremember.

If you run your own benchmarks for using multi-threaded Annoy and HSNW (through R) across a sufficient number of datasets, I would be very happy to see them. I could certainly be persuaded that HNSW is superior for nearest neighbor search and concede this point.

But there would still be the effort to integrate it into uwot and then maintain compatibility with both Annoy and HNSW which is a lot of work. I have slogged away on an nearest neighbor descent library to give better nearest neighbor performance for many years without being satisfied with it, and that's without the prospect of safely integrating it into uwot, which is not a task I particularly look forward to, so that's really the limiting factor to adding more nearest neighbor functionality.

If the Annoy-based nearest neighbor search is holding you back, you can generate the nearest neighbors in HNSW and (once converted to a suitable form) pass that to the nn_method argument in the umap function:

library(uwot)
library(RcppHNSW)

iris_nbrs <- hnsw_knn(as.matrix(iris[, -5]), k = 15)
iris_umap_hnsw <- umap(iris, nn_method = iris_nbrs)

Absolutely no Annoy involved in that example.

jianshu93 commented 2 years ago

Hello James,

I am not sure about the annoy big O. They refuse to provide one. I just analyze based on my understandings of binary search tree, which is very similar to the space partitioning in annoy.

Parallelism was only supported after Aug 2020, I saw that you committed to the Rcpp: parallel_for() in Rcpphnsw page. I think your benchmark is based on the paralleled version right?

Thanks for the R guidance, I will definitely test it using Mnist-fashion dataset.

Jianshu

jlmelville commented 2 years ago

I am not sure about the annoy time complexity because they refuse to provide one but for hnsw I am pretty sure because the > original paper provides details.

Right but to clarify, I don't think this is a fruitful line of inquiry. You can't use big O notation to say: two algorithms have NlogN scaling, but you need to run algorithm 1 50 times so it's 50 times slower than algorithm 2. That's not how big O analysis works. It's saying what the asymptotic behavior of the algorithm is. If something is O(NlogN) that doesn't mean it scales exactly with NlogN. It means it scales by c * NlogN where c is some constant (well, the actual definition is a bit different but that's close enough). The point is that when N becomes very large, then c becomes unimportant. But both Annoy and HNSW have different values for c. If HNSW's c is more than 50 times bigger than the one Annoy has then it doesn't matter if you have to run Annoy 50 times, Annoy comes out ahead. The exact value of c is going to be different depending on any specific value of N and the hardware etc. That's why you do the big-O analysis, to avoid all of this and only talk about what happens as N tends to infinity.

You would be better off benchmarking Annoy and HSNW on a large number of datasets with different values of k and parameters and finding the runtime that it takes to reach equivalent accuracies. Not an easy task.

Also, because compiler settings are restricted with R packages, you also have to take that into account for the purposes of uwot: I can't use architecture-specific optimizations, which could make a big difference to any Annoy vs HNSW comparison. But I understand that isn't necessarily of value to anyone who doesn't want to use an R package.

It was only supported after 2020, I get It from the Rcpphnsw page:

August 30 2020. Although not yet on CRAN, support for building and searching an index in parallel (via the n_threads function argument and setNumThreads object method) has been added to the current development version (available via devtools::install_github). Thanks to Dmitriy Selivanov for a lot of the work on this.

Sure, but bear in mind that I am the maintainer of that package so I wrote that! I am very aware of when that package added multi-threading. It would have been a single-threaded comparison that I looked at.

It is an interesting question about how well adding threads works with either method but not one I have looked at.

jianshu93 commented 2 years ago

A quick answer, with k=50, hnsw is about 2 times faster than annoy for now, for the fashion dataset, embedding results are nearly the same (small differences). I will attached the figure soon.

Jianshu

jianshu93 commented 2 years ago

I am trying using the largest dataset I have even seen, the Higgs boson dataset generated by large hadron collider: https://archive.ics.uci.edu/ml/datasets/HIGGS

You can download the data folder and the HIGGS.csv in it is the 11 million datapoint with 28 features (1st column is label). If my big O analysis is somewhat close the the truth, you will see huge difference in terms of the NN step between annoy and hnsw. My testing using a 128 thread cluster showed that it takes only 10 min for the NN step using hnsw. I believe NN descent cannot finish within several hours while annoy could be also very slow because the larger the number of database elements (time complexity).

Thanks,

Jianshu

jlmelville commented 8 months ago

As usual, it's taken years, but RcppHNSW is now supported as an optional dependency (i.e. you must install it yourself). For now this functionality lives on the hnsw branch, but is likely to be merged soon. I really don't want to submit this to CRAN without some substantial testing, and the usual CRAN prep takes weeks so this isn't coming super soon.

But if anyone is feeling brave, I would love to get as many issues ironed out before submission.

install.packages("RcppHNSW")
# I admit I haven't tested this bit
devtools::install_github("jlmelville/uwot", ref = "hnsw")

You can then use HNSW for nearest neighbor search and it can be controlled via the nn_args list using the M, ef_construction and ef parameters e.g.

iris_umap_hnsw <- umap(iris, nn_method = "hnsw", nn_args = list(M = 4, ef = 128, ef_construction = 20))

The meaning of these parameters are given in the hnswlib documentation.

jlmelville commented 8 months ago

This has been merged and tested so should appear in the next CRAN release.

jlmelville / uwot

use Rcpphnsw instead of Rcppannoy to accelerate nearest neighbor search #87