An R package for finding approximate nearest neighbors, translated from the Python package PyNNDescent written by the great Leland McInnes. As the name suggests, it uses the Nearest Neighbor Descent method (Dong et al., 2011), but also makes use of Random Partition Trees (Dasgupta and Freund, 2008) as well as ideas from FANNG and NGT.
You can use rnndescent for:
See the Get Started article for the basics. The other vignettes go into more detail.
14 May 2024: Version 0.1.6 has been released to CRAN. The previous release
didn't quite get compatibility with dqrng
right so here is another attempt.
Also a couple of other bug fixes have been included.
install.packages("rnndescent")
# install.packages("pak")
pak::pkg_install("jlmelville/rnndescent")
This packages makes use of C++ code which must be compiled. You may have to carry out a few extra steps before being able to build:
Windows: install
Rtools and ensure
C:\Rtools\bin
is on your path.
Mac OS X: using a custom ~/.R/Makevars
may cause linking errors.
This sort of thing is a potential problem on all platforms but seems to bite
Mac owners more.
The R for Mac OS X FAQ
may be helpful here to work out what you can get away with. To be on the safe
side, I would advise building without a custom Makevars
.
rnndescent
uses C++17. This shouldn't be too big a problem but not all R
platforms support it (sorry if this affects you).
library(rnndescent)
# Find 15-knn
iris_knn <- rnnd_knn(iris, k = 15)
# Build an index
iris_even <- iris[seq_len(nrow(iris)) %% 2 == 0, ]
# Specify the number of neighbors you are likely to want to query for
iris_index <- rnnd_build(iris_index, k = 15)
# Query then index
iris_odd <- iris[seq_len(nrow(iris)) %% 2 == 1, ]
iris_query_nn <- rnnd_query(index = iris_index, query = iris_odd, k = 15)
For more details, please see the documentation.
Many. See the metrics article for a list.
Compared to PyNNDescent, rnndescent
is currently lacking, in decreasing order
of likelihood of implementation:
saveRDS
like any R data for example), but it's not very
efficient. Keeping the index as an R-wrapped C++ class has its own downsides but
would fix that.Missing Metrics
below for those that are currently not available.The issues around index serialization and parallel behavior make rnndescent
currently unsuitable for streaming applications where you are querying one item
at a time. If you are doing batch queries, where you are querying multiple items
at once, then rnndescent
should be fine: for example, generating nearest
neighbors for UMAP (maybe for use with
uwot). Dimensionality reduction is my
personal use case for nearest neighbors calculation and I would like to get
rnndescent
onto CRAN in a useful-for-something state. As a result I am not
targeting an initial release to support the streaming case. I would like to fix
this for a subsequent release.
Also there is no specialized distance code to take advantage of AVX etc., so
rnndescent
is going to run slower than other packages. This wouldn't be
allowed on CRAN anyway, but might be a nice-to-have for installing from github.
Dong, W., Moses, C., & Li, K. (2011, March). Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World wide web (pp. 577-586). ACM. doi.org/10.1145/1963405.1963487.
The R package as a whole is licensed under GPLv3 or later. The following files are licensed differently:
inst/include/dqsample.h
is a modification of some sampling code
from dqrng and is
AGPLv3 or later.inst/include/RcppPerpendicular.h
is a modification of some code from
from RcppParallel and is
GPLv2 or laterinst/include/tdoann
, is licensed under the
BSD 2-Clause.As far as I know, these licenses are all compatible with re-licensing under GPLv3 or later.
The following metrics are in PyNNDescent but are not supported in rnndescent:
These require passing extra information as part of the metric definition, which is not currently supported.