Rcpp bindings for hnswlib.
February 4 2024 RcppHNSW 0.6.0 is released to CRAN, supporting hnswlib version 0.8.0.
September 19 2023 RcppHNSW 0.5.0 is released to CRAN, supporting
hnswlib version 0.7.0,
a getItems
method for returning the items used to build the index and some
performance improvements if your data is already column-stored. Also, a small
roxygen problem with the package documentation was fixed.
July 18 2022 RcppHNSW 0.4.1 is released. Unfortunately, there are valgrind problems with the version of hnswlib used in RcppHNSW 0.4.0, so that has been rolled back.
July 16 2022 RcppHNSW 0.4.0 is released. This release matches hnswlib version 0.6.2, but otherwise adds no new features. Some minor CRAN check NOTEs are fixed and there is also a minor license change: previously the license was GPLv3. From this version, it now supports GPLv3 or later.
September 6 2020 RcppHNSW 0.3.0 is now available on CRAN, with multi-threading support.
August 30 2020. Although not yet on CRAN, support for building and searching
an index in parallel (via the n_threads
function argument and setNumThreads
object method) has been added to the current development version (available via
devtools::install_github
). Thanks to
Dmitriy Selivanov for a lot of the work on
this.
September 20 2019. RcppHNSW 0.2.0 is now available on CRAN, up to date with
hnswlib at https://github.com/nmslib/hnswlib/commit/c5c38f0, with new methods:
size
, resizeIndex
and markDeleted
. Also, a bug that prevented searching
with datasets smaller than k
has been fixed. Thanks to
Yuxing Liao for spotting that.
January 21 2019. RcppHNSW is now available on CRAN.
October 20 2018. By inserting some preprocessor symbols into hnswlib, these
bindings no longer require a non-portable compiler flag and hence will pass R CMD CHECK
without any warnings: previously you would be warned about
-march=native
. The price paid is not using specialized functions for the
distance calculations that are architecture-specific. I have not checked how bad
the performance hit is. The old settings remain in src/Makevars
and
src/Makevars.win
(commented out), if you want to build the project from
source directly. Otherwise, Release
0.0.0.9000 is
the last version with the old behavior, which can be installed with something
like:
devtools::install_github("jlmelville/rcpphnsw@v0.0.0.9000")
hnswlib is a header-only C++ library for finding approximate nearest neighbors (ANN) via Hierarchical Navigable Small Worlds (Yashunin and Malkov, 2016). It is part of the nmslib project.
An R package that interfaces with hnswlib, taking enormous amounts of inspiration from Dirk Eddelbuettel's RcppAnnoy package which did the same for the Annoy ANN C++ library.
One difference is that I use
roxygen2 to generate the man
pages. The NAMESPACE
is still built manually, however (I don't believe you can
export
the classes currently).
From CRAN:
install.packages("RcppHNSW")
Development versions from github:
devtools::install_github("jlmelville/RcppHNSW")
irism <- as.matrix(iris[, -5])
# function interface returns results for all rows in nr x k matrices
all_knn <- RcppHNSW::hnsw_knn(irism, k = 4, distance = "l2")
# other distance options: "euclidean", "cosine" and "ip" (inner product distance)
# for high-dimensional data you may see a speed-up if you store the data
# where each *column* is an item to be indexed and searched. Set byrow = TRUE
# for this.
# Admittedly, the iris dataset is *not* high-dimensional
iris_by_col <- t(irism)
all_knn <- RcppHNSW::hnsw_knn(iris_by_col, k = 4, distance = "l2", byrow = FALSE)
# process can be split into two steps, so you can build with one set of data
# and search with another
ann <- hnsw_build(irism[1:100, ])
iris_nn <- hnsw_search(irism[101:150, ], ann, k = 5)
As noted in the "Do not use named parameters" section below, you should avoid using named parameters when using class methods. But I do use them in a few places below to document the name of the parameters the positional arguments refer to.
library(RcppHNSW)
data <- as.matrix(iris[, -5])
# Create a new index using the L2 (squared Euclidean) distance
# nr and nc are the number of rows and columns of the data to be added, respectively
# ef and M determines speed vs accuracy trade off
# You must specify the maximum number of items to add to the index when it
# is created. But you can increase this number: see the next example
M <- 16
ef <- 200
dim <- ncol(data)
nitems <- nrow(data)
ann <- new(HnswL2, dim, nitems, M, ef)
# Add items to index
for (i in 1:nitems) {
ann$addItem(data[i, ])
}
# Find 4 nearest neighbors of row 1
# indexes are in res$item, distances in res$distance
# set include_distances = TRUE to get distances as well as index
res <- ann$getNNsList(data[1, ], k = 4, include_distances = TRUE)
# It's more efficient to use the batch methods if you have all the data you
# need at once
ann2 <- new(HnswL2, dim, nitems, M, ef)
ann2$addItems(data)
# Retrieve the 4 nearest neighbors for every item in data
res2 <- ann2$getAllNNsList(data, 4, TRUE)
# labels of the data are in res$item, distances in res$distance
# If you are able to store your data column-wise, then the overhead of copying
# the data into a form usable by hnsw can be noticeably reduced
data_by_col <- t(data)
ann3 <- new(HnswL2, dim, nitems, M, ef)
ann3$addItemsCol(data_by_col)
# Retrieve the 4 nearest neighbors for every item in data_by_col
res3 <- ann3$getAllNNsListCol(data_by_col, 4, TRUE)
# The returned neared neighbor data matrices are also returned column-wise
all(res2$item == t(res3$item) & res2$distance == t(res3$distance))
# Save the index
ann$save("iris.hnsw")
# load it back in: you do need to know the dimension of the original data
ann4 <- new(HnswL2, dim, "iris.hnsw")
# new index should behave like the original
all(ann$getNNs(data[1, ], 4) == ann4$getNNs(data[1, ], 4))
# other distance classes:
# Cosine: HnswCosine
# Inner Product: HnswIP
# Euclidean: HnswEuclidean
Here's a rough equivalent of the serialization/deserialization example from
the
hnswlib README,
but using the recently-added resizeIndex
method to increase the size of the
index after its initial specification, avoiding having to read from or write
to disk:
library("RcppHNSW")
set.seed(12345)
dim <- 16
num_elements <- 100000
# Generate sample data
data <- matrix(stats::runif(num_elements * dim), nrow = num_elements)
# Split data into two batches
data1 <- data[1:(num_elements / 2), ]
data2 <- data[(num_elements / 2 + 1):num_elements, ]
# Create index
M <- 16
ef <- 10
# Set the initial index size to the size of the first batch
p <- new(HnswL2, dim, num_elements / 2, M, ef)
message("Adding first batch of ", nrow(data1), " elements")
p$addItems(data1)
# Query the elements for themselves and measure recall:
idx <- p$getAllNNs(data1, k = 1)
message("Recall for the first batch: ", formatC(mean(idx == 1:nrow(data1))))
# Increase the total capacity, so that it will handle the new data
p$resizeIndex(num_elements)
message("Adding the second batch of ", nrow(data2), " elements")
p$addItems(data2)
# Query the elements for themselves and measure recall:
idx <- p$getAllNNs(data, k = 1)
# You can get distances with:
# res <- p$getAllNNsList(data, k = 1, include_distances = TRUE)
# res$dist contains the distance matrix, res$item stores the indexes
message("Recall for two batches: ", formatC(mean(idx == 1:num_elements)))
Although there's no longer any need for this, for completeness, here's how you
would use save
and new
to achieve the same effect without resizeIndex
:
filename <- "first_half.bin"
# Serialize index
p$save(filename)
# Reinitialize and load the index
rm(p)
message("Loading index from ", filename)
# Increase the total capacity, so that it will handle the new data
p <- new(HnswL2, dim, filename, num_elements)
unlink(filename)
Because these are wrappers around C++ code, you cannot use named parameters in the calling R code. Arguments are parsed by position. This is most annoying in constructors, which take multiple integer arguments, e.g.
### DO THIS ###
dim <- 10
num_elements <- 100
M <- 200
ef_construction <- 16
index <- new(HnswL2, dim, num_elements, M, ef_construction)
### DON'T DO THIS ###
index <- new(HnswL2, dim, ef_construction = 16, M = 200, num_elements = 100)
# treated as if you wrote:
index <- new(HnswL2, dim, 16, 200, 100)
new(HnswL2, dim, max_elements, M = 16, ef_contruction = 200)
creates a new
index using the squared L2 distance (i.e. square of the Euclidean distance),
with dim
dimensions and a maximum size of max_elements
items. ef
and M
determine the speed vs accuracy trade off. Other classes for different distances
are: HnswCosine
for the cosine distance and HnswIp
for the "Inner Product"
distance (like the cosine distance without normalizing).new(HnswL2, dim, max_elements, M, ef_contruction, random_seed)
same as the
previous constructor, but with a specified random seed.new(HnswL2, dim, filename)
load a previously saved index (see save
below)
with dim
dimensions from the specified filename
.new(HnswL2, dim, filename, max_elements)
load a previously saved index (see
save
below) with dim
dimensions from the specified filename
, and a new
maximum capacity of max_elements
. This is a way to increase the capacity of
the index without a complete rebuild.setEf(ef)
set search parameter ef
.setNumThreads(num_threads)
Use (at most) this number of threads when adding
items (via addItems
) and searching the index (via getAllNNs
and
getAllNNsList
). See also the setGrainSize
parameter.setGrainSize(grain_size)
The minimum amount of work to do (adding or
searching items) per thread. If you don't have enough work for all the threads
specified by setNumThreads
to process grain_size
items per thread, then
fewer threads will be used. This is useful for cases where the cost of context
switching between larger number of threads would outweigh the performance gain
from parallelism. For example, if you have 100 items to process and asked for
four threads, then 25 items will be processed per thread. However, setting the
grain_size
to 50 will result in 50 items being processed per thread, and
therefore only two threads being used.addItem(v)
add vector v
to the index. Internally, each vector gets an
increasing integer label, with the first vector added getting the label 1
, the
second 2
and so on. These labels are returned in getNNs
and related methods
to identify which vector in the index are neighbors.addItems(m)
add the row vectors of the matrix m
to the index. Internally,
each row vector gets an increasing integer label, with the first row added
getting the label 1
, the second 2
and so on. These labels are returned in
getNNs
and related methods to identify which vector in the index are
neighbors. The number of threads specified by setNumThreads
is used for
building the index and may be non-deterministic.addItemsCol(m)
Like addItems
but adds the column vectors of m
to the
index. Storing data column-wise makes copying the data for use by hnsw
more
efficient.save(filename)
saves an index to the specified filename
. To load an index,
use the new(HnswL2, dim, filename)
constructor (see above).getItems(ids)
returns a matrix where each row is the data vector from the
index associated with integer indices in the vector of ids
. For cosine
similarity, the l2 row-normalized vectors are returned. ids
are one-indexed,
i.e. to get the first and tenth vectors that were added to the index, use
getItems(c(1, 10))
, not getItems(c(0, 9))
.getNNs(v, k)
return a vector of the labels of the k
-nearest neighbors of
the vector v
. Labels are integers numbered from one, representing the
insertion order into the index, e.g. the label 1
represents the first item
added to the index. If k
neighbors can't be found, an error will be thrown.
This normally means that ef
or M
have been set too small, but also bear in
mind that you can't return more items than were put into the index.getNNsList(v, k, include_distances = FALSE)
return a list containing a
vector named item
with the labels of the k
-nearest neighbors of the vector
v
. Labels are integers numbered from one, representing the insertion order
into the index, e.g. the label 1
represents the first item added to the index.
If include_distances = TRUE
then also return a vector distance
containing
the distances. If k
neighbors can't be found, an error is thrown.getAllNNs(m, k)
return a matrix of the labels of the k
-nearest neighbors
of each row vector in m
. Labels are integers numbered from one, representing
the insertion order into the index, e.g. the label 1
represents the first item
added to the index. If k
neighbors can't be found, an error is thrown. The
number of threads specified by setNumThreads
is used for searching.getAllNNsList(m, k, include_distances = FALSE)
return a list containing a
matrix named item
with the labels of the k
-nearest neighbors of each row
vector in m
. Labels are integers numbered from one, representing the insertion
order into the index, e.g. the label 1
represents the first item added to the
index. If include_distances = TRUE
then also return a matrix distance
containing the distances. If k
neighbors can't be found, an error is thrown.
The number of threads specified by setNumThreads
is used for searching.getAllNNsCol(m, k)
like getAllNNs
but each item to be searched in m
is
stored by column, not row. In addition the returned matrix of k
-nearest
neighbors is also stored column-wise: i.e. the dimension of the return value
matrix is k x n
where n
is the number of items (columns) in m
. By passing
the data column-wise, some overhead associated with copying data to and from
hnsw
can be reduced.getAllNNsListCol(m, k)
like getAllNNsList
but each item to be searched in
m
is stored by column, not row. In addition, the matrices in the returned
list are also stored column-wise: i.e. the dimension of the return value matrix
is k x n
where n
is the number of items (columns) in m
. By passing the
data column-wise, some overhead associated with copying data to and from hnsw
can be reduced.size()
returns the number of items in the index. This is an upper limit on
the number of neighbors you can expect to return from getNNs
and the other
search methods.markDeleted(i)
marks the item with label i
(the i
th item added to the
index) as deleted. This means that the item will not be returned in any further
searches of the index. It does not reduce the memory used by the index. Calls to
size()
do not reflect the number of marked deleted items.resize(max_elements)
changes the maximum capacity of the index to
max_elements
.getNNsList
or as input in markDeleted
or getItems
, the
labels represent the order in which the items were added to the index, using
1-indexing to be consistent with R. So in the Python bindings, the first item in
the index has a default of label 0
, but here it will have label 1
.