SimilaritySearch.jl is a library for nearest neighbor search. In particular, it contains the implementation for SearchGraph,
a fast and flexible search index using any metric function. It is designed to support multithreading in most of its functions and structures.
The package provides the following indexes:
ParallelExhaustiveSearch
: A brute force search index where each query is solved using all available threads.ExhaustiveSearch
: A brute force search index, each query is solved using a single thread.SearchGraph
: An approximate search index with parallel construction.The main set of functions are:
search
: Solves a single query.searchbatch
: Solves a set of queries.allknn
: Computes the $k$ nearest neighbors for all elements in an index.neardup
: Removes near-duplicates from a metric dataset.closestpair
: Computes the closest pair in a metric dataset.The precise definitions of these functions and the complete set of functions and structures can be found in the documentation.
Currently, there exists several packages dedicated to nearest neighbor search, for instance we have NearestNeighbors.jl
, RegionTrees.jl
, and JuliaNeighbors
implement search structures like kd-trees, ball trees, quadtrees, octrees, bk-trees, vp-tree and other multidimensional and metric structures. These structures work quite well for low dimensional data since they are designed to solve exact similarity queries.
There exist several packages performing approximate similarity search, like Rayuela.jl
using product quantization schemes, the wrapper for the FAISS
library Faiss.jl
. The FAISS library provides high-performance implementations of product quantization schemes and locality-sensitive hashing schemes, along with an industrial-strength implementation of the HNSW
index. The NearestNeighborDescent.jl
implements the search algorithm behind pynndescent
.
The SimilaritySearch.jl
package tries to enrich the ecosystem with search structures and algorithms designed to take advantage of multithreading systems and a unique autotuning feature that simplifies its usage for practitioners. These features are succinctly and efficiently implemented due to the Julia programming language dynamism and performance.
Regarding performance characteristics, the construction times are vastly reduced compared to similar approaches without reducing search performance or result quality.
You may install the package as follows
] add SimilaritySearch.jl
also, you can run the set of tests as follows
] test SimilaritySearch
Please see examples. You will find a list of Jupyter and Pluto notebooks, and some scripts that exemplifies its usage.
Contributions are welcome. Please fill a pull request for documentating and implementation contributions. For issues, please fill an issue with the necessary information (see below.) If you already have a solution please also provide a pull request.
Report issues in the package providing a minimal reproducible example. If the issue is data dependant, please don't forget to provide the necessary data to reproduce it.
SearchGraph
The main search structure, the SearchGraph,
is a graph with several characteristics, many of them induced by the dataset being indexed. Some of its known limitations are related to these characteristics. For instance:
The following manuscript describes and benchmarks the SearchGraph
index (package version 0.6
):
@article{tellezscalable,
title={A scalable solution to the nearest neighbor search problem through local-search methods on neighbor graphs},
author={Tellez, Eric S and Ruiz, Guillermo and Chavez, Edgar and Graff, Mario},
journal={Pattern Analysis and Applications},
pages={1--15},
publisher={Springer}
}
The current algorithm (version 0.8
and 0.9
) is described and benchmarked in the following manuscript:
@misc{tellez2022similarity,
title={Similarity search on neighbor's graphs with automatic Pareto optimal performance and minimum expected quality setups based on hyperparameter optimization},
author={Eric S. Tellez and Guillermo Ruiz},
year={2022},
eprint={2201.07917},
archivePrefix={arXiv},
primaryClass={cs.IR}
}
This package is also described in the JOSS paper:
Eric S. Tellez and Guillermo Ruiz.
SimilaritySearch.jl
: Autotuned nearest neighbor indexes for Julia. Journal of Open Source Software https://doi.org/10.21105/joss.04442.
The algorithms of this version are the same as v0.8 but break API compatibility:
Polyester
package to handle multithreading instead of Threads.@threadsallknn
now preserves self-references to simplify algorithms and improve efficiency (allknn
in v0.8 removes self-references automatically)Others:
SearchGraph
graph pruning methodstimedsearchbatch
functionIt makes easy to adjust the SearchGraph
structure to different workloads and applications. For instance,
Please refer to https://github.com/sadit/SimilaritySearchDemos and https://github.com/sadit/SimilaritySearch.jl/blob/main/test/testsearchgraph.jl for working examples.
It introduces a major refactoring. In particular, it makes explicit use of context objects for most functions. It also introduces simple logging procedures. However, we preserve compatibility in many public functions using implicit use of default context objects.