A Julia package for locality-sensitive hashing to accelerate similarity search.
Traditionally, if you have a data point x
, and want to find the most similar
point(s) to x
in your database, you would compute the similarity between x
and all of the points in your database, and keep whichever points were the most
similar. For instance, this type of approach is used by the classic k-nearest
neighbors algorithm.
However, it has two major problems:
x
is linear in the number of
points in your database. This can make similarity search prohibitively
expensive for even moderately large datasets.Locality-sensitive hashing (LSH) is a technique for accelerating these kinds
of similarity searches. Instead of measuring how similar your query point is to
every point in your database, you calculate a few hashes of the query point and
only compare it against those points with which it experiences a hash collision.
Locality-sensitive hash functions are randomly generated, with the fundamental
property that as the similarity between x
and y
increases, the probability
of a hash collision between x
and y
also increases.
You can install LSHFunctions.jl from the Julia REPL with
pkg> add LSHFunctions
So far, there are hash functions for the similarity functions:
SimHash
)MinHash
)L1Hash
L2Hash
SignALSH
(recommended)MIPSHash
MonteCarloHash
ChebHash
This package still needs a lot of work, including improvement to the documentation and API.
The easiest way to start constructing new hash functions is by calling
LSHFunction
with the following syntax:
hashfn = LSHFunction(similarity function,
number of hash functions to generate;
[LSH family-specific keyword arguments])
For example, the following snippet generates 10 locality-sensitive hash
functions (bundled together into a single SimHash
) for cosine similarity:
julia> using LSHFunctions;
julia> hashfn = LSHFunction(cossim, 10);
julia> n_hashes(hashfn)
10
julia> similarity(hashfn)
cossim
You can hash inputs by calling hashfn()
:
julia> x = randn(128);
julia> x_hashes = hashfn(x);
For more details, check out the LSHFunctions.jl documentation.