Locality-sensitive hashing for Apache Spark. Largely a PySpark port of the spark-hash project.
This project follows the main workflow of the spark-hash Scala LSH implementation. Its core lsh.py
module accepts an RDD-backed list of either dense NumPy arrays or PySpark SparseVectors, and produces a model that is simply a wrapper around all of the intermediate RDDs generated along the way. More details on each of these steps follow below.
It is important to note that while this pipeline will accept either dense or sparse vectors, the original hash function from spark-hash will almost certainly fail with dense vectors, resulting in all vectors being hashed into all bands. Work is currently underway to implement alternative hash functions that more evenly split dense vectors. For the sparse case, the results duplicate those of spark-hash.
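For example, feeding sparse vectors into the pipeline might look roughly like the sketch below. The `lsh.run()` call and its keyword arguments are assumptions modeled on the spark-hash parameters, not a confirmed signature; check lsh.py for the actual entry point and parameter names.

```python
# Hypothetical usage sketch. The lsh.run() signature below is an assumption
# based on the spark-hash parameters (p, m, n, b, c); verify against lsh.py.
from pyspark import SparkContext
from pyspark.mllib.linalg import SparseVector
import lsh

sc = SparkContext(appName="pyspark-lsh-demo")

# Each record is a SparseVector over a shared dimensionality (here, 1000).
data = sc.parallelize([
    SparseVector(1000, {3: 1.0, 41: 1.0, 999: 1.0}),
    SparseVector(1000, {3: 1.0, 42: 1.0, 999: 1.0}),
    SparseVector(1000, {7: 1.0, 77: 1.0, 777: 1.0}),
])

# p: integer larger than the dimensionality, m: number of bins,
# n: number of hash functions (rows), b: number of bands,
# c: minimum cluster size. All names here are illustrative assumptions.
model = lsh.run(data, p=1009, m=1000, n=100, b=25, c=2)

# The returned model simply wraps the intermediate RDDs
# (signatures, bands, buckets, clusters) described above.
```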
Usage follows that of the spark-hash project. Parameters remain the same.
Command-line parameters:
Other parameters:
One parameter configures the minhash hashing function: it is an integer that is larger than the number of dimensions of the original data, and is used to generate the random numbers that seed the minhash function.

As described in the MMDS book, LSH can be tuned by adjusting the number of rows and bands such that:
threshold = (1.0 / bands) ** (1.0 / (rows / bands))
Naturally, the number of rows, the number of bands, and the resulting band size (rows / bands) dictate the quality of the results yielded by LSH. Higher thresholds produce clusters with higher similarity; lower thresholds typically produce more clusters but sacrifice similarity.
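As a quick illustration of how these parameters interact, the snippet below evaluates the threshold formula for a fixed number of rows and a few band counts (the specific values are arbitrary examples):

```python
# Evaluate the LSH threshold for 100 rows and several band counts.
def lsh_threshold(rows, bands):
    return (1.0 / bands) ** (1.0 / (rows / bands))

for bands in (50, 25, 10, 5):
    print("rows=100 bands=%2d threshold=%.3f" % (bands, lsh_threshold(100, bands)))

# Fewer bands (i.e. more rows per band) push the threshold up, yielding
# fewer but more similar clusters; more bands push it down.
```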
Regardless of parameters, it may be wise to independently verify each cluster. One such verification method is to calculate the Jaccard similarity of the cluster (each cluster is effectively a set of sets). SciPy's Jaccard distance is used for this, although future development will allow for user-supplied distance functions.
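For instance, one way to spot-check a cluster of binary vectors is to average the pairwise Jaccard similarities of its members, computed with SciPy's Jaccard distance (similarity = 1 - distance). The helper below is an illustrative sketch, not part of the module:

```python
from itertools import combinations
import numpy as np
from scipy.spatial.distance import jaccard

def mean_jaccard_similarity(cluster):
    """Average pairwise Jaccard similarity over a cluster of binary vectors."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 1.0
    # scipy's jaccard() returns a distance, so convert to a similarity.
    return float(np.mean([1.0 - jaccard(u, v) for u, v in pairs]))

cluster = [
    np.array([1, 1, 0, 1, 0], dtype=bool),
    np.array([1, 1, 0, 0, 0], dtype=bool),
    np.array([1, 1, 1, 1, 0], dtype=bool),
]
print(mean_jaccard_similarity(cluster))  # closer to 1.0 means a tighter cluster
```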
Work on this project is ongoing and includes:
- alternative minhash hashing functions that handle dense vectors more evenly
- support for user-supplied distance functions when verifying clusters