Spark-LSH

Locality-sensitive hashing for Apache Spark. Largely a PySpark port of the spark-hash project.

Prerequisites
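
This project requires a working PySpark installation; it also makes use of NumPy (for dense input vectors) and SciPy (for the Jaccard similarity computation used when verifying clusters).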

Implementation Details

This project follows the main workflow of the spark-hash Scala LSH implementation. Its core lsh.py module accepts an RDD-backed list of either dense NumPy arrays or PySpark SparseVectors, and generates a model object that is simply a wrapper around all the intermediate RDDs produced along the way. More details on each of these steps follow.
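
As a minimal sketch of what that input looks like (the variable names and the commented-out call below are illustrative assumptions, not the module's documented API), an RDD of PySpark SparseVectors can be built and handed off to lsh.py:

    from pyspark import SparkContext
    from pyspark.mllib.linalg import SparseVector

    import lsh  # this project's core module

    sc = SparkContext(appName="pyspark-lsh-example")

    # An RDD of 5-dimensional sparse feature vectors; only nonzero indices are stored.
    data = sc.parallelize([
        SparseVector(5, {0: 1.0, 3: 1.0}),
        SparseVector(5, {0: 1.0, 2: 1.0, 3: 1.0}),
        SparseVector(5, {1: 1.0, 4: 1.0}),
    ])

    # Hypothetical invocation -- consult lsh.py and the spark-hash documentation
    # for the actual entry point and its parameters:
    # model = lsh.run(data, ...)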

It is important to note that while this pipeline will accept either dense or sparse vectors, the original hash function from spark-hash will almost certainly fail with dense vectors, resulting in all vectors being hashed into all bands. Work is currently underway to implement alternative hash functions that more evenly split dense vectors. For the sparse case, the results duplicate those of spark-hash.

Usage

Usage follows that of the spark-hash project. Parameters remain the same.

Parameters

Command-line parameters:

Other parameters:

Tuning

As described in the Mining of Massive Datasets (MMDS) book, LSH can be tuned by adjusting the number of rows and bands such that:

threshold = (1.0 / bands) ** (1.0 / (rows / bands))

Naturally, the number of rows, the number of bands, and the resulting band size (rows / bands) dictate the quality of the results yielded by LSH. Higher thresholds produce clusters of higher similarity; lower thresholds typically produce more clusters, but at the cost of similarity.
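
As a quick illustration (these numbers are arbitrary, not project defaults), 100 rows split into 20 bands of 5 rows each gives a threshold of roughly 0.55:

    bands = 20
    rows = 100
    # (1.0 / 20) ** (1.0 / 5) is approximately 0.55: pairs with similarity
    # above ~55% are likely to land in the same bucket in at least one band.
    threshold = (1.0 / bands) ** (1.0 / (rows / bands))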

Regardless of the parameters chosen, it is a good idea to independently verify each cluster. One such verification method is to compute the Jaccard similarity of the cluster (it is a set of sets). SciPy's Jaccard metric is used, although future development will allow for user-supplied distance functions.
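
As a rough sketch of such a check (the vectors below are illustrative and not part of the project's code), SciPy's Jaccard distance between boolean occurrence vectors can be flipped into a similarity:

    import numpy as np
    from scipy.spatial.distance import jaccard

    # Two items encoded as boolean occurrence vectors over the same vocabulary.
    u = np.array([1, 0, 1, 1, 0], dtype=bool)
    v = np.array([1, 0, 0, 1, 0], dtype=bool)

    # scipy.spatial.distance.jaccard returns a dissimilarity in [0, 1];
    # subtracting from 1 yields a similarity (here 2 shared / 3 present ~= 0.67).
    similarity = 1.0 - jaccard(u, v)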

Future

Work on this project is ongoing and includes:
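- Alternative hash functions that distribute dense vectors more evenly across bands (see Implementation Details above).
- Support for user-supplied distance functions when verifying clusters (see Tuning above).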