OpenLSH is an open source platform that implements Locality Sensitive Hashing.
To familiarize yourself with LSH, please visit the demo site.
To try it, you will need a Gmail-based account and a Twitter account.
That is the main idea of LSH: the incoming "Documents" — tweets in this case — are assigned to "buckets". This can be done as documents come in from the source, as implemented in the demo. The resulting buckets can be examined and identical or similar documents just fall into the same buckets. Each document will fall into multiple buckets. We examine buckets that have multiple documents in them and report on the ones that have a high degree of Jaccard similarity of their 4-gram character shingles.
Central to OpenLSH is a sparse matrix data structure Matrix
.
It has as many rows as there are documents and as many columns as there are buckets (up to 232 buckets in this implementation).
A related data structure MatrixRow
represents a document.
A line format class specifies how to parse incoming documents.
The line format class should specify a static method parse
which returns a document ID and the text to be used for classification.
See classes TweetLine
and PeerbeltLine
in the code base as examples.
You may follow the implementation of PeerbeltLine as a template for parsing documents in your implementation —
the original text is HTML and we strip out content inside <script>
and <style>
tags and also remove other HTML tags before assigning documents to buckets.
In this case, documents that have the same text but differ only in their markup are considered to be identical.
A database specifies how to interact with the database. settings.py
specifies which database is being used.
Implementations are available for Google App Engine Datastore and for Cassandra.
An in-memory database based on Python data structures is also available and it is handy for testing.
To start, replicate the code in your own App Engine instance.
git clone https://github.com/singhj/locality-sensitive-hashing.git
).twitter_settings.py
as appropriate.
Take care to specify a Callback URL correctly and
Ours is http://open-lsh.datathinks.org/twitter_callback
, yours should be customized to your setup.app.yaml
.Example Peerbelt data is provided in the example_data
directory. To try OpenLSH against this file,
settings.py
to use the in-memory database.serial.py
specifying the provided data file as input.First, write a line format class for parsing your input data.
Second, if you are working with a different database than Cassandra or Datastore, you will need to write a database driver; we can help with this if required.
A command-line implementation is available in serial.py
, which may be adapted if you will be running from the command line.
A web-based implementation is available in read_tweepy.py
.
A map-reduce based implementation may be available soon, contact us for details.
The implementation is intended to be compact in its use of storage; we don't store intermediate results (i.e., minhash values) in the database.
To compute the similarity of two documents, we need to programmatically compare them. And we do.
You may wish to make a different trade-off, and uncomment the minhash lines in the implementation of MatrixRow
.
Some of the LSH parameters are embedded in the Matrix
constructor.
If you are implementing another database driver, or other changes that would benefit others, please send a pull request.