mediastandardstrust / superfastmatch

A tool for bulk text comparison and analysis
http://superfastmatch.org

README

This is a new version of Superfastmatch, written in C++ to improve matching performance, with an index that runs entirely in memory to improve response times.

The point of the software is to index large amounts of text in memory. There is therefore little reason to run it on a 32-bit OS, which caps memory at 4GB, so a 64-bit OS is assumed.
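A quick way to confirm the 64-bit assumption before building is to ask the system for its word size. This is only a sketch: the function name require_64bit is made up for illustration and is not part of superfastmatch, and it relies on getconf LONG_BIT being available (it is on typical Linux systems).

```shell
# Hypothetical helper (not part of superfastmatch): succeed only when the
# word size is 64 bits. With no argument it asks the OS via getconf.
require_64bit() {
  bits=${1:-$(getconf LONG_BIT)}
  [ "$bits" = "64" ]
}

if require_64bit; then
  echo "64-bit OS detected"
else
  echo "32-bit OS detected: the in-memory index would be capped at 4GB"
fi
```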

The process for installation is as follows:

Dependencies

Superfastmatch depends on these libraries:

Google gflags

Google perftools

Google ctemplate

Google sparsehash

RE2

Kyoto Cabinet

Kyoto Tycoon

You might be able to get away with installing the .deb packages on the listed project pages, but this is untested.

The easier route is to run:

./scripts/bootstrap.sh

and wait for everything to build. The script will ask you for your sudo password, which is required to install the libraries.

On Ubuntu you'll need to do this first:

sudo apt-get install libunwind7-dev mercurial curl build-essential zlib1g-dev

You might also need to run:

sudo ldconfig

after the script has finished.
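Before starting the bootstrap script it can help to confirm that the command-line tools from the packages above are actually on the PATH. The sketch below is an assumption-laden convenience, not something shipped with the project: missing_tools is a hypothetical helper, and the tool names are the usual commands provided by mercurial (hg) and build-essential (gcc, g++, make).

```shell
# Hypothetical helper (not part of superfastmatch): print each named
# command that cannot be found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# hg comes from the mercurial package; gcc, g++ and make from build-essential.
missing=$(missing_tools hg curl gcc g++ make)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All build tools found"
fi
```

This checks only for executables, so it cannot tell whether development headers such as zlib1g-dev or libunwind7-dev are installed.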

On Fedora/Amazon AMI this will allow bootstrap.sh to complete:

sudo yum update
sudo yum install git
sudo yum install svn
sudo yum install gcc
sudo yum install gcc-c++
sudo yum install zlib-devel
sudo yum install mercurial
wget http://download.savannah.gnu.org/releases/libunwind/libunwind-0.99.tar.gz
tar xzf libunwind-0.99.tar.gz
cd libunwind-0.99
./configure && make && sudo make install

You might also have to add /usr/local/lib to /etc/ld.so.conf.
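One way to make that change without duplicating the entry is to append the directory only when the file does not already list it. This is a minimal sketch: add_lib_dir is a hypothetical helper, and it inspects only the single file it is given, not any files that file may include.

```shell
# Hypothetical helper (not part of superfastmatch): append a directory to a
# linker config file unless an identical line is already present.
add_lib_dir() {
  dir=$1
  conf=$2
  # -x matches whole lines only, -F treats the directory as a fixed string.
  grep -qxF "$dir" "$conf" 2>/dev/null || printf '%s\n' "$dir" >> "$conf"
}

# Intended use, as root, followed by a cache refresh:
#   add_lib_dir /usr/local/lib /etc/ld.so.conf
#   ldconfig
```

Writing to /etc/ld.so.conf requires root, and the linker cache only picks up the new directory once ldconfig has been run.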

Test

After the libraries are installed, you can run:

make check

to run the unit tests for the code.

Build

After that you can run:

make run

to get a superfastmatch instance running. Nothing is configurable from the command line yet. Coming soon...

Visit http://127.0.0.1:8080 to test the interface.
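If you would rather check from a terminal than a browser, a small polling loop with curl can tell you when the instance has started answering. This is a sketch under assumptions: wait_for_http is a hypothetical helper, and it treats any HTTP response at all as "up" without inspecting the status code or body.

```shell
# Hypothetical helper (not part of superfastmatch): poll a URL until the
# server answers, retrying once per second up to the given number of tries.
wait_for_http() {
  url=$1
  tries=${2:-30}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # Any HTTP response counts as success; a refused connection does not.
    if curl -s -o /dev/null --max-time 2 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

Running wait_for_http http://127.0.0.1:8080/ in a second terminal after make run returns success as soon as the address above starts responding, and failure if it never does within the retry limit.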

Data

For a quick introduction to what can be found with superfastmatch try this:

If you have a machine with less than 8GB of memory and fewer than 4 cores, run:

./superfastmatch -debug -hash_width 24 -reset -slot_count 2 -thread_count 2 -window_size 30

otherwise this will be much faster:

./superfastmatch -debug -reset -window_size 30
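The choice between the two commands above can be sketched as a small script that inspects the machine and prints the matching command line. Treat it as an illustration only: choose_flags is a hypothetical helper, the detection is Linux-specific (/proc/meminfo and getconf), and the thresholds are read literally from the text above.

```shell
# Hypothetical helper (not part of superfastmatch): print the flag set
# suggested above for a given memory size (whole GB) and core count.
choose_flags() {
  mem_gb=$1
  cores=$2
  if [ "$mem_gb" -lt 8 ] && [ "$cores" -lt 4 ]; then
    echo "-debug -hash_width 24 -reset -slot_count 2 -thread_count 2 -window_size 30"
  else
    echo "-debug -reset -window_size 30"
  fi
}

# Detect the local machine. Integer division rounds down to whole GB.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null || true)
mem_gb=$(( ${mem_kb:-0} / 1024 / 1024 ))
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Print the command rather than running it, so it can be reviewed first.
echo "./superfastmatch $(choose_flags "$mem_gb" "$cores")"
```

Because the kernel reports slightly less than the installed memory, a machine sold as 8GB will usually round down to 7 here, so check the printed command before using it.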

And then finally, in another terminal window, run:

./scripts/gutenberg.sh

to load some example documents and associate them with each other. You can view the results in the browser at:

http://127.0.0.1:8080/document/

Daemonizing

See contrib/init.d for an example init.d script. It makes use of fuser, which may require:

sudo apt-get install psmisc

Feedback

All feedback welcome. Either create an issue here or ask a question on the mailing list.

Known Issues

This is still an early release, halfway between alpha and beta! There are known issues with large documents affecting the document list and detail pages, and the full REST specification is not yet implemented. Lots of fixes, new features and performance improvements are currently in development, so keep checking the commit log!

Acknowledgements

Thanks to Martin Moore and Ben Campbell at Media Standards Trust for ongoing support for the project and to Tom Lee, Drew Vogel, Kaitlin Lee and James Turk at Sunlight Labs for being willing testers, early adopters and proponents of open source!

Thanks also to Mikio Hirabayashi for assistance and the excellent open source Kyoto Cabinet and Kyoto Tycoon, to Craig Silverstein for accepting and improving this patch, to Neil Fraser for useful hints and inspiration from Diff-Match-Patch and to Austin Appleby for hashing advice.