url_compress - a static PPM-based URL compressor / decompressor
(c) 2008-2010 Cybozu, Labs. Inc.
INTRODUCTION
The library is a compressor / decompressor specifically designed to compress URLs. Expected compression ratio is from 25% to 40% of the original size. The compressed form does not preserve the original order however it is possible to perform a prefix search without decompressing the URLs.
TO BUILD AND TEST
Generate the compression tables. The following example uses "corpus/travail5" as prior knowledge and generates a 4095-entry compression table set (which will be around 2MB in size).
% scripts/build_chreorder.pl < corpus/travail5 > src/chreorder.c % scripts/build_rctable.pl corpus/travail5 4095 1 1 > src/rctable.c
Compile the test binary.
% g++ -DRANGE_CODER_USE_SSE -g -Wall -O2 -Irangecoder src/url_coder.cpp src/test.c -o /tmp/url_compress_test
Test compression / decompression (as well as the compression ratio) using the corpus.
% /tmp/url_compress_test corpus/travail
ADDING YOUR OWN CORPUS
The compressor uses a static (predefined) compression table. So it is better to use your own list of URLs as the prior knowledge. Use the following command add your list of URLs to the corpus.
% scripts/normalize_corpus.pl < your_own_url_list > corpus/your_urls
Once added, follow the above section: TO BUILD AND TEST.
USING THE LIBRARY FROM MYSQL
Compile the UDF and copy it into lib/mysql/plugin (or lib/plugin) directory of your MySQL installation. (NOTE: on OSX use -bundle instead of -shared -fPIC)
% g++ -DRANGE_CODER_USE_SSE -g -Wall -O2 -Irangecoder -shared -fPIC src/url_coder.cpp src/url_compress_udf.c -o url_compress_udf.so
Then activate the UDFs using the "CREATE FUNCTION" command.
mysql> CREATE FUNCTION my_url_compress RETURNS STRING SONAME 'url_compress_udf.so'; mysql> CREATE FUNCTION my_url_decompress RETURNS STRING SONAME 'url_compress_udf.so';
Check that the UDFs actually work.
mysql> SELECT LENGTH(my_url_compress('http://www.google.com/')); +---------------------------------------------------+ | length(my_url_compress('http://www.google.com/')) | +---------------------------------------------------+ | 3 | +---------------------------------------------------+ 1 row in set (0.00 sec)
mysql> select my_url_decompress(my_url_compress('http://www.google.com/')); +--------------------------------------------------------------+ | my_url_decompress(my_url_compress('http://www.google.com/')) | +--------------------------------------------------------------+ | http://www.google.com/ | +--------------------------------------------------------------+ 1 row in set (0.00 sec)
THANKS TO
@travail and @fujiwara for providing the corpus.