kazuho / url_compress

a static PPM-based URL compressor / decompressor
http://developer.cybozu.co.jp/archives/kazuho/2010/10/compressing-url.html
34 stars 1 forks source link

url_compress - a static PPM-based URL compressor / decompressor

(c) 2008-2010 Cybozu, Labs. Inc.

INTRODUCTION

The library is a compressor / decompressor specifically designed to compress URLs. Expected compression ratio is from 25% to 40% of the original size. The compressed form does not preserve the original order however it is possible to perform a prefix search without decompressing the URLs.

TO BUILD AND TEST

Generate the compression tables. The following example uses "corpus/travail5" as prior knowledge and generates a 4095-entry compression table set (which will be around 2MB in size).

% scripts/build_chreorder.pl < corpus/travail5 > src/chreorder.c % scripts/build_rctable.pl corpus/travail5 4095 1 1 > src/rctable.c

Compile the test binary.

% g++ -DRANGE_CODER_USE_SSE -g -Wall -O2 -Irangecoder src/url_coder.cpp src/test.c -o /tmp/url_compress_test

Test compression / decompression (as well as the compression ratio) using the corpus.

% /tmp/url_compress_test corpus/travail

ADDING YOUR OWN CORPUS

The compressor uses a static (predefined) compression table. So it is better to use your own list of URLs as the prior knowledge. Use the following command add your list of URLs to the corpus.

% scripts/normalize_corpus.pl < your_own_url_list > corpus/your_urls

Once added, follow the above section: TO BUILD AND TEST.

USING THE LIBRARY FROM MYSQL

Compile the UDF and copy it into lib/mysql/plugin (or lib/plugin) directory of your MySQL installation. (NOTE: on OSX use -bundle instead of -shared -fPIC)

% g++ -DRANGE_CODER_USE_SSE -g -Wall -O2 -Irangecoder -shared -fPIC src/url_coder.cpp src/url_compress_udf.c -o url_compress_udf.so

Then activate the UDFs using the "CREATE FUNCTION" command.

mysql> CREATE FUNCTION my_url_compress RETURNS STRING SONAME 'url_compress_udf.so'; mysql> CREATE FUNCTION my_url_decompress RETURNS STRING SONAME 'url_compress_udf.so';

Check that the UDFs actually work.

mysql> SELECT LENGTH(my_url_compress('http://www.google.com/')); +---------------------------------------------------+ | length(my_url_compress('http://www.google.com/')) | +---------------------------------------------------+ | 3 | +---------------------------------------------------+ 1 row in set (0.00 sec)

mysql> select my_url_decompress(my_url_compress('http://www.google.com/')); +--------------------------------------------------------------+ | my_url_decompress(my_url_compress('http://www.google.com/')) | +--------------------------------------------------------------+ | http://www.google.com/ | +--------------------------------------------------------------+ 1 row in set (0.00 sec)

THANKS TO

@travail and @fujiwara for providing the corpus.