migumar2 / libCSD

C++ Library implementing Compressed String Dictionaries
GNU Lesser General Public License v2.1
45 stars 9 forks source link

libCSD is a C++ library providing some different techniques for managing string dictionaries in compressed space. These approaches are inspired on the paper:

========================================================================== "Compressed String Dictionaries" Nieves R. Brisaboa, Rodrigo Cánovas, Francisco Claude, Miguel A. Martínez-Prieto, and Gonzalo Navarro 10th Symposium on Experimental Algorithms (SEA'2011), p.136-147, 2011.

The current release is a beta version providing full functionality for string location and ID extraction. Besides, some techniques support queries by prefix and substrings. The set of techniques is described as follows:

The techniques support different parameterizations to achieve varied and competitive space/time tradeoffs.

Compiling the library

1) For building the libcds library, type "make" within the libcds folder. 2) Once built libcds, type "make" within the root folder and libCSD is ready.

Building a dictionary

The library provides a simple command-line script for building dictionaries:

./Build

It is worth noting that "in" dictionary file must be lexicographically sorted and strings must be ended with the '\0' ASCII char.

Examples:

./Build 1 h 10 geonames dicts/geo.10

Builds a HASH dictionary for "geonames" and stores it as "dicts/geo.10". The dictionary uses a hash table which hash 10% more cells than strings in "geonames" and compresses them using Huffman (h).

./Build 4 r 16 geonames dicts/geo.16

Builds a HTFC (Hu-Tucker Front-Coding) dictionary for "geonames" and stores it as "dicts/geo.16". The dictionary uses buckets of 16 strings and compresses them using Re-Pair.

Testing a dictionary

The library also provides a command-line script for testing purposes:

./Test

Examples:

./Test r l dicts/geo.10 tests/geo.strings

Uses the testbed "tests/geo.strings" for analyzing how the dictionary stored at "dicts/geo.10" behaves for 'locate'.

./Test g 100000 dicts/geo.10 tests/geo

Uses the dictionary stored at "dicts/geo.10" for generating a basic testbed of 100,000 strings and IDs which are stored as "tests/geo".

./Test p 16 dicts/geo.10 tests/geo

Uses the dictionary stored at "dicts/geo.10" for generating a prefix testbed comprising patterns of lengths longer and shorter than 16 chars. The resulting pattern set are stored at "tests/geo".

If you find bugs or have any issue with library, please ask us. Enjoy the library and if you find it useful for your research, please cite our paper:

@article{MPBCCN16, author = "M.A. Mart{\'{\i}}nez-Prieto and N. Brisaboa and R. C{\'a}novas
and F. Claude and G. Navarro", title = "Practical Compressed String Dictionaries", booktitle = "Information Systems", year = "2016", volume = {56}, pages = "73--108" }