This is a C++ template library for succinct data structures called sdsl.
Succinct data structures are fascinating: They represent an object (like a bitvector, a tree, suffix array,...) in space close the information-theoretic lower bound of the object but the defined operations can still be performed efficiently. Hmmm, at least in theory ;) Actually there is still a big gap between theory and practice. Why? The time complexity of an operations performed on the classical fat data structure and the slim succinct data structure are the same most time in theory. However, in practice succinct structures are slow since the operations require often memory accesses with bad locality of references. Moreover, often the in theory small sub-linear space data structures account for a large amount of memory, since they are only asymptotic sub-linear and the input size for which they are negligible in practice is galactic.
The aim of the library is to provide basic and complex succinct data structure which are
A lot of engineering tricks had to be applied to reach the performance goal, for instance the use a semi-external algorithm, bit-parallelism on 64-bit words, and cache-friendly algorithms.
bit_vector
)bit_vector_interleaved
)rrr_vector<>
)sd_vector<>
)rank_support_v
,rank_support_v5
,select_support_mcl
,...)rrr_rank_support<>
, sd_rank_support<>
,...) coder::elias_delta
)coder::fibonacci
)w
-bit integers (int_vector<w>
)w
-bit integers (int_vector<0>
, w
passed to the constructor)coder
(enc_vector<coder>
)wt
)wt_int
)wt_huff
) wt_rlmn
, wt_rlg
, and wt_rlg8
)csa_wt
)csa_sada
bp_support_sada
) to find_open
, find_close
,
enclose
, double_enclose
,...bp_support_g
, bp_support_gg
)rmq_succinct_sct
, rmq_succinct_sada
)rmq_support_sparse_table
)lcp_dac
)lcp_kurtz
)lcp_wt
)lcp_support_sada
)lcp_support_tree
)lcp_support_tree2
)cst_sada
)cst_sct3
)Let us now show how you can assemble even a very
complex data structure very easily. Lets begin with
the most complex one, a CST!
It basically consists of a CSA, an compressed LCP-array,
and a succinct representation of the tree topology;
each part can be specified by a template parameter.
Say, we want fast navigation operations, so we take
the class cst_sada<cst_type, lcp_type, bp_support_type>
for our CST. Now we can specify the type of CSA.
Lets take a CSA based on wavelet tree:
csa_wt<wt_type, SA_sample_dens, inv_SA_sample_dens>
.
We can recursively specify the used types. So
now we can specify the used wavelet tree, say
a run-length compressed wavelet tree
(wt_rlmn<>
). We could recurse again and specify, each detail
of the wavelet tree (e.g. which rank support structure
should be used) but we stick now with the default
configuration which uses an sd_vector
for the
marking of the heads of the runs in the wavelet tree.
Lets choose at last a LCP array which uses
the topology of the CST and the CSA to
compress the LCP values (lcp_support_tree2
) and
stick with default template parameters for all
types. So the final type looks like this:
cst_sada<cst_wt<wt_rlmn<> >, lcp_support_tree2<> >
.
Now, lets explore the data structure a little bit. We take the english.100MB input from the Pizza&Chili-corpus, construct the CST-object, output its structure, and visualise it using the d3js-library. Have fun with the result.
The data structures in the library can be divided into several classes:
int_vector
)csa_wt
)rank_support_v
addes constant time rank(i)
-functionality to an object of type
bit_vector
, or an object of of type bp_support_sada
adds
find_open(i)
-functionality to a bit_vector
object, which
represents a balanced parentheses sequence.Each sdsl-class X
has to implement the following methods:
X()
X(const &X)
swap(const &X)
serialize(std::ostream &out, structure_tree_node* v, std::string name)
load(std::istream &in)
We provide many handy methods for sdsl objects in the util
namespace:
util::store_to_file(const X &x, const char* file_name)
stores the object x
to the fileutil::clear(X &x)
deletes the object and frees the space util::load_from_file(X &x, const char* file_name)
loads the object x
from the fileutil::assign(X &x, Y &y)
if the type of X
equals Y
, then x
and y
are swapped,
otherwise y
is assigned to x
by x = T(y)
util::get_size_in_bytes(const X &x)
returns the number of bytes needed to represent
object x
in memory.util::write_structure<FORMAT>(const X &x, std::ostream &out)
writes the structure
of the data structure in JSON or R format (FORMAT
=JSON_FORMAT
or R_FORMAT
)The library was successfully tested on the following configurations
We plan to support Windows in the near future.
The installation requires that the cmake tool
and a C++ compiler (e.g. from the GNU Compiler Collection)
is installed.
You can than install the library into a directory SDSL_INSTALL_DIR
by
calling
./install SDSL_INSTALL_DIR
If SDSL_INSTALL_DIR
is not specified your home directory is used.
Please use an absolute path name for SDSL_INSTALL_DIR
.
The library header files will be located in the directory
SDSL_INSTALL_DIR/include
and the library in the directory
SDSL_INSTALL_DIR/lib
. After the installation you can
execute the tests in the test
directory or start
with some code examples in the examples
folder.
We have used the gtest framework for the tests.
Compile with make
and run tests with make test
. We have another
target vtest
which runs the test with the valgrind tool.
make test
will try to download some texts from a
gutenberg.org mirror. See the README file in the directory for details.
Compile the examples with make
and experience
how esay it is to use succinct data structures.
The current version includes Yuta Mori's incredible fast suffix array construction library libdivsufsort version 2.0.1.
Here is a list of contributes:
Code:
Bug reports:
New contributors are welcome any time!
Have fun with the library!