Open daniel-j-h opened 3 years ago
I have added the building blocks for succinct rank/select data structures
uses the BMI2 instruction set on modern processors for uint64 rank/select.
From here we can implement e.g. the poppy paper above, or a different approach.
Alex Bowe has some amazing material from this domain; we should check out his writing
https://raw.githubusercontent.com/alexbowe/wavelet-paper/thesis/thesis.pdf
See also this paper describing https://github.com/ot/succinct
Also this thesis going with it.
This paper https://arxiv.org/abs/1706.00990 describes the machine word select we are using for succinct data structures
The paper http://www.cs.cmu.edu/~dga/papers/zhou-sea2013.pdf - termed "poppy" - from
"Space-efficient, high-performance rank and select structures on uncompressed bit sequences" by Zhou, Andersen, and Kaminsky
sounds very practical keeping in mind the realities of hardware architectures and cache hierarchies; they
and more importantly they also provide a simple basic rank sketch which is simple to implement, has 12.5% space overhead, and allows us then to iterate easily in the future. The first iteration builds up a single index; the second iteration improves on this to get down to 3.2% space overhead for rank.
Their main insights (for rank, since that's what I have read so far) are as follows
This means we create an index and store one 64bit uint for every 64 bytes (= 512 bits) in the original bit vector.
During rank query time we use the pre-computed index, plus the remaining bits in the last basic block.
Adding here
We already provide the machine word popcount functionality in the bits module
and memory alignment to a 64 byte cache line we can simply do with posix_memalign(&p, 64, size)
.
Since opening this issue there has been research to improve on the cs-poppy solution we wanted to start with here.
An improvement on poppy, called "pasta" (see "pasta-flat" in the paper):
Engineering Compact Data Structures for Rank and Select Queries on Bit Vectors
In practice, the smallest (uncompressed) rank and select data structure cs-poppy has a space overhead of ≈ 3.51 % [Zhou et al., SEA ‘13]. Using the same overhead, we present a data structure that can answer queries up to 8 % (rank) and 16.5 % (select) faster compared with cs-poppy.
An improvement on poppy and pasta altogether:
SPIDER: Improved Succinct Rank and Select Performance
However, rank and select query performance still incurs a tradeoff between query time and space. For example, Vigna [ 27 ] gives a data structure for rank queries using 25% space that is roughly 19% faster than pasta-flat, and a data structure for select queries using 12.2% space, which is roughly 65% faster than pasta-flat.
and a few tricks
I have a first version ready in https://github.com/tinygraph/tinygraph/pull/41
We should look into implementing rank/select structures for bitset / elias-fano https://github.com/tinygraph/tinygraph/issues/13
For example have a look at
which seems to be simple and practical.
cc @ucyo