jltsiren / simple-sds

Simple succinct data structures (in Rust)
MIT License
45 stars 7 forks source link

Run-length encoded bitvector #16

Closed jltsiren closed 9 months ago

jltsiren commented 9 months ago

This PR adds a run-length encoded bitvector similar to the one used in RLCSA. The vector is encoded as a sequence of maximal runs of unset and set bits.

If there are n0 unset bits followed by n1 set bits, the run is encoded as a pair of integers (n0, n1 - 1). The integers are encoded using 4-bit code units with a 3+1-bit encoding. The encoded runs are concatenated to form 32-byte blocks, which are padded with 0s when necessary. For each block, we store a sample (rank(i, 1), i), where i is the number of bits encoded up to that block. There are also second-level indexes for narrowing down the range of samples for each query type.