select_zero for SparseVector

Keats commented 2 years ago

Hello there,

First thanks for the very nice library!

I see in the README on Planned Functionality:

select_zero() for SparseVector?

and in the docs:

/// For dense bitvectors or when [SelectZero] is needed, [BitVector] is a better choice.

Our usecase is the following: we have big bitvectors (7bn+ bits) and some of them are very dense (think 99+%). We already use SparseVector when they are sparse and we currently use https://github.com/ajalab/fid when they are not. It works ok but FID is 10-30x slower to build than simple-sds bitvectors. One solution would be to use an inverted sparse vector but we would need to be able to select the real 1s on it. Does it make sense? I could use a normal BitVector but it would end up being much bigger than a SparseVector.

jltsiren commented 2 years ago

Elias–Fano encoding does not store enough information to support indexed access to the missing values efficiently. The SDSL equivalent of select_zero() is basically binary search with select() to find the right run of 0s and the total number of 0s before that run. You can optimize it a bit by precomputing the results for the first few levels of binary search. And once the range is narrow enough, using select_iter() instead of select() and continuing with a sequential scan will be faster.

I'll probably implement this eventually, but determining good parameters for the optimizations will take some time.

Keats commented 2 years ago

Amazing, thanks!

jltsiren / simple-sds

select_zero for SparseVector #10