jmr opened this issue 1 year ago (status: Open)
ARM NEON has a byte-wise popcount instruction, which helps to optimize `select_bit` and `PopCount::count`. Use it for AArch64 (64-bit ARM).
15% speedup for `Rank1`, 4% for `Select0`, and 3% for `Select1` (60% for `PopCount::count` itself).
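For reference, a minimal sketch of how the byte-wise NEON popcount instruction (`CNT`) can count the bits of a 64-bit word; the function name is illustrative and not marisa-trie's actual API:

```cpp
#include <arm_neon.h>
#include <cstdint>

// Counts the set bits of a 64-bit word using the NEON byte-wise popcount
// instruction (CNT). Hypothetical helper, not marisa-trie's actual code.
inline std::uint64_t NeonPopCount(std::uint64_t x) {
  // vcnt_u8 counts the set bits in each of the 8 bytes independently.
  uint8x8_t per_byte = vcnt_u8(vcreate_u8(x));
#if defined(__aarch64__)
  // AArch64 can sum the 8 byte counts with a single horizontal add.
  return vaddv_u8(per_byte);
#else
  // 32-bit ARM: fold the byte counts with pairwise widening adds instead.
  return vget_lane_u64(vpaddl_u32(vpaddl_u16(vpaddl_u8(per_byte))), 0);
#endif
}
```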
Replacing the previous binary search with an unrolled linear search gives a 9% speedup on `select0` and 7% on `select1`. (Tested on a Pixel 3 in armeabi-v7a mode.) This is likely because the branches of the unrolled linear search are more predictable than those of the binary search used previously.
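A minimal sketch of what such an unrolled linear search can look like, assuming per-byte prefix popcounts of the target word are available; names and layout are illustrative, not marisa-trie's actual code:

```cpp
#include <cstdint>

// Finds the byte of a 64-bit word that holds the (i + 1)-th set bit, given
// the cumulative per-byte popcounts. Hypothetical helper for illustration.
inline int SelectByteLinear(const std::uint8_t prefix_sums[8], unsigned i) {
  // An unrolled linear scan: each comparison branch tends to be easier for
  // the predictor than the data-dependent branches of a binary search.
  if (i < prefix_sums[0]) return 0;
  if (i < prefix_sums[1]) return 1;
  if (i < prefix_sums[2]) return 2;
  if (i < prefix_sums[3]) return 3;
  if (i < prefix_sums[4]) return 4;
  if (i < prefix_sums[5]) return 5;
  if (i < prefix_sums[6]) return 6;
  return 7;
}
```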
Instead of computing `(counts | MASK_80) - ((i + 1) * MASK_01)`, we pre-compute a lookup table

`PREFIX_SUM_OVERFLOW[i] = (0x80 - (i + 1)) * MASK_01 = (0x7F - i) * MASK_01`

and then use `counts + PREFIX_SUM_OVERFLOW[i]`. This uses a `UInt64[64]` (0.5 KiB) lookup table. The trick is from: Gog, Simon and Matthias Petri. "Optimized succinct data structures for massive data." Software: Practice and Experience 44 (2014): 1287–1314. https://www.semanticscholar.org/paper/Optimized-succinct-data-structures-for-massive-data-Gog-Petri/c7e7f02f441ebcc0aeffdcad2964185926551ec3
This gives a 2-3% speedup for `BitVector::select0`/`select1`.
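A minimal sketch of the table and its use, following the formula above; names are illustrative and this is not marisa-trie's actual implementation:

```cpp
#include <array>
#include <cstdint>

constexpr std::uint64_t MASK_01 = 0x0101010101010101ULL;
constexpr std::uint64_t MASK_80 = 0x8080808080808080ULL;

// Hypothetical construction of the 64-entry table described above:
// PREFIX_SUM_OVERFLOW[i] = (0x7F - i) * MASK_01.
static const std::array<std::uint64_t, 64> PREFIX_SUM_OVERFLOW = [] {
  std::array<std::uint64_t, 64> t{};
  for (unsigned i = 0; i < 64; ++i) t[i] = (0x7FULL - i) * MASK_01;
  return t;
}();

// Every byte of `counts` holds a per-byte prefix popcount (at most 64, so
// below 0x80), so the addition never carries across byte boundaries and is
// equivalent to (counts | MASK_80) - ((i + 1) * MASK_01). The high bit of
// byte k ends up set exactly where the prefix sum exceeds i, i.e. once the
// (i + 1)-th set bit has been reached.
inline std::uint64_t SelectOverflowMask(std::uint64_t counts, unsigned i) {
  return (counts + PREFIX_SUM_OVERFLOW[i]) & MASK_80;
}
```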