s-yata / marisa-trie

MARISA: Matching Algorithm with Recursively Implemented StorAge
Other
515 stars 89 forks source link

Optimize select_bit #52

Open jmr opened 1 year ago

jmr commented 1 year ago
  1. AArch64 byte-wise popcount optimization

ARM NEON has a byte-wise popcount instruction, which helps to optimize select_bit and PopCount::count. Use it for AArch64 (64-bit ARM).

15% speedup for Rank1, 4% for Select0 and 3% for Select1. (60% for PopCount::count itself.)

  1. byte-serial 32-bit version

This gives a 9% speedup on select0 and 7% on select1. (Tested on Pixel 3 in armeabi-v7a mode.)

This is likely because the branches of this unrolled linear search are more predictable than the binary search that was used previously.

  1. Use lookup table

Instead of computing (counts | MASK_80) - ((i + 1) * MASK_01), we pre-compute a lookup table

PREFIX_SUM_OVERFLOW[i] = (0x80 - (i + 1)) * MASK_01 = (0x7F - i) * MASK_01

then use counts + PREFIX_SUM_OVERFLOW[i].

This uses a UInt64[64] or 0.5kiB lookup table. The trick is from: Gog, Simon and Matthias Petri. “Optimized succinct data structures for massive data.” Software: Practice and Experience 44 (2014): 1287 - 1314.

https://www.semanticscholar.org/paper/Optimized-succinct-data-structures-for-massive-data-Gog-Petri/c7e7f02f441ebcc0aeffdcad2964185926551ec3

This gives a 2-3% speedup for BitVector::select0/select1.

jmr commented 1 year ago

@glebm