Practically efficient methods for performing bit-reversed
permutation in C++11 on the x86-64 architecture
Knauth, Adas, Whitfield, Wang, Ickler, Conrad, Serang, 2017
https://arxiv.org/pdf/1708.01873.pdf
The performance improvement has been independently confirmed in Gnark https://github.com/Consensys/gnark-crypto/pull/446 on x86 (though it's slower than naive on Apple, probably due to significant memory bandwidth there).
Hey team,
I've been looking around the repo and I've seen that you consider bit-reversal permutations important enough to track them in benchmarks.
There is also a mention of cache-friendly algorithm: https://github.com/starkware-libs/stwo/blob/387a072dd7f4a56de3e196b779f9e392dfdd9406/crates/prover/src/core/utils.rs#L136-L153
The following is an in-place algorithm that is 33% faster than naive using cache-blocking (on my machine for EIP-4844 size):
https://github.com/mratsim/constantine/blob/65147ed/constantine/math/polynomials/fft.nim#L203-L295
Reference papers
Towards an Optimal Bit-Reversal Permutation Program Larry Carter and Kang Su Gatlin, 1998 https://csaws.cs.technion.ac.il/~itai/Courses/Cache/bit.pdf
Practically efficient methods for performing bit-reversed permutation in C++11 on the x86-64 architecture Knauth, Adas, Whitfield, Wang, Ickler, Conrad, Serang, 2017 https://arxiv.org/pdf/1708.01873.pdf
The performance improvement has been independently confirmed in Gnark https://github.com/Consensys/gnark-crypto/pull/446 on x86 (though it's slower than naive on Apple, probably due to significant memory bandwidth there).
Image courtesy of @gbotrel (amd desktop)