Optim: bit-reversal permutation with cache-blocking

Hey team,

I've been looking around the repo and I've seen that you consider bit-reversal permutations important enough to track them in benchmarks.

There is also a mention of cache-friendly algorithm: https://github.com/starkware-libs/stwo/blob/387a072dd7f4a56de3e196b779f9e392dfdd9406/crates/prover/src/core/utils.rs#L136-L153

The following is an in-place algorithm that is 33% faster than naive using cache-blocking (on my machine for EIP-4844 size):

https://github.com/mratsim/constantine/blob/65147ed/constantine/math/polynomials/fft.nim#L203-L295

Reference papers

Towards an Optimal Bit-Reversal Permutation Program Larry Carter and Kang Su Gatlin, 1998 https://csaws.cs.technion.ac.il/~itai/Courses/Cache/bit.pdf
Practically efficient methods for performing bit-reversed permutation in C++11 on the x86-64 architecture Knauth, Adas, Whitfield, Wang, Ickler, Conrad, Serang, 2017 https://arxiv.org/pdf/1708.01873.pdf

The performance improvement has been independently confirmed in Gnark https://github.com/Consensys/gnark-crypto/pull/446 on x86 (though it's slower than naive on Apple, probably due to significant memory bandwidth there).

Image courtesy of @gbotrel (amd desktop)

starkware-libs / stwo

Optim: bit-reversal permutation with cache-blocking #813