imneme / pcg-cpp

PCG — C++ Implementation

Optimized unxorshift #82

Open LRFLEW opened 1 year ago

LRFLEW commented 1 year ago

I was looking through how you implemented unxorshift, and noticed that it seems over-complicated and could be optimized with a simpler approach.

The current implementation un-xorshifts the value shift bits at a time. The idea there is clear: only the top shift bits are unaffected by the xorshift, so you first use those to recover the next shift bits, which are then used to recover the next shift bits, and so on. The current implementation chose to implement this repetition recursively instead of iteratively for some reason, which may be suboptimal if the compiler doesn't optimize the recursion away (which it is less likely to do, since it isn't tail recursion).
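A minimal sketch of that idea (illustrative only, not the library's actual code; it assumes x fits in bits bits) looks like this:

```cpp
#include <cstdint>

// Sketch of the "shift bits at a time" recovery: given y = x ^ (x >> shift),
// x also satisfies x == y ^ (x >> shift), so each pass extends the region of
// correctly recovered high bits by another `shift` bits.
uint32_t unxorshift_stepwise(uint32_t y, unsigned bits, unsigned shift)
{
    uint32_t x = y;                        // top `shift` bits already match x
    for (unsigned known = shift; known < bits; known += shift)
        x = y ^ (x >> shift);              // recover the next `shift` bits
    return x;
}
```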

My implementation makes use of the fact that if y = x ^ (x >> c), then y ^ (y >> c) = x ^ (x >> 2c):

  y ^ (y >> c)
= (x ^ (x >> c)) ^ ((x ^ (x >> c)) >> c)
= x ^ (x >> c) ^ (x >> c) ^ (x >> 2c)
= x ^ (x >> 2c)

This allows for a simple recursive step: unxorshift(x, bits, shift) -> unxorshift(x ^ (x >> shift), bits, shift * 2). Since this is tail recursion, it can be trivially rewritten as an iterative solution, which I went ahead and did manually in this PR (since it was pretty simple to do). I could have implemented it as a for loop, but since the initial loop condition check is unnecessary given the precondition shift < bits, implementing it with a do-while loop removes a conditional jump from the compiled assembly.
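As a hedged sketch of that doubling step (the PR's actual code may differ in details; this assumes 0 < shift < bits <= 32 and that x fits in bits bits):

```cpp
#include <cstdint>

// Doubling approach: maintain the invariant y == x ^ (x >> shift) while
// doubling `shift` each iteration, until shift >= bits and the shifted term
// no longer contributes any bits.
uint32_t unxorshift_doubling(uint32_t y, unsigned bits, unsigned shift)
{
    do {
        y ^= y >> shift;   // y was x ^ (x >> shift); now it is x ^ (x >> 2*shift)
        shift *= 2;
    } while (shift < bits);
    return y;              // shift >= bits, so (x >> shift) contributes nothing and y == x
}
```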

This new implementation also scales better with larger input sizes. The current implementation recurses ceil(bits / shift) - 1 times, which, for a constant value of shift, is linear in bits. My implementation loops ceil(log_2(bits / shift)) times, which is logarithmic in bits. For example, with bits = 64 and shift = 4, that's 15 recursive steps versus 4 loop iterations.

You can see just how much this streamlines the implementation by comparing the resulting assembly code (shown for 32-bit values): https://godbolt.org/z/E64G7vq83 (using Clang, because GCC applies some optimizations that make the comparison look worse than it actually is).

LRFLEW commented 1 year ago

I decided to actually benchmark this change so that I had some numbers to go with it. I used this code for the benchmark. Here are the results when I ran it on my machine (compiled with MSVC for x64, run on an Intel 8700K). The number after the slash is the value of shift (I only included values that result in a different number of iteration/recursion steps), and all tests were done with 32-bit values.

Run on (12 X 3696 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
----------------------------------------------------------------
BM<unxorshift_new>/1        4.63 ns         4.50 ns    149333333
BM<unxorshift_new>/2        3.60 ns         3.61 ns    203636364
BM<unxorshift_new>/4        3.35 ns         3.37 ns    213333333
BM<unxorshift_new>/8        2.39 ns         2.35 ns    298666667
BM<unxorshift_new>/16       1.46 ns         1.46 ns    448000000
BM<unxorshift_old>/1         101 ns          103 ns      7466667
BM<unxorshift_old>/2        47.1 ns         47.5 ns     15448276
BM<unxorshift_old>/3        30.3 ns         30.5 ns     23578947
BM<unxorshift_old>/4        21.2 ns         20.9 ns     34461538
BM<unxorshift_old>/5        16.9 ns         16.9 ns     40727273
BM<unxorshift_old>/6        13.8 ns         13.8 ns     49777778
BM<unxorshift_old>/7        10.6 ns         10.7 ns     74666667
BM<unxorshift_old>/8        7.28 ns         7.25 ns    112000000
BM<unxorshift_old>/11       4.18 ns         4.17 ns    172307692
BM<unxorshift_old>/16       1.56 ns         1.53 ns    448000000

(Side note: the way the test is written, the shift amount is treated as a dynamic value, so it behaves like it would for an inverse RXS step. When the shift amount is a constant, the compiler can unroll the loop to further optimize my implementation. I think the unrolled code looks particularly elegant in ARM.)
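For reference, a hedged guess at what such a Google Benchmark harness could look like (the actual benchmark code linked above is not reproduced here; the function bodies, input value, and argument range below are illustrative assumptions):

```cpp
#include <benchmark/benchmark.h>
#include <cstdint>

// Doubling version (sketch from above).
static uint32_t unxorshift_new(uint32_t y, unsigned bits, unsigned shift)
{
    do { y ^= y >> shift; shift *= 2; } while (shift < bits);
    return y;
}

// Shift-bits-at-a-time version (sketch from above).
static uint32_t unxorshift_old(uint32_t y, unsigned bits, unsigned shift)
{
    uint32_t x = y;
    for (unsigned known = shift; known < bits; known += shift)
        x = y ^ (x >> shift);
    return x;
}

// The shift amount comes from state.range(0), so it is a runtime (dynamic)
// value, matching the side note above.
template <uint32_t (*Fn)(uint32_t, unsigned, unsigned)>
static void BM(benchmark::State& state)
{
    const unsigned shift = static_cast<unsigned>(state.range(0));
    const uint32_t x = 0x9E3779B9u;            // arbitrary test input
    const uint32_t y = x ^ (x >> shift);
    for (auto _ : state)
        benchmark::DoNotOptimize(Fn(y, 32, shift));
}

BENCHMARK_TEMPLATE(BM, unxorshift_new)->DenseRange(1, 16);
BENCHMARK_TEMPLATE(BM, unxorshift_old)->DenseRange(1, 16);
BENCHMARK_MAIN();
```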