imneme / pcg-cpp

PCG — C++ Implementation

Optimized unxorshift #82

Open LRFLEW opened 1 year ago

LRFLEW commented 1 year ago

I was looking through how you implemented unxorshift, and noticed that it seems over-complicated and could be optimized with a simpler approach.

The current implementation un-xorshifts the value shift bits at a time. The idea there is clear: only the top shift bits are unaffected by the xorshift, so you first use those to recover the next shift bits, which are then used to recover the next shift bits, and so on. The current implementation chose to implement this repetition recursively instead of iteratively for some reason, which may be suboptimal if the compiler doesn't optimize the recursion away (which it is less likely to do, since it isn't tail recursion).
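A minimal sketch of that idea (illustrative only, not the library's actual code; it assumes x fits in bits bits) looks like this:

```cpp
#include <cstdint>

// Sketch of the "shift bits at a time" recovery: given y = x ^ (x >> shift),
// x also satisfies x == y ^ (x >> shift), so each pass extends the region of
// correctly recovered high bits by another `shift` bits.
uint32_t unxorshift_stepwise(uint32_t y, unsigned bits, unsigned shift)
{
    uint32_t x = y;                        // top `shift` bits already match x
    for (unsigned known = shift; known < bits; known += shift)
        x = y ^ (x >> shift);              // recover the next `shift` bits
    return x;
}
```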

My implementation makes use of the fact that if y = x ^ (x >> c), then y ^ (y >> c) = x ^ (x >> 2c):

  y ^ (y >> c)
= (x ^ (x >> c)) ^ ((x ^ (x >> c)) >> c)
= x ^ (x >> c) ^ (x >> c) ^ (x >> 2c)
= x ^ (x >> 2c)

This allows for a simple recursive step: unxorshift(x, bits, shift) -> unxorshift(x ^ (x >> shift), bits, shift * 2). Since this is tail recursion, it can be trivially rewritten as an iterative solution, which I went ahead and did manually in this PR (since it was pretty simple to do). I could have implemented it as a for loop, but since the initial loop condition check is unnecessary given the precondition shift < bits, implementing it with a do-while loop removes a conditional jump from the compiled assembly.
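As a hedged sketch of that doubling step (the PR's actual code may differ in details; this assumes 0 < shift < bits <= 32 and that x fits in bits bits):

```cpp
#include <cstdint>

// Doubling approach: maintain the invariant y == x ^ (x >> shift) while
// doubling `shift` each iteration, until shift >= bits and the shifted term
// no longer contributes any bits.
uint32_t unxorshift_doubling(uint32_t y, unsigned bits, unsigned shift)
{
    do {
        y ^= y >> shift;   // y was x ^ (x >> shift); now it is x ^ (x >> 2*shift)
        shift *= 2;
    } while (shift < bits);
    return y;              // shift >= bits, so (x >> shift) contributes nothing and y == x
}
```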

This new implementation also scales better with larger input sizes. The current implementation recurses ceil(bits / shift) - 1 times, which, for a constant value of shift, is linear in bits. My implementation loops ceil(log_2(bits / shift)) times, which is logarithmic in bits. For example, with bits = 64 and shift = 4, that's 15 recursive steps versus 4 loop iterations.

You can see just how much this streamlines the implementation by comparing the resulting assembly code (shown for 32-bit values): https://godbolt.org/z/E64G7vq83 (using Clang, because GCC applies some optimizations that make the comparison look worse than it actually is).

LRFLEW commented 1 year ago

I decided to actually benchmark this change so that I had some numbers to go with it. I used this code for the benchmark. Here are the results when I ran it on my machine (compiled with MSVC for x64, run on an Intel 8700K). The number after the slash is the value of shift (I only included values that result in a different number of iteration/recursion steps), and all tests were done with 32-bit values.

Run on (12 X 3696 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
----------------------------------------------------------------
BM<unxorshift_new>/1        4.63 ns         4.50 ns    149333333
BM<unxorshift_new>/2        3.60 ns         3.61 ns    203636364
BM<unxorshift_new>/4        3.35 ns         3.37 ns    213333333
BM<unxorshift_new>/8        2.39 ns         2.35 ns    298666667
BM<unxorshift_new>/16       1.46 ns         1.46 ns    448000000
BM<unxorshift_old>/1         101 ns          103 ns      7466667
BM<unxorshift_old>/2        47.1 ns         47.5 ns     15448276
BM<unxorshift_old>/3        30.3 ns         30.5 ns     23578947
BM<unxorshift_old>/4        21.2 ns         20.9 ns     34461538
BM<unxorshift_old>/5        16.9 ns         16.9 ns     40727273
BM<unxorshift_old>/6        13.8 ns         13.8 ns     49777778
BM<unxorshift_old>/7        10.6 ns         10.7 ns     74666667
BM<unxorshift_old>/8        7.28 ns         7.25 ns    112000000
BM<unxorshift_old>/11       4.18 ns         4.17 ns    172307692
BM<unxorshift_old>/16       1.56 ns         1.53 ns    448000000

(Side note: the way the test is written, the shift amount is treated as a dynamic value, so it behaves like it would for an inverse RXS step. When the shift amount is a constant, the compiler can unroll the loop to further optimize my implementation. I think the unrolled code looks particularly elegant in ARM.)
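For reference, a hedged guess at what such a Google Benchmark harness could look like (the actual benchmark code linked above is not reproduced here; the function bodies, input value, and argument range below are illustrative assumptions):

```cpp
#include <benchmark/benchmark.h>
#include <cstdint>

// Doubling version (sketch from above).
static uint32_t unxorshift_new(uint32_t y, unsigned bits, unsigned shift)
{
    do { y ^= y >> shift; shift *= 2; } while (shift < bits);
    return y;
}

// Shift-bits-at-a-time version (sketch from above).
static uint32_t unxorshift_old(uint32_t y, unsigned bits, unsigned shift)
{
    uint32_t x = y;
    for (unsigned known = shift; known < bits; known += shift)
        x = y ^ (x >> shift);
    return x;
}

// The shift amount comes from state.range(0), so it is a runtime (dynamic)
// value, matching the side note above.
template <uint32_t (*Fn)(uint32_t, unsigned, unsigned)>
static void BM(benchmark::State& state)
{
    const unsigned shift = static_cast<unsigned>(state.range(0));
    const uint32_t x = 0x9E3779B9u;            // arbitrary test input
    const uint32_t y = x ^ (x >> shift);
    for (auto _ : state)
        benchmark::DoNotOptimize(Fn(y, 32, shift));
}

BENCHMARK_TEMPLATE(BM, unxorshift_new)->DenseRange(1, 16);
BENCHMARK_TEMPLATE(BM, unxorshift_old)->DenseRange(1, 16);
BENCHMARK_MAIN();
```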