Closed: swenson closed this 1 month ago
Hmm. Benchmarks for this are not encouraging on my Threadripper 3970X (Zen 2), which I have heard is particularly weak in its FPU.
Before:

```
normal/normal    time:   [11874.0815 cycles 11893.9016 cycles 11912.4901 cycles]
                 thrpt:  [2.9781 cpb 2.9735 cpb 2.9685 cpb]
```

After (with this change):

```
normal/normal    time:   [19646.2155 cycles 19662.4413 cycles 19682.2019 cycles]
                 thrpt:  [4.9206 cpb 4.9156 cpb 4.9116 cpb]
          change:
                 time:   [+65.022% +65.173% +65.332%] (p = 0.00 < 0.05)
                 thrpt:  [-39.516% -39.457% -39.402%]
```
I'll try to test on an Intel CPU later to see if it is different. (I was not able to compile on my arm64 due to `criterion`.)
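For reference, a minimal sketch of a harness that could produce numbers like the above, assuming `criterion` plus `rand`/`rand_distr`. The real benchmark presumably uses a cycles-per-byte measurement (e.g. the `criterion-cycles-per-byte` plugin) to report `cycles`/`cpb`; plain wall-clock criterion is shown here, and the names and batch size are arbitrary choices of mine:

```rust
use std::hint::black_box;

use criterion::{criterion_group, criterion_main, Criterion};
use rand::prelude::*;
use rand_distr::Normal;

// Sample a fixed number of f32 normal values per iteration and fold them
// into an accumulator so the sampling cannot be optimized away.
fn normal_f32(c: &mut Criterion) {
    let mut rng = StdRng::seed_from_u64(42);
    let dist = Normal::<f32>::new(0.0, 1.0).unwrap();
    c.bench_function("normal/normal", |b| {
        b.iter(|| {
            let mut acc = 0.0f32;
            for _ in 0..4000 {
                acc += dist.sample(&mut rng);
            }
            black_box(acc)
        })
    });
}

criterion_group!(benches, normal_f32);
criterion_main!(benches);
```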
Intel (Core Ultra 7 165H) before:

```
normal/normal    time:   [9380.9673 cycles 9662.9064 cycles 9959.9462 cycles]
                 thrpt:  [2.4900 cpb 2.4157 cpb 2.3452 cpb]
```

After:

```
normal/normal    time:   [13374.5863 cycles 13813.0605 cycles 14315.2989 cycles]
                 thrpt:  [3.5788 cpb 3.4533 cpb 3.3436 cpb]
          change:
                 time:   [+34.629% +41.460% +48.790%] (p = 0.00 < 0.05)
                 thrpt:  [-32.791% -29.309% -25.722%]
```
Better, but still a serious regression for `f32` performance on modern x86-64 platforms.
Which makes me think it is not worth doing this, or that we should lock it behind a feature flag or architecture-specific gating?
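To make the gating idea concrete, a hypothetical sketch; the target list below is a guess on my part, and the same shape works with `#[cfg(feature = "...")]` for a feature flag:

```rust
use rand::Rng;
use rand_distr::{Distribution, StandardNormal};

// Compile a native f32 path only on architectures expected to benefit;
// keep today's f64-then-narrow behavior everywhere else.
#[cfg(any(target_arch = "arm", target_arch = "riscv32"))]
fn sample_normal_f32<R: Rng + ?Sized>(rng: &mut R) -> f32 {
    // A native f32 ziggurat would live here; this stub just calls the
    // existing f32 sampler so the sketch compiles.
    StandardNormal.sample(rng)
}

#[cfg(not(any(target_arch = "arm", target_arch = "riscv32")))]
fn sample_normal_f32<R: Rng + ?Sized>(rng: &mut R) -> f32 {
    // Status quo: sample f64, narrow to f32.
    let x: f64 = StandardNormal.sample(rng);
    x as f32
}
```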
If someone has a use-case (a platform where this change is beneficial), I'd like to see their benchmarks/evidence here.
Complication: this is a value-breaking change, which is an argument against implementing it via a feature flag (though not a complete blocker).
I suggest we leave this open for a couple of weeks to see if anyone can provide evidence of benefit, otherwise close (we can revisit later if and when evidence pops up).
Makes sense.
It is possible to make it less value-breaking if we generated `u64` random values and pulled bits from them in a similar way to the `f64` code path, though the results might differ slightly due to rounding. But that is potentially slower and probably still won't be exactly the same.
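Roughly this shape (the bit positions are invented for illustration, not the actual layout the `f64` path uses):

```rust
// Consume a full u64 per sample, as the f64 path does, but build the f32
// base value `u` in [-1, 1) from the high bits. Per-draw RNG consumption
// then matches the f64 path, though the rounded values still differ.
fn u_f32_from_u64(bits: u64) -> f32 {
    let frac = (bits >> 40) as f32 / (1u64 << 24) as f32; // top 24 bits -> [0, 1)
    if bits & (1u64 << 39) != 0 { -frac } else { frac }   // next bit: sign
}
```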
I have a bunch of 32-bit microcontroller development boards, but none that run a real OS, so I'd have to write a custom benchmarking program for them to see if this is beneficial.
As you say, results would still not be identical so I don't see the benefit of that approach.
It's unclear to me if this is something that would benefit you personally or just some TODO you found. If the former, please quantify.
This was mostly a TODO I found that would be simple enough and possibly worth the optimization. (I do a lot of 32-bit embedded development, but rarely use FP.)
Closing since no one has provided any actual motivation for this.
Summary

The current normal distribution `f32` implementation just computes the `f64` version and converts the result to `f32`.

- `CHANGELOG.md` entry

Motivation
The TODO suggested this, and I agreed. This should be significantly faster on platforms where single-precision floats are faster.
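For context, the status quo described in the summary is roughly the following (a sketch using `rand_distr`'s public API, not its literal internal code):

```rust
use rand::Rng;
use rand_distr::{Distribution, StandardNormal};

// Status quo: an f32 normal sample is the f64 sample, narrowed.
fn normal_f32_via_f64<R: Rng + ?Sized>(rng: &mut R) -> f32 {
    let x: f64 = StandardNormal.sample(rng);
    x as f32 // full f64 ziggurat, then one extra rounding step
}
```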
Details

I converted the code to use `f32` arithmetic and adjusted the number of bits used for the calculation of `u`, the base number. In addition to the tests, I also generated some normal numbers with the current `f64` and the new `f32` paths and ensured that they both look "normal".
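As an illustration of the adjustment to `u` (the constants and bit positions here are mine, not the PR's): a `u32` source leaves at most 24 bits for the fraction, matching `f32`'s 24-bit significand, where the `f64` path can afford a much wider fraction from its `u64` draw.

```rust
// Build `u` in [-1, 1) from a single u32 draw: 24 bits of fraction
// (f32's significand width) plus one bit of sign.
fn u_from_u32(bits: u32) -> f32 {
    let frac = (bits >> 8) as f32 / (1u32 << 24) as f32; // top 24 bits -> [0, 1)
    if bits & (1u32 << 7) != 0 { -frac } else { frac }   // next bit: sign
}
```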
Tests that depended on the specific `f32` values generated by the old normal distribution code had to be updated, since we now use `u32`-sized random numbers as the source, which will be different from the `u64`s used previously. I also fixed some comments that were confusing and possibly incorrect in the ziggurat code.
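Since the exact values differ between the `u64`- and `u32`-sourced paths, only aggregate properties can be compared; the "looks normal" spot check can be as simple as the following (my sketch, not the PR's actual test):

```rust
use rand::prelude::*;
use rand_distr::{Distribution, Normal};

fn main() {
    let mut rng = StdRng::seed_from_u64(1);
    let dist = Normal::<f32>::new(0.0, 1.0).unwrap();
    let n = 1_000_000usize;
    // Accumulate in f64 so the check itself adds no f32 rounding noise.
    let (mut sum, mut sum_sq) = (0.0f64, 0.0f64);
    for _ in 0..n {
        let x = dist.sample(&mut rng) as f64;
        sum += x;
        sum_sq += x * x;
    }
    let mean = sum / n as f64;
    let std_dev = (sum_sq / n as f64 - mean * mean).sqrt();
    println!("mean = {mean:.4} (expect ~0), std dev = {std_dev:.4} (expect ~1)");
}
```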
I was originally tempted to convert the ziggurat tables to `f32` for all versions, but this appears to affect the precision of the `f64` generated numbers too much.
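To make the concern concrete (the constant below is illustrative, not an actual table entry): rounding a table value through `f32` perturbs it by up to about 2^-24 relative error, which then propagates into every `f64` result computed from it.

```rust
fn main() {
    let entry: f64 = 3.654152885361008796; // stand-in for a ziggurat table value
    let narrowed = entry as f32 as f64;    // table stored as f32, used in f64 math
    println!("relative error: {:e}", ((entry - narrowed) / entry).abs());
}
```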