rust-random / rand

A Rust library for random number generation.
https://crates.io/crates/rand
Other
1.67k stars 432 forks source link

Use `f32` arithmetic for `f32` normal distribution #1485

Closed swenson closed 1 month ago

swenson commented 3 months ago

Summary

The current normal distribution f32 implementation just computes the f64 version and converts the result to f32.

Motivation

The TODO suggested this, and I agreed. This should be significantly faster on platforms where single-precision floats are faster.

Details

I converted the code to use f32 arithmetic and adjusted the number of bits used for the calculation of u, the base number.

In addition to the tests, I also generated some normal numbers with the current f64 and the new f32 paths and ensured that they both look "normal".

Tests that depended on the specific f32 values generated by the old normal distribution code had to be updated, since we now use u32-sized random numbers as the source, which will be different from the u64s used previously.

I also fixed some comments that were confusing and possibly incorrect in the ziggurat code.

I was originally tempted to convert the ziggurat tables to f32 for all versions, but this appears to affect the precision of the f64 generated numbers too much.

swenson commented 3 months ago

Hmm. Benchmarks for this are not encouraging on my Threadripper 3970X (Zen 2), which I have heard are particularly weak in their FPU.

Before:

normal/normal           time:   [11874.0815 cycles 11893.9016 cycles 11912.4901 cycles]
                        thrpt:  [2.9781 cpb 2.9735 cpb 2.9685 cpb]

After (with this change):

normal/normal           time:   [19646.2155 cycles 19662.4413 cycles 19682.2019 cycles]
                        thrpt:  [4.9206 cpb 4.9156 cpb 4.9116 cpb]
                 change:
                        time:   [+65.022% +65.173% +65.332%] (p = 0.00 < 0.05)
                        thrpt:  [-39.516% -39.457% -39.402%]

I'll try to test on an Intel CPU later to see if it is different. (I was not able to compile on my arm64 due to criterion.)

swenson commented 3 months ago

Intel (Core Ultra 7 165H) before:

normal/normal           time:   [9380.9673 cycles 9662.9064 cycles 9959.9462 cycles]
                        thrpt:  [2.4900 cpb 2.4157 cpb 2.3452 cpb]

After:

normal/normal           time:   [13374.5863 cycles 13813.0605 cycles 14315.2989 cycles]
                        thrpt:  [3.5788 cpb 3.4533 cpb 3.3436 cpb]
                 change:
                        time:   [+34.629% +41.460% +48.790%] (p = 0.00 < 0.05)
                        thrpt:  [-32.791% -29.309% -25.722%]

Better, but still a serious regression for f32 performance on modern x86-64 platforms.

Which makes me think it is not worth doing this, or it we should lock it behind a feature or architecture-specific gating?

dhardy commented 3 months ago

If someone has a use-case (a platform where this change is beneficial), I'd like to see their benchmarks/evidence here.

Complication: this is a value-breaking change. This is an argument against implementing this via a feature-flag (though not a complete blocker).

I suggest we leave this open for a couple of weeks to see if anyone can provide evidence of benefit, otherwise close (we can revisit later if and when evidence pops up).

swenson commented 3 months ago

Makes sense.

It is possible to make it less value-breaking if we generated u64 random values and pulled bits in a similar way to the f64 code path, though the results might differ slightly due to rounding. But, that is potentially slower and probably still won't be exactly the same.

I have a bunch of 32-bit microcontroller development boards, but none that run a real OS, so I'd have to write a custom benchmarking program for them to see if this is beneficial.

dhardy commented 3 months ago

As you say, results would still not be identical so I don't see the benefit of that approach.

It's unclear to me if this is something that would benefit you personally or just some TODO you found. If the former, please quantify.

swenson commented 3 months ago

This was mostly a TODO I found that would be simple enough and possibly worth the optimization. (I do a lot of 32-bit embedded development, but rarely use FP.)

dhardy commented 1 month ago

Closing since no one has provided any actual motivation for this.