OlivierSohn opened 6 years ago
Essentially it's about avoiding two stores into the array and then reading them back immediately. On the surface it looks really similar to deforestation, but I don't know whether it's possible to implement via rewrite rules. At the very least it seems hard.
On a completely unrelated note, here is a microoptimization I thought about: replace unboxed arrays with arrays from `primitive`. Indexing should be faster there, since those don't support slicing.
Thanks, I wasn't aware of the indexing overhead. I've never used primitive arrays, but I'll try them out when I have some time!
Deforestation is also a new subject for me, but I feel it would be complicated (and lead to complicated code?) to explain to the compiler how to fuse the two operations. I might be wrong though... and I would definitely like to see how this could be done :)
I see a 2.5% performance boost when using a Storable. Interesting!
Note that it is a breaking change, since `Seed` now has a Storable vector instead of an Unboxed one. (Edit: not a breaking change anymore, see below.)
Indexing of unboxed arrays has an inherent slowdown because of slicing support. It's done as `arr[off + i]` instead of `arr[i]`, so you have an extra addition. Actually I had arrays from `primitive` in mind. I'll try to compare performance of unboxed/storable/primitive.
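The extra addition can be seen directly in the two APIs. A minimal sketch (illustrative only; the values and the slice position are arbitrary): a slice of an unboxed vector is just an (offset, length) view over the same buffer, so every index goes through the offset, whereas `PrimArray` has no slicing and indexes the buffer directly.

```haskell
import qualified Data.Vector.Unboxed as U
import Data.Primitive.PrimArray (PrimArray, primArrayFromList, indexPrimArray)
import Data.Word (Word32)

v :: U.Vector Word32
v = U.fromList [0 .. 9]

-- A slice is just (offset, length) over the same buffer, so every
-- (U.!) on it compiles to  arr[off + i]  rather than  arr[i].
s :: U.Vector Word32
s = U.slice 4 4 v

-- PrimArray supports no slicing, so indexing is a plain  arr[i].
pa :: PrimArray Word32
pa = primArrayFromList [0 .. 9]

main :: IO ()
main = print (s U.! 0, indexPrimArray pa 4)  -- both read element 4 of a buffer
```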
Another approach for reducing the number of reads/writes to the state vector is to turn the generator into a monad `Word32 -> Word32 -> (# a, Word32, Word32 #)` and let the GHC optimizer deal with it. It would be interesting to compare performance, but it's a complete API breakage. Maybe that's a good thing.
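For concreteness, here is a boxed sketch of that monad. The real version would return the unboxed tuple `(# a, Word32, Word32 #)` to avoid allocation; the names `GenM`, `runGenM`, and `step` are hypothetical, and `step`'s arithmetic is a toy stand-in for the actual MWC step, not the real generator math.

```haskell
import Data.Word (Word32)

-- Threads the two Word32s of generator state through every action,
-- so GHC can (ideally) keep them in registers across binds.
newtype GenM a = GenM { runGenM :: Word32 -> Word32 -> (a, Word32, Word32) }

instance Functor GenM where
  fmap f (GenM g) = GenM $ \i c ->
    case g i c of (a, i', c') -> (f a, i', c')

instance Applicative GenM where
  pure a = GenM $ \i c -> (a, i, c)
  GenM gf <*> GenM ga = GenM $ \i c ->
    case gf i c of
      (f, i', c') -> case ga i' c' of
        (a, i'', c'') -> (f a, i'', c'')

instance Monad GenM where
  GenM g >>= k = GenM $ \i c ->
    case g i c of (a, i', c') -> runGenM (k a) i' c'

-- Toy "next Word32": NOT the real MWC recurrence, just a placeholder.
step :: GenM Word32
step = GenM $ \i c -> (i + c, i + 1, c)

result :: ((Word32, Word32), Word32, Word32)
result = runGenM (do a <- step; b <- step; pure (a, b)) 0 7

main :: IO ()
main = print result
</imports>
```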
I tried with arrays from `primitive`; I see a slight speedup, but it's unclear whether it's noise or not...
I kept the `Seed` type as it was before, i.e. with an unboxed vector, so it is not a breaking change anymore.
I finally got time to work on PRs.
I cherry-picked the changes for the `Gen` representation and ran benchmarks. The switch to primitive arrays improved performance by ~5% on my computer. A free 5% is nothing to sneeze at. But all this meddling with low-level arrays makes me nervous. I'm going to recheck everything and maybe write some tests. Then I'll push to master.
I also checked the generation time for `Word32` and `Word64`. If we say that the time for `Word32` is read + update + write, and for `Word64` it's read + 2·update + write (read and write correspond to the index and carry), then both update and (read+write)/2 take about 4 ns. We have a big performance problem on our hands by forcing GHC to store/read variables needlessly. The approach with a fold solves it in a special case. At the very least we need to expose enough internals in a safe manner so people can write such special-purpose folds if they need them.
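To spell out the arithmetic behind that estimate: the only difference between the two cost models is one extra update, so subtracting the `Word32` time from the `Word64` time isolates it, and the remainder splits between read and write. The timings below are illustrative stand-ins chosen to match the ~4 ns conclusion, not the actual benchmark numbers.

```haskell
t32, t64 :: Double
t32 = 12.0  -- hypothetical ns per Word32: read + update + write
t64 = 16.0  -- hypothetical ns per Word64: read + 2*update + write

-- The extra update is the only difference between the two models.
update :: Double
update = t64 - t32

-- What's left after removing the update, split across read and write.
readWrite :: Double
readWrite = (t32 - update) / 2

main :: IO ()
main = print (update, readWrite)
```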
I agree.
Another thing to consider (and maybe document for the user) is that with the `Gen` representation change, the array is pinned, so the GC won't be able to move it. That may or may not be a good thing, depending on the application, I guess...
I finally got around to measuring the impact of the different vector variants. Here is the distribution of run times: Unboxed, Primitive, and `PrimArray` from `primitive` perform identically, and Storable is about 5% slower, probably because of the extra pointer chase through `ForeignPtr`. I thought a primitive vector would lead to a faster build, but unboxed turned out to be slightly faster: 15.9 s vs 16.1 s. So the current vector backend is likely optimal.
I'll get to the `foldMUniforms` tomorrow.
I rebased the PR over current master. First of all, benchmarks: it does provide a nice ~25% speedup over `replicateM_ n (uniform gen)`.
What isn't so good: it provides a very specific primitive, iterating a function N times. Could it be generalized? One thing that comes to mind is unfolds. Maybe there's something that could usefully generalize both?
There's the obvious idea of turning the passing of `i` & `c` into a monad, but that turns the 25% speedup into a 10% slowdown, presumably from the state-carrying boilerplate. The loop should be generated inside the function so GHC can optimize it well.
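A sketch of what "generate the loop inside the function" means in practice: a strict local worker loop whose accumulators GHC's strictness analysis can unbox into registers, instead of threading state through monadic binds. `foldUniforms` and `stepWord32` are hypothetical names, and the step arithmetic is a toy placeholder, not the real MWC recurrence.

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.Word (Word32)

-- Toy stand-in for one generator step: NOT the real MWC math.
stepWord32 :: Word32 -> Word32 -> (Word32, Word32, Word32)
stepWord32 i c = (i + c, i + 1, c)

-- Strict local loop: `i` and `c` stay in registers for the whole run
-- instead of being stored to and read from the state vector each step.
foldUniforms :: Int -> (a -> Word32 -> a) -> a -> Word32 -> Word32 -> a
foldUniforms n f z i0 c0 = go n z i0 c0
  where
    go !k !acc !i !c
      | k <= 0    = acc
      | otherwise = case stepWord32 i c of
          (w, i', c') -> go (k - 1) (f acc w) i' c'

result :: Word32
result = foldUniforms 4 (+) 0 0 7  -- sums 7, 8, 9, 10 with the toy step

main :: IO ()
main = print result
```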
Hello @Shimuuar, reading this MR brings back memories from when I was developing my little console game, which was a lot of fun! In the meantime I have moved on to other personal projects (music, convolution reverbs, etc...), so I won't pursue the initial goal of merging this MR, but I hope it will be useful, some parts of it at least! Cheers, Olivier
Well, a 25% speedup is nothing to sneer at :). I think I'll release 0.15 without this PR and start updating statistics. Then I'll revisit this PR. Maybe I'll get some ideas in the meantime.
This is probably not mergeable as-is, because it's not in the spirit of the current API, but `foldMUniforms` could be used to implement a new fold-like function in the `Variate` class, to allow creating N random values that are consumed by a monadic accumulating function. Note that creating N numbers with `foldMUniforms` requires N+2 reads and writes to the state vector, whereas creating N numbers with `uniform` requires 3·N reads and writes. Benchmarks on my application show a speed-up, because random number generation is a bottleneck for me.
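The N+2 vs 3·N accounting can be sketched like this: read the state once before the loop, keep it in locals for all N draws, and write it back once at the end. This standalone model fakes the generator state as a plain pair rather than the real state vector, and the step arithmetic is a toy placeholder; only the `foldMUniforms` name matches the PR.

```haskell
import Data.Word (Word32)

-- Stand-in for the two state-vector slots (index, carry).
type FakeState = (Word32, Word32)

-- One conceptual read of the state before the loop and one write
-- after it: N + 2 state-vector touches in total, versus 3 per draw
-- (read, update, write) when each number is generated independently.
foldMUniforms :: Monad m
              => Int -> (a -> Word32 -> m a) -> a -> FakeState -> m (a, FakeState)
foldMUniforms n f z s0 = go n z s0
  where
    go 0 acc s      = pure (acc, s)         -- final write-back
    go k acc (i, c) = do
      let w = i + c                         -- toy stand-in for one MWC step
      acc' <- f acc w
      go (k - 1) acc' (i + 1, c)

result :: Maybe (Word32, FakeState)
result = foldMUniforms 3 (\a w -> Just (a + w)) 0 (0, 5)

main :: IO ()
main = print result
```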