OlivierSohn opened 6 years ago
Essentially it's about avoiding two stores into the array and then reading them back immediately. On the surface it looks really similar to deforestation, but I don't know whether it's possible to implement via rewrite rules. At the very least it seems hard.
On a completely unrelated note, here is a microoptimization I thought about: replace unboxed arrays with arrays from `primitive`. Indexing should be faster there, since those don't support slicing.
Thanks, I wasn't aware of the indexing overhead. I've never used primitive arrays, but I'll try them out when I have some time!
Deforestation is also a new subject for me, but I feel it would be complicated (and lead to complicated code?) to explain to the compiler how to fuse the two operations. I might be wrong though... and I would definitely like to see how this could be done :)
I see a 2.5% performance boost when using a Storable. Interesting!
Note that it is a breaking change, since `Seed` now has a Storable vector instead of an Unboxed one. (Edit: not a breaking change anymore, see below.)
Indexing of unboxed arrays has an inherent slowdown because of slicing support. It's done as `arr[off + i]` instead of `arr[i]`, so you have an extra addition. Actually I had arrays from `primitive` in mind. I'll try to compare performance of unboxed/storable/primitive.
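The extra addition can be seen directly in the two APIs. A minimal sketch (illustrative only; the values and the slice position are arbitrary): a slice of an unboxed vector is just an (offset, length) view over the same buffer, so every index goes through the offset, whereas `PrimArray` has no slicing and indexes the buffer directly.

```haskell
import qualified Data.Vector.Unboxed as U
import Data.Primitive.PrimArray (PrimArray, primArrayFromList, indexPrimArray)
import Data.Word (Word32)

v :: U.Vector Word32
v = U.fromList [0 .. 9]

-- A slice is just (offset, length) over the same buffer, so every
-- (U.!) on it compiles to  arr[off + i]  rather than  arr[i].
s :: U.Vector Word32
s = U.slice 4 4 v

-- PrimArray supports no slicing, so indexing is a plain  arr[i].
pa :: PrimArray Word32
pa = primArrayFromList [0 .. 9]

main :: IO ()
main = print (s U.! 0, indexPrimArray pa 4)  -- both read element 4 of a buffer
```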
Another approach for reducing the number of reads/writes to the state vector is to turn the generator into a monad `Word32 -> Word32 -> (# a, Word32, Word32 #)` and let the GHC optimizer deal with it. It would be interesting to compare performance, but it's a complete API breakage. Maybe that's a good thing.
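For concreteness, here is a boxed sketch of that monad. The real version would return the unboxed tuple `(# a, Word32, Word32 #)` to avoid allocation; the names `GenM`, `runGenM`, and `step` are hypothetical, and `step`'s arithmetic is a toy stand-in for the actual MWC step, not the real generator math.

```haskell
import Data.Word (Word32)

-- Threads the two Word32s of generator state through every action,
-- so GHC can (ideally) keep them in registers across binds.
newtype GenM a = GenM { runGenM :: Word32 -> Word32 -> (a, Word32, Word32) }

instance Functor GenM where
  fmap f (GenM g) = GenM $ \i c ->
    case g i c of (a, i', c') -> (f a, i', c')

instance Applicative GenM where
  pure a = GenM $ \i c -> (a, i, c)
  GenM gf <*> GenM ga = GenM $ \i c ->
    case gf i c of
      (f, i', c') -> case ga i' c' of
        (a, i'', c'') -> (f a, i'', c'')

instance Monad GenM where
  GenM g >>= k = GenM $ \i c ->
    case g i c of (a, i', c') -> runGenM (k a) i' c'

-- Toy "next Word32": NOT the real MWC recurrence, just a placeholder.
step :: GenM Word32
step = GenM $ \i c -> (i + c, i + 1, c)

result :: ((Word32, Word32), Word32, Word32)
result = runGenM (do a <- step; b <- step; pure (a, b)) 0 7

main :: IO ()
main = print result
</imports>
```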
I tried with arrays from `primitive`; I see a slight speedup, but it's unclear whether it's noise or not...
I kept the `Seed` type as it was before, i.e. with an unboxed vector, so it is not a breaking change anymore.
I finally got time to work on PRs.
I cherry-picked the changes for the `Gen` representation and ran benchmarks. The switch to primitive arrays improved performance by ~5% on my computer. A free 5% is nothing to sneeze at. But all this meddling with low-level arrays makes me nervous. I'm going to recheck everything and maybe write some tests. Then I'll push to master.
I also checked the generation time for `Word32` and `Word64`. If we say that the time for `Word32` is read + update + write, and for `Word64` it's read + 2·update + write (read and write correspond to the index and carry), then both update and (read+write)/2 take about 4 ns. We have a big performance problem on our hands by forcing GHC to store/read variables needlessly. The approach with a fold solves it in a special case. At the very least we need to expose enough internals in a safe manner so people can write such special-purpose folds if they need them.
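To spell out the arithmetic behind that estimate: the only difference between the two cost models is one extra update, so subtracting the `Word32` time from the `Word64` time isolates it, and the remainder splits between read and write. The timings below are illustrative stand-ins chosen to match the ~4 ns conclusion, not the actual benchmark numbers.

```haskell
t32, t64 :: Double
t32 = 12.0  -- hypothetical ns per Word32: read + update + write
t64 = 16.0  -- hypothetical ns per Word64: read + 2*update + write

-- The extra update is the only difference between the two models.
update :: Double
update = t64 - t32

-- What's left after removing the update, split across read and write.
readWrite :: Double
readWrite = (t32 - update) / 2

main :: IO ()
main = print (update, readWrite)
```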
I agree.
Another thing to consider (and maybe document for the user) is that with the `Gen` representation change, the array is pinned, so the GC won't be able to move it. That may or may not be a good thing, depending on the application, I guess...
I finally got around to measuring the impact of the different vector variants. Here is the distribution of run times: Unboxed, Primitive, and `PrimArray` from `primitive` perform identically, and Storable is about 5% slower, probably because of the extra pointer chase through `ForeignPtr`. I thought a primitive vector would lead to a faster build, but unboxed turned out to be slightly faster: 15.9 s vs 16.1 s. So the current vector backend is likely optimal.
I'll get to the `foldMUniforms` tomorrow.
I rebased the PR over current master. First of all, benchmarks: it does provide a nice ~25% speedup over `replicateM_ n (uniform gen)`.
What isn't so good: it provides a very specific primitive, iterating a function N times. Could it be generalized? One thing that comes to mind is unfolds. Maybe there's something that could usefully generalize both?
There's the obvious idea of turning the passing of `i` & `c` into a monad, but that turns the 25% speedup into a 10% slowdown, presumably from the state-carrying boilerplate. The loop should be generated inside the function so GHC can optimize it well.
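A sketch of what "generate the loop inside the function" means in practice: a strict local worker loop whose accumulators GHC's strictness analysis can unbox into registers, instead of threading state through monadic binds. `foldUniforms` and `stepWord32` are hypothetical names, and the step arithmetic is a toy placeholder, not the real MWC recurrence.

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.Word (Word32)

-- Toy stand-in for one generator step: NOT the real MWC math.
stepWord32 :: Word32 -> Word32 -> (Word32, Word32, Word32)
stepWord32 i c = (i + c, i + 1, c)

-- Strict local loop: `i` and `c` stay in registers for the whole run
-- instead of being stored to and read from the state vector each step.
foldUniforms :: Int -> (a -> Word32 -> a) -> a -> Word32 -> Word32 -> a
foldUniforms n f z i0 c0 = go n z i0 c0
  where
    go !k !acc !i !c
      | k <= 0    = acc
      | otherwise = case stepWord32 i c of
          (w, i', c') -> go (k - 1) (f acc w) i' c'

result :: Word32
result = foldUniforms 4 (+) 0 0 7  -- sums 7, 8, 9, 10 with the toy step

main :: IO ()
main = print result
```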
Hello @Shimuuar, reading this MR brings back memories from when I was developing my little console game, which was a lot of fun! In the meantime I have moved on to other personal projects (music, convolution reverbs, etc...), so I won't pursue the initial goal of merging this MR, but I hope it will be useful, some parts of it at least! Cheers, Olivier
Well, a 25% speedup is nothing to sneer at :). I think I'll release 0.15 without this PR and start updating statistics. Then I'll revisit this PR. Maybe I'll get some ideas in the meantime.
This is probably not mergeable as-is, because it's not in the spirit of the current API, but `foldMUniforms` could be used to implement a new fold-like function in the `Variate` class, to allow creating N random values that are consumed by a monadic accumulating function. Note that creating N numbers with `foldMUniforms` requires N+2 reads and writes to the state vector, whereas creating N numbers with `uniform` requires 3·N reads and writes. Benchmarks on my application show a speed-up, because random number generation is a bottleneck for me.
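The N+2 vs 3·N accounting can be sketched like this: read the state once before the loop, keep it in locals for all N draws, and write it back once at the end. This standalone model fakes the generator state as a plain pair rather than the real state vector, and the step arithmetic is a toy placeholder; only the `foldMUniforms` name matches the PR.

```haskell
import Data.Word (Word32)

-- Stand-in for the two state-vector slots (index, carry).
type FakeState = (Word32, Word32)

-- One conceptual read of the state before the loop and one write
-- after it: N + 2 state-vector touches in total, versus 3 per draw
-- (read, update, write) when each number is generated independently.
foldMUniforms :: Monad m
              => Int -> (a -> Word32 -> m a) -> a -> FakeState -> m (a, FakeState)
foldMUniforms n f z s0 = go n z s0
  where
    go 0 acc s      = pure (acc, s)         -- final write-back
    go k acc (i, c) = do
      let w = i + c                         -- toy stand-in for one MWC step
      acc' <- f acc w
      go (k - 1) acc' (i + 1, c)

result :: Maybe (Word32, FakeState)
result = foldMUniforms 3 (\a w -> Just (a + w)) 0 (0, 5)

main :: IO ()
main = print result
```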