thouis opened this issue 12 years ago
Attachment in Trac by trac user ilan, 2010-04-10: ziggurat.patch
Comment in Trac by @charris, 2010-04-10
We've had better ziggurat methods than those in Doornik for some years now, but there has been little push to get them into numpy. Quoting Bruce Carneal's benchmark summary:
Hi there.
I've completed timing and robustness tests that should allow apples-to-apples comparisons of the mtrand normal distribution possibilities. The short story goes like this:
0) One of the 'Harris' ziggurats, zig7, passes all TestU01 crush tests when running atop each of five PRNGs.
1) Depending on how fast the underlying PRNG is, zig7 outperforms Box-Muller by ~1.3x to ~8.5x.
2) Using gcc, mwc8222+zig7 outperforms mt19937+Box-Muller by 7x.
See you both at SciPy.
Bruce
And here are the details:
In what follows the uniform variate generators are:
lcg64
mwc8222
mt19937
mt19937_64
yarn5
And the normal distribution codes are:
trng - default normal distribution code in TRNG
boxm - Box-Muller, mtrand lookalike, remembers/uses 2nd value
zig7 - a 'Harris' ziggurat indexed by 7 bits
zig8 - a 'Harris' ziggurat indexed by 8 bits
zig9 - a 'Harris' ziggurat indexed by 9 bits
Here are the numbers in more detail:
# Timings from icc -O2 running on a 2.4 GHz Core 2
lcg64 trng: 6.52459e+06 ops per second
lcg64 boxm: 2.18453e+07 ops per second
lcg64 zig7: 1.80616e+08 ops per second
lcg64 zig8: 2.01865e+08 ops per second
lcg64 zig9: 2.05156e+08 ops per second
mwc8222 trng: 6.52459e+06 ops per second
mwc8222 boxm: 2.08787e+07 ops per second
mwc8222 zig7: 9.44663e+07 ops per second
mwc8222 zig8: 1.05326e+08 ops per second
mwc8222 zig9: 1.03478e+08 ops per second
mt19937 trng: 6.41112e+06 ops per second
mt19937 boxm: 1.64986e+07 ops per second
mt19937 zig7: 4.23762e+07 ops per second
mt19937 zig8: 4.52623e+07 ops per second
mt19937 zig9: 4.52623e+07 ops per second
mt19937_64 trng: 6.42509e+06 ops per second
mt19937_64 boxm: 1.93226e+07 ops per second
mt19937_64 zig7: 5.8762e+07 ops per second
mt19937_64 zig8: 6.17213e+07 ops per second
mt19937_64 zig9: 6.29146e+07 ops per second
yarn5 trng: 5.95781e+06 ops per second
yarn5 boxm: 1.19156e+07 ops per second
yarn5 zig7: 1.48945e+07 ops per second
yarn5 zig8: 1.54809e+07 ops per second
yarn5 zig9: 1.53201e+07 ops per second
# Timings from g++ -O2 running on a 2.4 GHz Core 2
lcg64 trng: 6.72163e+06 ops per second
lcg64 boxm: 1.50465e+07 ops per second
lcg64 zig7: 1.31072e+08 ops per second
lcg64 zig8: 1.48383e+08 ops per second
lcg64 zig9: 1.6036e+08 ops per second
mwc8222 trng: 6.64215e+06 ops per second
mwc8222 boxm: 1.44299e+07 ops per second
mwc8222 zig7: 8.903e+07 ops per second
mwc8222 zig8: 1.00825e+08 ops per second
mwc8222 zig9: 1.03478e+08 ops per second
mt19937 trng: 6.52459e+06 ops per second
mt19937 boxm: 1.28223e+07 ops per second
mt19937 zig7: 5.00116e+07 ops per second
mt19937 zig8: 5.41123e+07 ops per second
mt19937 zig9: 5.47083e+07 ops per second
mt19937_64 trng: 6.58285e+06 ops per second
mt19937_64 boxm: 1.42988e+07 ops per second
mt19937_64 zig7: 6.72164e+07 ops per second
mt19937_64 zig8: 7.39591e+07 ops per second
mt19937_64 zig9: 7.46022e+07 ops per second
yarn5 trng: 6.25144e+06 ops per second
yarn5 boxm: 8.93672e+06 ops per second
yarn5 zig7: 1.50465e+07 ops per second
yarn5 zig8: 1.57496e+07 ops per second
yarn5 zig9: 1.56038e+07 ops per second
You should contact Bruce Carneal (bcarneal@gmail.com) and see if you can get his code. It is in C++ but can probably be translated to C without too much trouble. If we are going the ziggurat direction, it would be nice to implement the exponential distribution also. As you can see, the overall speedup also depends on the underlying RNG, so if you can figure out a way of making choices available there, that will also help. Of course, for small sets of values, call overhead will dominate everything.
Thanks for looking into this problem, all it needs is for someone to take it in hand and push it through.
Comment in Trac by trac user ilan, 2010-04-10
Thanks for your quick response. Yes, the speedup depends on the underlying RNG, and the patch I propose just uses what is already there. The method is quite simple and already gives a more than 2x speedup. My feeling is that with faster underlying RNGs a lot more could be gained, but that is really a separate issue.
Writing the ziggurat code more generally, so that the exponential (and possibly other) distributions can be handled by the same code, might be possible, but I don't think it's worth the effort: the code is very small, and the tail distribution is different in the other cases.
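As a side note on why the tail is the part that differs: for the exponential distribution the tail is itself exponential, so the tail sample needs no rejection loop at all, unlike the normal case. A minimal sketch, reusing the existing rk_double call; the function name is illustrative, not from the patch:

#include <math.h>
#include "randomkit.h"               /* rk_state, rk_double */

/* Exponential(1) conditioned on X > R: the overshoot X - R is again
   Exponential(1), so the tail sample is just R plus a fresh exponential. */
static double zig_exp_tail(rk_state *state, double R)
{
    return R - log(rk_double(state));
}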
Comment in Trac by @charris, 2010-04-10
The 'Harris' ziggurat is mine, and the code is actually simpler than the Marsaglia version (Doornik). I also did the exponential (as did Marsaglia), which is even easier. Notice that the Marsaglia version actually uses an exponential to bound the tail. So do I, but I make it part of the ziggurat, which simplifies matters. It takes maybe ten lines of python to compute the needed constants for the various ziggurats. Note that in general you can get about a 4x speedup vs. Box-Muller for the MT RNG.
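As a rough illustration of what that constants computation involves (not Chuck's variant, which folds the exponential tail bound into the ziggurat itself, and not the patch's init code either), here is the standard Marsaglia/Doornik-style recurrence for a 128-layer normal ziggurat, transcribed into C. The layer count and the value of R are the commonly published ones; all names are illustrative.

#include <math.h>

#define ZIG_C 128                 /* number of layers (illustrative)           */
#define ZIG_R 3.442619855899      /* commonly published tail start, 128 layers */

static double zig_x[ZIG_C + 1];   /* layer widths: zig_x[1] = R, zig_x[ZIG_C] = 0 */
static double zig_r[ZIG_C];       /* acceptance ratios zig_x[i+1] / zig_x[i]      */

static void zig_init(void)
{
    /* Unnormalized density f(x) = exp(-x*x/2).  Every layer has area v,
       where v = R*f(R) + (area of the tail beyond R).                     */
    double f = exp(-0.5 * ZIG_R * ZIG_R);
    double v = ZIG_R * f + 1.2533141373155003 * erfc(ZIG_R / sqrt(2.0));
    int i;                        /* 1.2533... = sqrt(pi/2); erfc is C99    */

    zig_x[0] = v / f;             /* pseudo-width of the base (tail) layer  */
    zig_x[1] = ZIG_R;
    zig_x[ZIG_C] = 0.0;
    for (i = 2; i < ZIG_C; i++) {
        /* choose zig_x[i] so that layer i-1 has area
           zig_x[i-1] * (f(zig_x[i]) - f(zig_x[i-1])) == v */
        zig_x[i] = sqrt(-2.0 * log(v / zig_x[i - 1] + f));
        f = exp(-0.5 * zig_x[i] * zig_x[i]);
    }
    for (i = 0; i < ZIG_C; i++)
        zig_r[i] = zig_x[i + 1] / zig_x[i];
}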
Comment in Trac by @charris, 2010-04-10
A few additional comments.
1) The random variable
u = 2.0 * rk_double(state) - 1.0
isn't evenly distributed between positive and negative. I don't think that makes any difference in practice given the other roundoff errors, but it is worth noting because the steps in the ziggurat are symmetric about zero. Accepting the roundoff errors, it can be fixed by shifting and scaling the ziggurat layer to [0, 1-eps], using the positive double, and shifting and scaling the return value. This also avoids the call to fabs and saves one floating-point multiplication.
2) The line

/* first try the rectangular boxes */
if (fabs(u) < s_adZigR[i]) return u * s_adZigX[i];

uses a double in the comparison. One of the virtues of the original ziggurat was using integers here, which is most likely faster even on modern hardware.
3) It would be nice to move the initialization to load time so that there is no need to check it on every call. That isn't likely to make much difference given the overhead of the call to the RNG, but it looks a bit out of place.
4) The function zig_NormalTail can be removed if an improved ziggurat is used.
5) You can avoid the occasional extra call to rk_random if you handle the double conversion yourself. There are enough bits left over from the double to deal with the ziggurat layers. Having the pre-conversion integer form will also allow you to use the integer comparisons mentioned in 2). (A rough sketch along these lines appears after this comment.)
Chuck
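To make points 1), 2) and 5) concrete, here is a rough sketch of one way such an inner loop could look. It is not code from the patch: the table names, the 64-bit generator call and the exact bit layout are hypothetical. A single integer draw supplies the layer index, the sign and the significand; the rectangle test is done on integers against a precomputed threshold, and the conversion to double happens only for accepted samples. Instead of the [0, 1-eps] shift described in 1), this sketch spends one bit of the draw on the sign, which likewise keeps the two halves symmetric and avoids the fabs call.

#include <stdint.h>

#define ZIG_C 128

/* zig_k[i] = floor(2^53 * x[i+1] / x[i]): integer acceptance threshold.
   zig_w[i] = x[i] / 2^53: scale mapping the 53-bit integer to a double in
   [0, x[i]).  Both tables are assumed to be precomputed at init time.     */
extern uint64_t zig_k[ZIG_C];
extern double   zig_w[ZIG_C];

extern uint64_t rng_next64(void *state);  /* hypothetical 64-bit uniform source */
/* Wedge test (exp comparisons) and tail sampling, hit only a few percent of
   the time; omitted here. */
extern double zig_slowpath(void *state, int layer, uint64_t j, int sign);

double zig_normal(void *state)
{
    uint64_t r     = rng_next64(state);
    int      layer = (int)(r & (ZIG_C - 1));  /* low 7 bits pick the layer     */
    int      sign  = (int)((r >> 7) & 1);     /* one bit for the sign          */
    uint64_t j     = r >> 11;                 /* top 53 bits: the "mantissa"   */

    if (j < zig_k[layer]) {                   /* integer rectangle test        */
        double x = (double)j * zig_w[layer];  /* convert to double only now    */
        return sign ? -x : x;
    }
    return zig_slowpath(state, layer, j, sign);
}

With a 32-bit generator the same idea applies with fewer mantissa bits; the point is simply that the bits for the layer and the sign come for free from the draw that also produces the significand.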
Comment in Trac by @charris, 2010-06-13
Original ticket: http://projects.scipy.org/numpy/ticket/1450, reported 2010-04-10 by trac user ilan, assigned to @charris.
I've written a patch which replaces the Box-Muller transform in numpy/random/mtrand/randomkit.c with the faster ziggurat method. The patch also includes updates to docstrings with the relevant references:
My patch is based on the support code which J. Doornik has made available on his web page (first reference). I've contacted him, and he has given permission to include his code in numpy on the condition that his paper is referenced (which the patch does).
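For context, the Doornik-style sampler that the patch is modelled on looks roughly like the sketch below. This is a paraphrase rather than the patch itself: rk_double and rk_random are the existing randomkit calls, the tables s_adZigX and s_adZigR are assumed to have been filled in at initialization by the usual recurrence, the layer count and tail constant are the commonly published 128-layer values, and initialization and error handling are omitted.

#include <math.h>
#include "randomkit.h"                /* rk_state, rk_double, rk_random */

#define ZIG_C 128                     /* layer count (illustrative)      */
#define ZIG_R 3.442619855899          /* start of the right tail         */

/* Layer widths and acceptance ratios, built at init time (not shown):
   s_adZigX[1] = ZIG_R, s_adZigX[ZIG_C] = 0,
   s_adZigR[i] = s_adZigX[i+1] / s_adZigX[i].                            */
extern double s_adZigX[ZIG_C + 1];
extern double s_adZigR[ZIG_C];

/* Marsaglia's tail method: a normal variate conditioned on |x| > ZIG_R. */
static double zig_normal_tail(rk_state *state, int negative)
{
    double x, y;
    do {
        x = log(rk_double(state)) / ZIG_R;     /* x <= 0 */
        y = log(rk_double(state));
    } while (-2.0 * y < x * x);
    return negative ? x - ZIG_R : ZIG_R - x;
}

double zig_gauss(rk_state *state)
{
    for (;;) {
        double u = 2.0 * rk_double(state) - 1.0;        /* uniform in (-1, 1)  */
        int i = (int)(rk_random(state) & (ZIG_C - 1));  /* pick a random layer */

        /* first try the rectangular boxes */
        if (fabs(u) < s_adZigR[i])
            return u * s_adZigX[i];

        /* bottom layer: sample from the tail beyond ZIG_R */
        if (i == 0)
            return zig_normal_tail(state, u < 0.0);

        /* wedge: compare against the density, retry on rejection */
        {
            double x  = u * s_adZigX[i];
            double f0 = exp(-0.5 * (s_adZigX[i] * s_adZigX[i] - x * x));
            double f1 = exp(-0.5 * (s_adZigX[i + 1] * s_adZigX[i + 1] - x * x));
            if (f0 + rk_double(state) * (f1 - f0) < 1.0)
                return x;
        }
    }
}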
The patch was written against http://svn.scipy.org/svn/numpy/tags/1.4.0, and I've just checked to make sure it can still be applied to the trunk (revision 8324). I've tested the patch on Windows, Mac OS X, Linux and Solaris.
The speedup is a bit over 2x when creating a large number of standard normal distributed random numbers. To verify that the method works and produces good random numbers, I have:
Feel free to hammer the patch. I hope this patch will make its way into numpy soon.