hpc4cmb / toast

Time Ordered Astrophysics Scalable Tools

Noise simulations do not use streamed RNG #6

Closed: tskisner closed this issue 8 years ago

tskisner commented 8 years ago

Currently, OpSimNoise just uses a different seed for each detector in an observation, which does not guarantee statistical independence. We should wrap Random123 with Python so we can use it.

tskisner commented 8 years ago

Here is a link to the library we should add to the internal libctoast: https://www.deshawresearch.com/resources_random123.html

tskisner commented 8 years ago

It looks like the reikna package may already have a python implementation of the stateless algorithm used in random123:

https://github.com/fjarri/reikna/blob/develop/reikna/cbrng/cbrng.py

tskisner commented 8 years ago

I spent some time looking at the source code for random123, reikna, numpy.random, and our own legacy code in MOAT (a custom implementation of a different streamed RNG algorithm, L'Ecuyer's mrg32k3a). The requirements we have are:

  1. For a given seed (starting state), we must be able to specify a 64-bit "stream index", where the values drawn from each stream are independent. Many off-the-shelf generators support only a 32-bit stream index, and we have hit that limit in the past.
  2. For a given stream index, we need to be able to draw a large number of samples from that stream.
  3. We need to be able to seek ahead to an arbitrary spot in a stream.

For the mrg32k3a algorithm, the seek ahead is done by computing an operator based on the number of samples to skip and applying that to the state. This skip-ahead is expensive, but we don't need to do that operation very often. The mrg32k3a generator has a period of ~2^191, so I had divided that into 2^64 streams that were equally spaced across the full span of the generator.

The benefit of the newer Counter-Based RNG (CBRNG) algorithms is that they have very cheap skipping to arbitrary locations.
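
To make the contrast concrete: in a counter-based generator each output is a pure function of a key and a counter, so seeking to sample i of stream s just means evaluating at that counter value, with no state to advance. A minimal toy sketch (using a splitmix64-style mixing step as a stand-in, not any of the algorithms under discussion):

```c
#include <stdint.h>
#include <stdio.h>

/* Toy counter-based generator: every output is a pure function of
   (key, counter), so "skipping ahead" is just a matter of choosing the
   counter value.  This is a splitmix64-style mix for illustration only;
   a production choice would be something like Threefry or Philox. */
static uint64_t toy_cbrng(uint64_t key, uint64_t counter) {
    uint64_t z = key ^ (counter + 0x9e3779b97f4a7c15ULL);
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

int main(void) {
    uint64_t stream = 12345;          /* e.g. detector / observation index */
    uint64_t sample = 1000000000ULL;  /* arbitrary offset into the stream  */
    /* O(1) "skip-ahead": just evaluate at the desired counter value. */
    printf("%llx\n", (unsigned long long)toy_cbrng(stream, sample));
    return 0;
}
```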

Looking at the random123 library, it is filled with lots of macros and CUDA stuff. It is MIT licensed, so we could potentially redistribute it and wrap it with Cython. Looking at the reikna code, we could either try to rip out the CBRNG and supporting code, or just require it as a dependency. It has a lot of stuff that we don't need, so it seems like overkill to require it. We could also just grab the source code from MOAT and wrap that implementation with Cython. Obviously that would not provide the benefits of better algorithms, but it would be fairly straightforward.

keskitalo commented 8 years ago

Do we have any idea about the relative efficiency of these algorithms?


tskisner commented 8 years ago

The only information I have is about the performance of the code I wrote. In that case, the generation of the Gaussian randoms was subdominant compared to the cost of the FFTs needed to generate noise timestreams.

Have we established whether those other algorithms can meet the requirements of a 64-bit integer stream index and drawing more than 2^32 samples from each stream?


jsavarit commented 8 years ago

Of the generators in the random123 library, the one best suited to our usage seems to be Threefry. If we have AES hardware support, the ARS generator could be a good option as well.

For Threefry-2x64, each call takes a 2x64-bit counter and a 2x64-bit key and returns two 64-bit random numbers.

In the same fashion, Threefry-4x64 returns four 64-bit random numbers at each call, which could help increase overall performance.

Performance comparison with conventional RNGs:

According to Table 2 of the Random123 paper, the Random123 generators perform better than MRG32k3a on the CPU. The implementation is indeed GPGPU-oriented, since that is where the speed-up is most significant.

These counter-based RNGs are derived from cryptographic algorithms: they are lighter versions of the original ciphers, with a simplified cipher and a smaller number of rounds. To reach even higher performance, we can decrease the number of rounds further (e.g. Threefry-2×64-20 performs 20 rounds while Threefry-2×64-13 performs only 13), at the cost of a smaller "safety margin".
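
As a concrete illustration, here is a minimal sketch of calling Threefry-2x64 through the Random123 C API (this assumes the Random123 headers are on the include path; the `_R` variant selects the number of rounds explicitly):

```c
#include <stdio.h>
#include <stdint.h>
#include <Random123/threefry.h>

int main(void) {
    threefry2x64_key_t key = {{0xdeadbeefULL, 0}};   /* "seed"             */
    threefry2x64_ctr_t ctr = {{0, 42}};              /* position / stream  */

    /* Default number of rounds (20); each call returns two 64-bit words. */
    threefry2x64_ctr_t r = threefry2x64(ctr, key);
    printf("%llx %llx\n", (unsigned long long)r.v[0],
                          (unsigned long long)r.v[1]);

    /* Reduced-round variant (faster, smaller safety margin), e.g. 13 rounds. */
    threefry2x64_ctr_t r13 = threefry2x64_R(13, ctr, key);
    printf("%llx %llx\n", (unsigned long long)r13.v[0],
                          (unsigned long long)r13.v[1]);
    return 0;
}
```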

Generating a Gaussian distribution

~~I suppose we need a normal distribution, is that correct? Do we use a uniform distribution as well?~~ Since Threefry returns two random numbers per call, the Box-Muller transform is well suited for our purpose (it always consumes two uniform random numbers and produces two normal deviates at a time), even though it might be a bit less efficient than the Ziggurat algorithm (which is based on tables, so it pays off when we need a high number of random numbers). Box-Muller requires calls to sincos(), ln() and sqrt().
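
For reference, a minimal sketch of the Box-Muller step, assuming the two 64-bit words come from e.g. a Threefry call; the to_unit() helper is illustrative (not part of Random123) and maps a 64-bit integer to a uniform double in (0, 1] so that the logarithm is safe:

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Map a 64-bit word to a uniform double in (0, 1]: keep the top 53 bits
   and add 1 so the value is never zero. */
static double to_unit(uint64_t x) {
    return ((double)(x >> 11) + 1.0) / 9007199254740992.0; /* 2^53 */
}

/* Box-Muller: two uniform deviates in, two standard normal deviates out. */
static void box_muller(uint64_t r0, uint64_t r1, double *z0, double *z1) {
    const double two_pi = 6.283185307179586;
    double radius = sqrt(-2.0 * log(to_unit(r0)));
    double angle  = two_pi * to_unit(r1);
    *z0 = radius * cos(angle);
    *z1 = radius * sin(angle);
}

int main(void) {
    /* Stand-ins for the two 64-bit words returned by one generator call. */
    double z0, z1;
    box_muller(0x123456789abcdef0ULL, 0x0fedcba987654321ULL, &z0, &z1);
    printf("%f %f\n", z0, z1);
    return 0;
}
```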

Have any of these algorithms already been implemented in TOAST? How was the uniform-to-normal conversion dealt with?

Choice of the counter and key

To begin with, we can use the same key all the time, so that we generate numbers from the same stream (with a period of 2^128, roughly 10^38). With the counter, we can define the first 64-bit component based on the current subset index, and the second 64-bit component as the current position within that subset. This gives us up to 2^64 different subsets of 2^64 random numbers each, which after talking with Julian fits our requirements (e.g. CMB-S4). Alternatively, we can use the keys to generate parallel streams, and use one of these streams for each subset.
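
A small sketch of that counter/key layout using the Random123 Threefry-2x64 types (the function name here is illustrative, not existing TOAST code):

```c
#include <stdio.h>
#include <stdint.h>
#include <Random123/threefry.h>

/* Draw one pair of 64-bit words: the key carries a fixed seed, the first
   counter word selects the subset (stream), and the second gives the
   position within that subset. */
static threefry2x64_ctr_t draw_pair(uint64_t seed, uint64_t subset,
                                    uint64_t position) {
    threefry2x64_key_t key = {{seed, 0}};
    threefry2x64_ctr_t ctr = {{subset, position}};
    return threefry2x64(ctr, key);
}

int main(void) {
    threefry2x64_ctr_t r = draw_pair(1234, 42, 1000000);
    printf("%llx %llx\n", (unsigned long long)r.v[0],
                          (unsigned long long)r.v[1]);
    return 0;
}
```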

Implementation and Cython wrapping

I'm currently adding an interface function to the R123 library, customized for our requirements, so that we can wrap it from Python. The function can return a normal distribution (by performing an implicit Box-Muller transform), or we can deal with this conversion at another level. Is this approach correct?
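
As a declaration-only sketch of the shape such an interface function might take (the name and exact signature here are placeholders, not the function that ended up in libctoast):

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder prototype: fill `data` with n standard normal deviates.
 * The key words carry the seed/stream selection, the counter words the
 * starting sample position; internally, each pair of outputs would come
 * from one Threefry-2x64 call followed by a Box-Muller transform. */
void ctoast_rng_dist_normal(uint64_t key1, uint64_t key2,
                            uint64_t counter1, uint64_t counter2,
                            size_t n, double *data);
```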

PS: I'm indeed receiving the notification emails.

tskisner commented 8 years ago

Thanks for the overview, Jean. It sounds like the counter in these algorithms is indeed sufficient to store our stream index and sample index. It is also great that the performance is better than mrg32k3a.

As for the interface, I just wanted to understand your proposal. I think what you are saying is that you are writing a C interface that we can put in libctoast, and which provides a simpler API than the random123 library (i.e. it is just a way to create the Threefry-2x64 generator and draw samples). Correct? Then we could call that from higher-level C/Cython functions, and we can also make a Python interface to it.

I agree with this general approach. It will require changes to the ctoast library, the setup.py file, and also a new Python class which wraps the C generator and has methods that call the underlying C code to draw samples (from either a uniform or a normal distribution). For the Box-Muller implementation, if we are already linking to random123, that library may already have the code for that. I also wrote a C/C++ implementation in the MOAT package. I'll move that old repo into hpc4cmb and give you access.

tskisner commented 8 years ago

This is fully implemented as of 43d346636423136117872e2044383c720d6a9d34.