Noise default seed discussion

marcobortolami commented 1 year ago

Hi:) I open this Issue to discuss about the seed used in the random number generator if it is not specified when, for example, the user wants to produce noise TODs.

Description of my concern In my opinion:

if the seed is not specified and the user runs two times a script producing noise timelines, then the noise produced should be different.
if the seed is specified and the user runs two times a script producing noise timelines, then the noise produced should be the same (two runs with the same seed of course).

However, with the current version of the code, depending on which (high or low) level function the user calls, the noise timelines produced without specifying a seed are different or equal for two runs of the same code. I will make myself more clear with the following examples.

CASE 1: high level function called, no seed specified

import litebird_sim as lbs
import numpy as np
import matplotlib.pylab as plt
from astropy import units, time

sim = lbs.Simulation(
    start_time=time.Time("2025-01-01T00:00:00"),
    duration_s=1 * units.minute.to("s"),
)

detector = lbs.DetectorInfo(
    name="foo",
    sampling_rate_hz=10.0,
    bandcenter_ghz=200.0,
    net_ukrts=50.0,
    fknee_mhz=20.0,
    fmin_hz=1e-05,
    alpha=1.0,
)

sim.create_observations(
    detectors=[detector],
)

sim.add_noise('white')

times = sim.observations[0].get_times()-sim.observations[0].start_time.cxcsec

plt.plot(times,sim.observations[0].tod[0,:])
plt.xlabel("Time [s]")
plt.ylabel("Signal [K]")

if you run the code above two times, the noise TODs are the same. This happens because here sim.random is set to None in the constructor of Simulation, but a few lines later here the function self.init_random() is called and two lines later you seed that the seed 12345 is used. Then, in sim.add_noise() we have

if random is None:
    random = self.random

and this is passed to lbs.add_noise_to_observations().

CASE 2: low level function called, no seed specified

import litebird_sim as lbs
import numpy as np
import matplotlib.pylab as plt
from astropy import units, time

sim = lbs.Simulation(
    start_time=time.Time("2025-01-01T00:00:00"),
    duration_s=1 * units.minute.to("s"),
)

detector = lbs.DetectorInfo(
    name="foo",
    sampling_rate_hz=10.0,
    bandcenter_ghz=200.0,
    net_ukrts=50.0,
    fknee_mhz=20.0,
    fmin_hz=1e-05,
    alpha=1.0,
)

(obs,) = sim.create_observations(
    detectors=[detector],
    tods=[lbs.TodDescription(name="tod",description="TOD",dtype=np.float32)],
)

lbs.add_noise_to_observations(
    obs,
    'white',
    scale=1,
)

times = obs.get_times()-obs.start_time.cxcsec

plt.plot(times,obs.tod[0,:])
plt.xlabel("Time [s]")
plt.ylabel("Signal [K]")

if you run the code above two times, the noise TODs are different. This happens because if you do not specify the random parameter, it remains None until add_white_noise (or add_one_over_f_noise()), where we have

if random is None:
    random = np.random.default_rng()

Proposal and questions

In sim.add_noise(), if the seed is not specified, use np.random.default_rng() instead of self.random. In this way the behaviour of the high level function is the same as the low level ones (no seed --> no reproducibility). I can take care of the modification, if you are fine with this.
What to do for the seed in sim.init_random? Shall we leave 12345 or remove the default so the user must specify a value? Do we really need to initialize the random number generator in the Simulation constructor?
Add a Noise section to the documentation, where we describe the models used for the noises and talk about the seed, saying that if you do not specify it, you do not have reproducibility, and if you want reproducibility you must call sim.init_random specifying the seed. This could be useful for example if we add more noise models, to track the changes and make sure that the users know what they are producing without opening the code. I can take care of this as well.

I will wait for a feedback before proceeding with the PRs :)

mreineck commented 1 year ago

I don't have a good answer for all of the questions, but from earlier experience I strongly suggest to force users to specify a random seed, and not assume some default behaviour if it is not specified. Any "default" behaviour that is not totally obvious to all users is dangerous.

marcobortolami commented 1 year ago

Thanks for the feedback. You mean, for example, that an error should be raised if the user did not specify a random seed (or random number generator) in the script when he/she calls a function where the random number generator is needed?

mreineck commented 1 year ago

Looking at the current implementation, I think the most explicit way would be to remove Simulation.random (it is used only in this one place so far, as far as I can see), and require that the user always passes an valid RNG to add_noise.

There are a few open points, however:

is the RNG's internal state updated in place? I.e. if I call ...add_noise(random=my_rng), has th state of my_rng changed after the call? It would be important that this happens.
what kind of guarantee do we give when MPI is active? How do we generate noise time streams when the corresponding TOD is distributed over several tasks? Do we guarantee that the noise (given the same RNG) will be identical independently of the number of MPI tasks?

Sorry, the combination of random numbers and MPI is always a really nasty can of worms...

ziotom78 commented 1 year ago

Hi all,

Thanks @marcobortolami for having spotted this inconsistency!

I don't have a good answer for all of the questions, but from earlier experience I strongly suggest to force users to specify a random seed, and not assume some default behaviour if it is not specified. Any "default" behaviour that is not totally obvious to all users is dangerous.

I strongly support this, even at the price of introducing a breaking change. It's much better that the user knows whether or not they are going to get the same noise every time they re-run their script.

what kind of guarantee do we give when MPI is active? How do we generate noise time streams when the corresponding TOD is distributed over several tasks? Do we guarantee that the noise (given the same RNG) will be identical independently of the number of MPI tasks?

The reason why I added the method init_random to the Simulation class was exactly to create a RNG that works nicely with MPI processes. In other words, if you do the following:

# This script is meant to be run using `mpirun` with N processes
sim = Simulation(…)
sim.init_random(123) # Use the same seed for every MPI process
tod = sim.random.randn(100_000)

then every MPI process will not generate the same numbers in tod, because every process gets an orthogonal generator, even if the seed is fixed.

I think the most explicit way would be to remove Simulation.random (it is used only in this one place so far, as far as I can see), and require that the user always passes an valid RNG to add_noise

No, I would keep Simulation.init_random() for the reason stated above: it was meant to prevent common MPI-related bugs. Instead, I would do the following:

Prepend a seed argument without defaults to Simulation.__init__, so that you are forced to initialize it either with
```
sim = Simulation(random_seed=12345, …)
```
or with
```
# Initialize the RNG with a random seed
sim = Simulation(random_seed=None, …)
```
But, as I stated above, there should be no default argument for random_seed. Older codes will crash complaining for the missing parameter random_seed, but it should be easy to upgrade them.
Similarly, we should remove the default for add_white_noise and add_noise, and we should add in the docstring that the user should usually pass sim.random, unless there are specific needs not covered by the RNG created by Simulation.init_random(). On the other hand, Simulation.add_noise can stay as it is (default is None), because it automatically picks the generator initialized by Simulation.init_random().
If there are other low-level functions that need a random number generator (gain fluctuations? bandpasses?), they should add an explicit parameter asking to pass the RNG. High-level wrappers in the Simulation class can keep the parameter optional, with the default being to use sim.random.

What do you think? Does it look reasonable to you?

mreineck commented 1 year ago

Thanks @ziotom78 for the explanation about the MPI initialization of the RNG! I fully agree that this will avoid generating the same random number sequences on different tasks.

However, as far as I understand, this approach cannot solve the problem of having reproducible results when the number of MPI tasks is varied. Let's assume we want to generate a single TOD of white noise. Using one MPI task, all the noise will be generated by the RNG on task 0, producing a particular result. When switching to 2 MPI tasks, the first half of the TOD will be identical, but the second half will be different, since it was obtained from a RNG with a different seed. Maybe we are OK with that sort of inconsistency, but I wanted to make sure...

WIth 1/f noise, things become even a bit more complicated: strictly speaking, you will only get a "proper" noise stream with the specified properties if all samples are generated with one and the same noise generator (i.e. on a single task). When using multiple generators for different parts of the noise stream (e.g. when letting several MPI tasks generate one noise TOD), then there will be jumps between the individual parts, which violate the noise properties. Again, this effect may be negligible, but people can only judge that if they know that it exists :-)

ziotom78 commented 1 year ago

However, as far as I understand, this approach cannot solve the problem of having reproducible results when the number of MPI tasks is varied.

It's true that the current implementation of the code makes the actual noise stream dependent on the number of MPI processes. However, it should not be too hard to solve this because the noise generator created by Simulation.init_random is PCG, which has the nice property to be able to advance its internal state by N steps in a O(logN) time. Thus, one might modify add_white_noise so that it works in the following way:

Every MPI process creates a local copy of the global RNG in sim.simulation used by MPI process # 0, so that every process has the same generator
Process # 1 advances its own local generator by N₀ samples, where N₀ is the number of elements in the TOD of rank # 0
Process # 2 advances its own local generator by N₀+N₁ samples
Process # 3 advances its own local generator by N₀+N₁+N₂ samples

Thus, it should be possible in principle to make the noise reproducible even if one changes the number of MPI processes. But I wonder if this is a feature that is likely to be useful in the real world… I found that reproducible noise is something that helps you while debugging your code, and people are likely to run and run again their debug codes in the same configuration.

Maybe the best option is to clearly state this behavior in the documentation and the docstrings.

WIth 1/f noise, things become even a bit more complicated: strictly speaking, you will only get a "proper" noise stream with the specified properties if all samples are generated with one and the same noise generator (i.e. on a single task).

That's right. Last time I checked, this was the same behavior followed by TOAST2, which generates 1/f noise consistent only within one observation. (I believe this is mainly dictated by the fact that TOAST2 generates white noise, transforms it in Fourier space to add the 1/f slope, and then it converts the modified stream back to the time domain.)

For the LSPE/Strip pipeline, after having used the same approach as TOAST's, later we implemented a consistent 1/f noise generator that avoided jumps in the noise between the boundary of MPI processes (link), but I am not sure the penalty we paid (more complicated code, slower generation of 1/f noise) was worth the effort, as we saw no significant changes in the results of our simulations.

Probably the best solution is again to clearly state something about this in the documentation. For example, we might suggest users to be sure to assign long time spans to each MPI process when simulating 1/f noise.

mreineck commented 1 year ago

Great, if the RNG can be fast-forwarded, then the white noise problem should be easily solvable (if needed)!

In any case, I'm perfectly fine if the result depends on the number of MPI tasks, as long as we have this documented.

Creating a non-reproducible RNG anywhere in the code (i.e. initializing an RNG with np.random.default_rng()) is something I'd prefer to avoid in scientific code ... it's great for games, but not in situations where people may have to reproduce and debug a run that behaved oddly for unknown reasons.)

marcobortolami commented 1 year ago

Solved with #256.

litebird / litebird_sim

Noise default seed discussion #255