stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

`create_rng` with chain=0 appears to return biased first draw #3167

Closed cescalara closed 1 year ago

cescalara commented 1 year ago

Summary:

When using poisson_rng inside the transformed data block, the results do not follow the expected Poisson distribution. The issue only seems to appear for rate values of lambda < 10, e.g. poisson_rng(9.9), and only inside the transformed data block; using poisson_rng inside the generated quantities block is fine.

Description:

If I compile and run the Stan code shown below as a simulation (i.e. fixed_param=True) via cmdstanpy, I get very different output depending on the block in which the function is called. When calling poisson_rng in the transformed data block, the output doesn't make sense. I also include the Python code I use and a plot demonstrating this.

Stan code: poisson.stan

data {
  real Nex;
}

transformed data {
  int N_td;
  N_td = poisson_rng(Nex);
}

generated quantities {
  int N_out_td;
  int N_gq;

  N_out_td = N_td;
  N_gq = poisson_rng(Nex);
}

Python code:

import numpy as np
from scipy import stats
from matplotlib import pyplot as plt
from cmdstanpy import CmdStanModel

poisson_model = "poisson.stan"
model = CmdStanModel(stan_file=poisson_model)

data = {"Nex": 9.9}
output_N_td = []
output_N_gq = []

for _ in range(100):
    seed = np.random.randint(1, 10000)
    sim = model.sample(data=data, iter_sampling=1, chains=1, fixed_param=True, seed=seed)
    N_td = sim.stan_variable("N_out_td")[0]
    N_gq = sim.stan_variable("N_gq")[0]
    output_N_td.append(N_td)
    output_N_gq.append(N_gq)

fig, ax = plt.subplots()
max_x = 30
x = np.arange(max_x)
bins = np.linspace(0, max_x, max_x+1)-0.5
ax.hist(output_N_td, bins=bins, density=True, label="Output from transformed data", alpha=0.5)
ax.hist(output_N_gq, bins=bins, density=True, label="Output from generated quantities", alpha=0.5)
ax.scatter(x, stats.poisson(data["Nex"]).pmf(x), color="k", label="Poisson PMF")
ax.set_title("Histograms of 100 samples, Nex=9.9")
ax.legend()
fig.savefig("figures/stan_poisson_db.png")

Figure: stan_poisson_db

Additional Information:

I often use poisson_rng in the transformed data block to implement a simulation of mock data from a Poisson process.

Current Version:

INSTALLED VERSIONS
---------------------
python: 3.9.12 (main, Jun  1 2022, 11:38:51) 
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-135-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
cmdstan_folder: /home/iwsatlas1/capel/.cmdstan/cmdstan-2.31.0
cmdstan: (2, 31)
cmdstanpy: 1.1.0
pandas: 1.4.3
xarray: 2022.6.0
tqdm: 4.64.0
numpy: 1.21.5

I'm curious whether others are able to reproduce this, or whether I am missing something. In case this is the wrong place to post the issue, I can also move it elsewhere. Thanks a lot!

WardBrian commented 1 year ago

Very odd. I can confirm that I reproduce the behavior you see.

Investigating now

WardBrian commented 1 year ago

Hm, inserting an extra RNG call in the transformed data block changes the behavior:

data {
  real Nex;
}
transformed data {
  real ignored = uniform_rng(0,1);

  int N_td;

  N_td = poisson_rng(Nex);
}
generated quantities {
  int N_out_td;
  int N_gq;

  N_out_td = N_td;
  N_gq = poisson_rng(Nex);
}

I suspect the bug you've found is not in cmdstanpy, but somewhere deeper in Stan. I will transfer this issue when I figure out where it should go!

WardBrian commented 1 year ago

Looking at the code now, it seems we

1) use a different RNG (with the same seed) inside the model constructor as everywhere else
2) do not supply the chain ID to advance the RNG in the constructor, so all chains would have the same RNG in the transformed data block (we just pass chain id 0).

Both of those seem like potential issues, but neither really explains whatever is going on here. (Edit: 2 is intentional: https://discourse.mc-stan.org/t/transformed-data-rng-varies-by-chain/569)
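For reference, the per-chain advancement scheme being described can be sketched as follows. This is a minimal sketch, not the actual create_rng.hpp code: it uses std::mt19937_64 as a stand-in for Stan's Boost generator, the function name is hypothetical, and the stride is made tiny so discard() finishes quickly (standard-library engines discard in linear time; the real stride is on the order of 2^50):

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Illustrative stand-ins: Stan uses a Boost generator and a stride of
// roughly 2^50; we use std::mt19937_64 and a tiny stride here.
constexpr std::uint64_t DISCARD_STRIDE = 1 << 20;

std::mt19937_64 create_rng_sketch(unsigned seed, unsigned chain_id) {
  std::mt19937_64 rng(seed);
  // All chains share one seed; each skips ahead to its own window of
  // the stream. Note that with chain_id == 0 nothing is discarded at
  // all, so chain 0 draws directly from the freshly seeded state.
  rng.discard(DISCARD_STRIDE * chain_id);
  return rng;
}
```

The key point for this bug is the chain_id == 0 case: it is the only one that never advances the generator past its seeded state.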

I've also confirmed that setting the RNG to any other chain (by editing the generated poisson.hpp) fixes the problem, and that setting id=0 breaks the RNG for generated quantities as well.

Very odd!

WardBrian commented 1 year ago

Oh also, it has nothing to do with Poisson; here's the same observation using uniform_rng(0,1):

Figure_1

cescalara commented 1 year ago

Thanks for looking into this! Happy to help out if I can. I thought it was particularly weird that setting lambda>10 seemed to fix things...

WardBrian commented 1 year ago

It seems like this issue may lie even below Stan, but it's possible we can fix it ourselves.

Here's the equivalent of what we're doing in pure C++ with Boost: https://godbolt.org/z/h98Gbdf9j

If you change out seeds, you'll see that the first value printed is always very high, but the following ones look normal.

Very odd! We're still looking into this, but we will definitely make sure something is changed for the next version of Stan. For now, inserting a dummy real = uniform_rng(0,1) call seems to fix subsequent RNG draws.

WardBrian commented 1 year ago

I've transferred this to the Stan repository because this is most likely where it will be fixed, by editing https://github.com/stan-dev/stan/blob/develop/src/stan/services/util/create_rng.hpp

ahartikainen commented 1 year ago

If you change out seeds, you'll see that the first value printed is always very high, but the following ones look normal.

In Stan or in C++?

WardBrian commented 1 year ago

Just in C++. The godbolt link above shows what I'm describing using only the boost parts we use in Stan.

ahartikainen commented 1 year ago

Just in C++. The godbolt link above shows what I'm describing using only the boost parts we use in Stan.

Are you sure? I did not see anything peculiar?

WardBrian commented 1 year ago

If you vary the seed, you'll see things like:

Seed: 233

Program stdout

12
2
2

The first number printed is usually >10 regardless of seed, which is quite unlikely for poisson(3). Furthermore, if I change the code so that there is always an rng.discard(1) call, this behavior vanishes.

WardBrian commented 1 year ago

Hmm, the behavior is even weirder than that: it also depends on the seed being small.

The original python code used a random number between 1 and 10,000 as a seed. If you make this much bigger (I used seed = np.random.randint(10000, 10000000)) then the graphs start looking correct.

ahartikainen commented 1 year ago

Ok, now I see it

WardBrian commented 1 year ago

Here's another godbolt which shows the issue more precisely: https://godbolt.org/z/e9veEn5T5

bob-carpenter commented 1 year ago

Could you also file an issue with Boost? I know we'll need to work around this in the short term, but it's probably something they want to know, too, as giving things small seeds by hand is a very common practice. My guess is that it's a bad interaction between the linear congruential pRNG we use and the implementation of the Poisson RNG.

WardBrian commented 1 year ago

It doesn't seem to be unique to Poisson; here is the same code but with a uniform(0,1) RNG: https://godbolt.org/z/zfT579dox The first draw is always close to 1, while the second draw appears to come from the correct distribution.

That said, it also isn't universal - beta and normal both look correct in both situations.
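The "first uniform draw close to 1" observation is consistent with how the combined generator is constructed. ecuyer1988 is L'Ecuyer's (1988) additive combination of two LCGs with multipliers 40014 and 40692 and moduli 2147483563 and 2147483399. Assuming both component LCGs are seeded with the same small value s (this sketch's assumption about Boost's seeding), the first combined output is (40014·s − 40692·s) mod 2147483562, which is negative before the wrap and so lands just below the modulus. A minimal re-implementation of the combination, for illustration only:

```cpp
#include <cassert>
#include <cstdint>

// Minimal re-implementation of the L'Ecuyer (1988) combined LCG that
// boost::ecuyer1988 is based on. Assumption: both component LCGs are
// seeded with the same value, mirroring how Boost seeds the engine.
struct Ecuyer1988Sketch {
  std::int64_t s1, s2;
  explicit Ecuyer1988Sketch(std::int64_t seed) : s1(seed), s2(seed) {}

  // Combined output in [1, 2147483562].
  std::int64_t operator()() {
    s1 = (40014 * s1) % 2147483563;
    s2 = (40692 * s2) % 2147483399;
    std::int64_t z = (s1 - s2) % 2147483562;
    if (z < 1) z += 2147483562;  // wrap negative/zero back into range
    return z;
  }

  double uniform() { return static_cast<double>((*this)()) / 2147483562.0; }
};
```

For any small seed s, both first products 40014·s and 40692·s stay below their moduli, so the first output is 2147483562 − 678·s; seed 233 gives about 0.99993 as the first uniform draw. Discarding one value before use removes the artifact, matching the rng.discard(1) observation above.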

bob-carpenter commented 1 year ago

That makes sense. For our pRNG, there's an underlying 256-bit representation of the state of the RNG. I don't know how state is initialized from a 32-bit uint seed, but my guess is that a small number will somehow fail to get a good 256-bit initialization for the way the pRNGs are implemented for Poisson, uniform, etc.

Also, I had no idea C++11 had a lot of this implemented. For example,

https://en.cppreference.com/w/cpp/numeric/random/poisson_distribution

WardBrian commented 1 year ago

Boost issue opened.

The easiest workaround for us seems to be always discarding at least 1 draw when we create the RNG. If we want to maximize backwards compatibility of seeds, we can add 1 to the discard amount only when the chain ID is 0; otherwise, we can unconditionally add 1 and state in the release notes that seeded runs will produce different random behavior in this version.

bob-carpenter commented 1 year ago

@mitzimorris said this issue may have been introduced relatively recently when we changed the default ID of the first chain from 0 to 1 (to avoid having to advance when not required, but it looks like it was required!).

WardBrian commented 1 year ago

I think changing the default ID to 1 actually made this bug less common. Now it is only apparent in two scenarios:

  1. The user manually sets the ID to 0 and then runs something like standalone generated quantities with a poisson or uniform variable
  2. The constructor of the model class hard-codes the chain ID to 0, so the transformed data block will see this for those distributions

bob-carpenter commented 1 year ago

Sorry, I got that backwards. It changed from 1 to 0.

bob-carpenter commented 1 year ago

Also, thanks much for reporting, @cescalara. We're all still marveling at how you managed to discover this bug given how subtle and seed-dependent it is.

WardBrian commented 1 year ago

Looking through the history, it used to discard DISCARD_STRIDE * (chain - 1) until this PR in 2017: https://github.com/stan-dev/stan/pull/2313, after which it discards DISCARD_STRIDE * chain, as it does now. The multi-chain PR https://github.com/stan-dev/stan/pull/3033 edited this file while in progress but ultimately kept this behavior.

I believe the model class has used chain id 0 since the era of that 2017 PR, so the behavior would have been noticeable as long ago as that.

rok-cesnovar commented 1 year ago

Until we get a Boost bugfix, my vote would be to discard std::max(1, DISCARD_STRIDE * chain) and announce it in the release notes and everywhere else. If and when Boost ships a fix, we can switch back to what we currently have.

The only drawback (at least that I see) is backward compatibility for anyone using chain ID = 0. But anyone using that would have been using a biased RNG, and there is no reason to leave a bug in for backward compatibility.
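As a sketch of that one-line change (the helper name is hypothetical, not the actual create_rng.hpp code), note that std::max needs an explicit template argument here, since the literal 1 and the 64-bit product otherwise have mismatched types:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Stride used to separate per-chain substreams (order of magnitude
// taken from the discussion above).
constexpr std::uint64_t DISCARD_STRIDE = 1ULL << 50;

// Hypothetical helper for the proposed workaround: always discard at
// least one draw, so chain 0 never draws from the raw seeded state.
std::uint64_t discard_count(std::uint64_t chain) {
  return std::max<std::uint64_t>(1, DISCARD_STRIDE * chain);
}
```

With this, discard_count(0) is 1 while every other chain keeps its current stride, so only chain-0 seeded runs change behavior.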

kotika commented 1 year ago

No one should be using the ecuyer1988 RNG, as it is obsolete. The replacement boost::ecuyer1988 -> boost::random::mixmax should solve your issues. MIXMAX is a modern, industrial-strength RNG. It has been made the default in major physics Monte Carlo packages, GEANT4 and CMS-SW: https://github.com/Geant4/geant4/blob/master/source/externals/clhep/src/MixMaxRng.cc https://github.com/cms-sw/cmssw/pull/21339/commits/e79b4675d9d9907c053e4e1a99089271ad6bdeb4

Along with producing sequences statistically indistinguishable from true random numbers, attention has been paid to seeding: the sequence obtained from seed(1) starts out perfectly randomized and is completely independent from seed(2), seed(3), etc., including the case where you test the quality by interleaving two or more streams.

WardBrian commented 1 year ago

It appears MIXMAX defines discard as

void discard(boost::uint64_t nsteps) { for(boost::uint64_t j = 0; j < nsteps; ++j)  (*this)(); } ///< discard n steps, required in boost::random

We regularly call discard with an argument on the order of 2^50, so this wouldn't exactly be a drop-in replacement for our use case; it would require reconsidering how we use the random number generators.

kotika commented 1 year ago

@WardBrian, @bob-carpenter Most RNGs work on a single trajectory, where seeding can place the state of the RNG at a "random" point.
The RNG called ecuyer1988 has a period of about 10^18, so advancing it by 2^50 steps would exhaust the period and create the danger that random numbers simply repeat between different draws. Moreover, it is certain that a generator like ecuyer1988 will produce identical streams of random numbers from pairs of completely different seeds (with a lag of e.g. 1).

In MIXMAX, the trajectory is astronomically long, and the state of seed(k+1) is the state of seed(k) advanced by exactly 2^512 steps. Perhaps if I understood what task you accomplish with discard, I could find some solution for you.

WardBrian commented 1 year ago

The RNG called ecuyer1988 has the period of about 10^18

Table 32.6 in the Boost documentation claims ecuyer1988 has a period of approximately 2^61

Your comment https://github.com/boostorg/random/issues/92#issuecomment-1446210886 is correct about why we discard, it is for parallel MCMC chains. We currently create each chain with the same seed but discard (2^50 * chain_id) elements.

The behavior you're describing for MIXMAX sounds like something we could reasonably use. If I'm understanding correctly, we could then seed each chain with (user_supplied_seed + chain_id).
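If per-seed streams really are independent, as described for MIXMAX above, chain setup reduces to direct seeding with no discard at all. A sketch of that idea, using std::mt19937_64 purely as a placeholder engine (MIXMAX is not in the standard library, and mt19937_64 makes no such independence guarantee; the function name is hypothetical):

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Placeholder engine: the idea relies only on the generator giving
// statistically independent streams for distinct seeds, which is the
// guarantee described for MIXMAX above -- purely illustrative here.
using Engine = std::mt19937_64;

Engine make_chain_rng(std::uint32_t user_seed, std::uint32_t chain_id) {
  // No discard needed: each chain gets its own seed and hence, under
  // the engine's claimed guarantee, its own independent stream.
  return Engine(static_cast<std::uint64_t>(user_seed) + chain_id);
}
```

This would replace the current discard-based substream scheme entirely, and would also avoid MIXMAX's linear-time discard problem noted above.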