libc++: Wrong Initial Distribution When Varying the Seed

jonas-eschmann commented 1 week ago

The default random engine (std::minstd_rand0) produces a highly correlated initial distribution when varying the seeds (tested with clang 19.1.0 and seems to be the same for all prior versions):

#include <iostream>
#include <random>

int main(){
    for(int seed = 0; seed < 100; seed++){
        auto rng = std::default_random_engine(seed);
        float value = std::uniform_real_distribution<float>(0, 1)(rng);
        std::cout << value << "\n";
    }
    return 0;
}

Output

...
0.000101742
0.000109569
0.000117395
0.000125221
0.000133048
0.000140874
0.000148701
0.000156527
0.000164353
0.00017218
0.000180006
0.000187832
0.000195659
0.000203485
0.000211312
0.000219138
0.000226964
0.000234791
0.000242617
0.000250443
0.00025827
0.000266096
0.000273922
0.000281749
0.000289575
0.000297402
0.000305228
0.000313054
0.000320881
0.000328707
0.000336533
0.00034436
...

https://godbolt.org/z/x7jrEcaME

I think users would expect this to be uniformly distributed. GCC has the same issue but MSVC seems fine

keinflue commented 1 week ago

You are using libstdc++ in the godbolt link and for the results you show here. You need add -stdlib=libc++ to use libc++ instead. However, the result will look similar.

keinflue commented 1 week ago

The random number sequence produced for a specific seed is fully determined by the specification in the standard. There is no room for an implementation to produce a different result.

Room for different implementation behavior is only in the algorithm used to produce the results of uniform_real_distribution from the sequence of random numbers produced by the generator itself. However, if the values produced by the generator are already bad quality, then there isn't much to improve there either.

The minstd_rand0 generator needs warmup and doesn't produce high-quality results anyway.

The reason it seems better in MSVC is that the Microsoft STL uses std::mt19937 (the Mersenne twister) instead of minstd_rand0 as default random engine. However, it also requires warmup to avoid correlated results.

jonas-eschmann commented 1 week ago

@keinflue thanks a lot for the clarification!

I think this behavior is quite dangerous and I suspect it affects quite many users, most of who will probably never debug/drill down and just get bad statistical results. Is there a reason not to use std::mt19937 as the std::default_random_engine? This would probably save a lot of people a lot of time and/or save them from poor statistical results

keinflue commented 1 week ago

An actual libc++ developer will have to answer that. I am only a user as well.

However, I would only ever consider using default_random_engine in situations where quality really doesn't matter much and I just need some vaguely random data ad-hoc. The problem with the Mersenne twister in these scenarios is that it is much much slower than minstd_rand0. I basically see default_random_engine as a replacement for the old rand()/srand(), which also has effectively no quality guarantees and is traditionally implemented with speed over quality in mind. In these cases the seed is usually randomly chosen as well, e.g. from std::random_device, avoiding the problems you see here.

Anything where reproducibility is relevant (e.g. in scientific problems), I would always be explicit about the PRNG being used and some time needs to be spent to verify whether a given PRNG is appropriate for the task as well. (Btw., this usually means not using the <random> distribution functions either, because they use different algorithms in different implementations.)

philnik777 commented 1 week ago

Which engine the default_random_engine is is part of the ABI, so we can't change it even if we wanted to, at least in the stable ABI. I don't know whether we want to change it in the unstable ABI, since I don't know the reasoning for the current choice.

jonas-eschmann commented 1 week ago

After thinking about this again, I think the "silent" behavior of generating a wrong distribution is much more harmful on balance than the more apparent behavior of a slower default random engine. There is a clear path for people to discover the performance bottleneck through profiling, while in case of the silent failure of generating a bad distribution, the path to identify it as the culprit for downstream problems is very domain-specific and might easily go unnoticed while still producing (slightly) biased results.

llvm / llvm-project

libc++: Wrong Initial Distribution When Varying the Seed #114829