llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.03k stars 11.58k forks source link

uniform_real_distribution isn't uniform. #23542

Open llvmbot opened 9 years ago

llvmbot commented 9 years ago
Bugzilla Link 23168
Version unspecified
OS All
Reporter LLVM Bugzilla Contributor
CC @jrmuizel,@mclow

Extended Description

Looking at the implementation of uniform_real_distribution it's quite obvious that it isn't respecting the pigeonhole principle. And since lots of other distributions are built on top of uniform_real_distribution, all those other distributions are broken too.

Here's a very simple test:

$ cat urd.cxx

include

include

include

include

int main() { std::default_random_engine gen; double from = 0x1p52, to = from + 2; std::uniform_real_distribution dis(from, to); long long bucket[3] = { 0 }; int i;

    for (i = 0; i < 1000000; i++) {
            double r = dis(gen);
            unsigned int b = (int)(r - from);
            if (b >= 3) {
                    printf("foo: %d %f\n", b, r);
            }
            assert(b < 3);
            bucket[b]++;
    }

    for (i = 0; i < 3; i++) {
            printf("%lld\n", bucket[i]);
    }

    return 0;

} $ c++ -v Apple LLVM version 6.0 (clang-600.0.57) (based on LLVM 3.5svn) Target: x86_64-apple-darwin14.1.0 Thread model: posix $ c++ -std=c++11 -o foo urd.cxx && ./foo 249701 500046 250253

I looked through the code available on the svn trunk on the web and it hasn't changed enough to fix this.

The reason 0x1p52 is the start of the range is because at [2^52,2^53) doubles have exactly integer precision which allows the generator only three possible output values. Any other range could be chosen and display the same problem as long as we constrict the generator to few (non 2^n) outputs. Three possible outputs will display the greatest bias in the output.

GCC libstdc++ is broken too in the exact same way.

FYI. A while ago I started an attempt to make a decent solution to a small part of this problem because it's harder than it looks: https://github.com/art4711/random-double/blob/master/rd.c

llvmbot commented 9 years ago

Correction to the report.

  1. Even 2^n outputs show a nasty bias.
  2. A better link to my experiments for how to solve this is https://github.com/art4711/random-double because I extended my experiments to arbitrary ranges in a different file (with lots of comment explaining the reasoning).