rust-random / rand

A Rust library for random number generation.
https://crates.io/crates/rand
Other
1.67k stars 432 forks source link

thread_rng()..gen_range seems to produce numbers close together #1512

Closed mdavisJr closed 1 month ago

mdavisJr commented 1 month ago

Summary

A clear and concise description of what the bug is. I don't think this is a bug but it kind of seems that a lot of the random numbers are clumped together. Like if I get 5 random numbers between 1 and 80 and I do a check to see if any of the random 5 numbers are within 3 of each other, it is mostly like going to be true.

What behavior is expected, and why?

I would expect the numbers to be more spaced out. Like maybe 2, 28, 34, 55, 79.

Code demonstrating the problem


for loop_idx in 1..=10{
      let mut rng = thread_rng();
      let mut arr = Vec::::new();
      for _ in 1..=5 {
          arr.push(rng.gen_range(min..=max));
      }
      arr.sort();
      let r = check_within(3, &arr);
      if r > 0 {
          direct_bunch_count += 1;
      }
      println!("{}.\t{:?}--{}", loop_idx, arr, r); 
  }
  println!("");
  println!("");

  println!("Direct Count: {}/{}", direct_bunch_count, 10);

min = 1, max = 80

1.      [14, 26, 52, 60, 75]--0
2.      [4, 7, 16, 53, 73]--1
3.      [5, 37, 62, 62, 66]--1
4.      [9, 12, 49, 50, 72]--2
5.      [27, 53, 60, 65, 69]--0
6.      [24, 37, 37, 54, 77]--1
7.      [22, 24, 50, 69, 72]--2
8.      [1, 12, 65, 72, 78]--0
9.      [24, 25, 35, 78, 80]--2
10.     [9, 12, 30, 53, 70]--1

Direct Count: 7/10

min = 1, max = 40

1.      [4, 7, 11, 16, 39]--1
2.      [21, 23, 23, 26, 27]--4
3.      [15, 26, 27, 31, 35]--1
4.      [10, 16, 17, 21, 28]--1
5.      [6, 21, 22, 25, 25]--3
6.      [13, 18, 23, 29, 39]--0
7.      [20, 24, 25, 35, 36]--2
8.      [6, 17, 19, 27, 39]--1
9.      [11, 21, 25, 31, 32]--1
10.     [1, 8, 13, 15, 38]--1

Direct Count: 9/10

As you can see when min = 1 and max = 80, 7/10 sets have numbers within 3 when min = 1 and max = 40, 9/10 sets have numbers within 3. I know that thread_rng()..gen_range uses Uniform...Is there another struct that will give me the behavior that I'm expecting.

newpavlov commented 1 month ago

https://en.wikipedia.org/wiki/Clustering_illusion

mdavisJr commented 1 month ago

I could be wrong but don't feel that it is an illusion when I have test case that produces around the same results listed in the post when you run the test case above.

Even if I raise the loop_count from 10 to 1000, when min = 1 and max = 40 the count is:

Direct Count: 883/1000

When min = 1 and max = 80 it gets a little better and goes to

Direct Count: 633/1000

benjamin-lieser commented 1 month ago

Assuming the numbers are perfectly random and independently sampled from each other, how much of close numbers do you expect? (This is not super trivial to calculate, but this number exists)

It is only a bug if you observe a different number with rand than what would theoretically be expected.

newpavlov commented 1 month ago

@mdavisJr Try a similar code with any other language/library, or better compute expected value using statistical theory.

ThreadRng uses CSPRNG seeded with system randomness and you can see sampling implementation here (IIRC it contains statistically insignificant bias, but it should not be important in your case).

dhardy commented 1 month ago

Lets see if I understand your problem exactly...

  1. You sample integers between 1 and 40 (inclusive), then check whether the difference from the prior sample is no greater than 3. Ignoring the ends of this range, this means that for any sample, the next sample has a 7/40=17.5% chance of being "within 3". Accounting for the ends, this reduces to 16.75%.
  2. You sample five times, thus four chances of getting a "within 3". Thus, the chance of not doing this is (1 - 0.1675)^4 = 48% (approx). One could therefore expect 52% of 1000 samples (520) to hit this metric, yet you counted 883.
newpavlov commented 1 month ago

@dhardy I think your math is wrong. You calculate independent sample pairs, but we have 5 sampled values for which the value is calculated. It's the same mistake which results in the birthday paradox.

dhardy commented 1 month ago

It would have helped if he would have included source for fn check_within. If it's just checking that any two numbers are separated by no more than 3, then the result makes sense. You need to use almost half the available range just to avoid any "close" results (1, 5, 9, 13, 17).

benjamin-lieser commented 1 month ago

It would have helped if he would have included source for fn check_within. If it's just checking that any two numbers are separated by no more than 3, then the result makes sense. You need to use almost half the available range just to avoid any "close" results (1, 5, 9, 13, 17).

He sorted before ;)

dhardy commented 1 month ago

Yes, that sort makes a huge difference: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=af19c837fa92b0f3c0e25efe30a49e8a

mdavisJr commented 1 month ago

So I guess it is normal that if I get 5 random numbers between 1 and 40, 8 to 9 times out of 10 2 of those 5 numbers are going to be within 3? Is there anyway to prevent this behavior or is it just the way randomness works?

WarrenWeckesser commented 1 month ago

Is there anyway to prevent this behavior or is it just the way randomness works?

You might be interested in learning about quasi-random (or low discrepancy) sequences.

mdavisJr commented 1 month ago

By the way love you guys' work and appreciate the quick feedback. Do you guys have any plans to add quasirandom sequences to this library or any plans on creating a new library that generates quasirandom sequences?

dhardy commented 1 month ago

No. #182 was related.

The main problem is that our algorithms (Uniform, Bernoulli, shuffling, non-linear sampling, ...) assume that the underlying generator is uniform. If this assumption is dropped, it's hard to say what exactly would happen without case-by-case analysis (e.g. your 1..=40 range may no longer be uniformly sampled on aggregate, or distribution of results might appear much the same — you get very different results if you use x % 40 instead of x / ($ty::MAX / 40).

Feel free to make your own non-uniform generator (RngCore) and plug it into rand's algorithms, but don't complain if the results are unexpected.