lemire / fastrand

Fast random number generation in an interval in Python: Up to 10x faster than random.randint.
Apache License 2.0
88 stars 13 forks source link

How pcg32_random_bounded_divisionless() work ? #3

Closed achan001 closed 6 years ago

achan001 commented 6 years ago

I peek at the code pcg32_random_bounded_divisionless().
Why variable 'leftover' able to reject unbiased randoms ?

No bias if leftover >= range or leftover(updated) >= threshold.
I don't follow above logic ...

Does this produce unbiased ranged number ?
Or, is it an approximation, using your idea in another blog:

https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/

achan001 commented 6 years ago

Why variable 'leftover' able to reject unbiased randoms ?

Uh, I meant to reject biased randoms

random32bit =  pcg32_random(); 
multiresult = random32bit * range; 
leftover = (uint32_t) multiresult; 
return multiresult >> 32;             // [0, range)

Returned result look just like your modulo-reduction trick.

As you mentioned in the blog, above code work very well, especially for small range. So, the question is how much the added sample rejection code contributed.

variable 'leftover' seems like a random number (MCG) to me ...

lemire commented 6 years ago

So, the question is how much the added sample rejection code contributed.

The question we ask is "given that we have a perfect source of random numbers, can you prove mathematically that the ranged result is unbiased". There is no "how much" question in this context. Results are either biased or unbiased.

A separate question is whether the bias matters. Well. That depends, doesn't it? Some applications might be super sensitive to biases, others might not care.

achan001 commented 6 years ago

So, the question is how much the added sample rejection code contributed

I phrase it that way because the code will generate bias, even with true random source. Even your comment for pcg32_random_bounded_divisionless() say so. (shuffle.c, line 55)

// map random value to [0,range) with slight bias, redraws to avoid bias if needed static inline uint32_t pcg32_random_bounded_divisionless(uint32_t range) { ...

So, the code will return random numbers with slight bias. Sample rejection code contributed to the benefit of redrawing.

I don't follow the criteria of redrawing, and how much it contribute. The code implied that is a line between acceptable and unacceptable bias It is unclear where the line is ...

achan001 commented 6 years ago

I made two programs to show that the code seems to work. This just fills the slots using modulo-reduction trick

#include <stdio.h>              // ranged_slot.c
#include <stdlib.h>       

int main(int argc, char **argv) 
{
  unsigned range = strtoul(argv[1], NULL, 10);
  int slot[range];
  for(int i=0; i<range; i++) slot[i] = 0;
  unsigned rand = 0;
  unsigned long long rand64 = 0;    
  do {
    slot[(int)(rand64 >> 32)]++;
    rand64 += range;            // rand64 = rand * range
  } while (++rand);             // all 32-bits randoms
  for(int i=0; i<range; i++)    // show the slots
    printf("slot[%d] = %d\n", i, slot[i]);
  return 0;
}

This showed what got rejected

#include <stdio.h>              // ranged_reject.c
#include <stdlib.h>       
#define SCALED(x, n)   (unsigned) (((unsigned long long) x * n) >> 32)

int main(int argc, char **argv)
{
  unsigned range = strtoul(argv[1], NULL, 10);
  unsigned threshold = -range % range;
  unsigned rand = 0, leftover = 0;
  printf("leftover < (threshold = %u):\n", threshold);  
  do {
    if (leftover < threshold) 
      printf("0x%08x --> %u\n", rand, SCALED(rand, range));
    leftover += range;          // leftover = (rand * range) & bit32  
  } while (++rand);             // all 32-bits randoms
  return 0;
}

For range = 6, all slots (after sample rejection) are equally likely

C:\tmp>ranged_slot 6
slot[0] = 715827883
slot[1] = 715827883
slot[2] = 715827882
slot[3] = 715827883
slot[4] = 715827883
slot[5] = 715827882

C:\tmps> ranged_reject 6
leftover < (threshold = 4):
0x00000000 --> 0
0x2aaaaaab --> 1
0x80000000 --> 3
0xaaaaaaab --> 4
lemire commented 6 years ago

Given your last comment, I think I now understand your point.

lemire commented 6 years ago

Ok. So you get a biased return...

>ranged_slot 6
slot[0] = 715827883
slot[1] = 715827883
slot[2] = 715827882
slot[3] = 715827883
slot[4] = 715827883
slot[5] = 715827882

Some values are more likely than others. As I point out in my blog post, the difference is at most one between frequencies, so for small ranges, it is a "slight" bias.

Try this program (with rejection)...

#include <stdio.h>              // ranged_slot.c
#include <stdlib.h>

int main(int argc, char **argv)
{
  uint32_t range = strtoul(argv[1], NULL, 10);
  uint32_t slot[range];
  for(int i=0; i<range; i++) slot[i] = 0;
  uint32_t rand = 0;
  uint64_t rand64 = 0;
  uint32_t threshold = -range % range;
  do {
    if((rand64 & 0xFFFFFFFF) >= threshold) slot[(uint32_t)(rand64 >> 32)]++;
    rand64 += range;            // rand64 = rand * range
  } while (++rand);             // all 32-bits randoms
  for(int i=0; i<range; i++)    // show the slots
    printf("slot[%d] = %d\n", i, slot[i]);
  return 0;
}

We get an unbiased result...

$ ./ranged_slot 6
slot[0] = 715827882
slot[1] = 715827882
slot[2] = 715827882
slot[3] = 715827882
slot[4] = 715827882
slot[5] = 715827882

You seem to want me to tell you when a bias is ok. I don't have an answer to this question.

I'll close this, reopen if I have not answered to your satisfaction.

achan001 commented 6 years ago

With your revised slots + rejection program, 6 slots are all equaly likely.

Is it always true ? (all 32-bits range)

Brute force to prove it always true probably take too long. Besides, it will not explain WHY and HOW it work.

I understand that rejected cases count is same as pcg32_random_bounded() (leftover = (rand * range) >>32, is a bijection of rand, rejection rate = threshold / 2^32)

But, why the rejected cases happened to be the slots with excess counts ? It probably is not an accident.

Explanation is still welcome.

lemire commented 6 years ago

With your revised slots + rejection program, 6 slots are all equaly likely.

Yes.

Is it always true ? (all 32-bits range)

Yes.

Brute force to prove it always true probably take too long.

Definitively.

Besides, it will not explain WHY and HOW it work.

Indeed. I am working on a short paper right now. It is not yet ready. I will publish it soon.

I understand that rejected cases count is same as pcg32_random_bounded() (leftover = (rand * range) >>32, is a bijection of rand, rejection rate = threshold / 2^32)

Yes. All these algorithms rely on the same math.

But, why the rejected cases happened to be the slots with excess counts ? It probably is not an accident.

Indeed, that is no accident. It is the desired result.

Explanation is still welcome.

Email me at lemire@gmail.com and I will share the draft of my paper.

achan001 commented 6 years ago

// map random value to [0,range) with slight bias, redraws to avoid bias if needed

You might consider changing above comment (split.c, line 55) I had mis-read it to imply the code accept slight bias, redraw if bias unacceptable

This might be better:

// map random value to [0,range) without bias (by redraws if needed)

lemire commented 6 years ago

You are correct. Done.

achan001 commented 6 years ago

I think I can visualize why the code removed all bias. (Note: not a proof)

After threshold values are rejected, remaining values are divisible by range.
So, if reject correctly picked threshold values, the result have no bias.

The reason it reject the slot with excess count is because all rejected cases
are all very very close to edge of 2 slots. It almost seems as if it could fall to
the wrong slots. Edge cases just barely able to squeeze into that particular slot.

So, if we have to reject anything, pick the edge cases.

lemire commented 6 years ago

After threshold values are rejected, remaining values are divisible by range.

That's the gist of the proof.

achan001 commented 6 years ago

Thanks.

Since possible values are evenly spaced, the slot with the edge case allow most room for other values.

Uh, I think I just proved it ...

lemire commented 6 years ago

I think you did, indeed.

achan001 commented 6 years ago

To confirm my edge case idea valid, I tried top edge instead of bottom

uint32_t unbiased = ~(-range % range);    // unbiased = ~threshold
while (leftover > unbiased) ...           // rejected same slots as leftover < threshold

For unknown reason, top edge rejection sometimes run faster

range_reject.c (post 5) , range = 6:
speedup = 5.393 sec / 3.595 sec = 1.501X, or 50% faster

lemire commented 6 years ago

It seems unlikely that you'd see a 50% speed difference. I will start with the usual warning: never benchmark on a laptop... always benchmark on a server configured for testing.

I cannot reproduce your speed difference. Try this gist:

https://gist.github.com/lemire/0ead15045a4c174799338b231bacf199

$ gcc -O3 -o isflipped isflipped.c
$ ./isflipped
initbase(buffer,len,range)                                      :  4.799 cycles per operation (best)    4.875 cycles per operation (avg)
flippedbase(buffer,len,range)                                   :  4.790 cycles per operation (best)    4.799 cycles per operation (avg)
$ ./isflipped
initbase(buffer,len,range)                                      :  4.798 cycles per operation (best)    4.874 cycles per operation (avg)
flippedbase(buffer,len,range)                                   :  4.793 cycles per operation (best)    4.799 cycles per operation (avg)
$ clang -O3 -o isflipped isflipped.c
$ ./isflipped
initbase(buffer,len,range)                                      :  6.029 cycles per operation (best)    6.030 cycles per operation (avg)
flippedbase(buffer,len,range)                                   :  6.029 cycles per operation (best)    6.051 cycles per operation (avg)
$ ./isflipped
initbase(buffer,len,range)                                      :  6.029 cycles per operation (best)    6.030 cycles per operation (avg)
flippedbase(buffer,len,range)                                   :  6.029 cycles per operation (best)    6.030 cycles per operation (avg)
achan001 commented 6 years ago

ranged_reject.c (post 5) does not have a PRNG (not even slots!)

With cost of PRNG added, the effect of top edge vs bottom edge is just noise.

My goal is to confirm top edge rejected the same slots ... it does ! The speedup is just a curious observation (w/ strawberryperl.com supplied gcc 7.1.0)

achan001 commented 6 years ago

Your pre-screening test for pcg32_random_bounded_divisionless_flipped is wrong

Pre-screened cases must include all the top edge cases.
It should be like this:

if (leftover > ~range) ...     // pre-screening

If range > 0, above can be optimized a bit more:

if (leftover > -range) ...     // pre-screening, range > 0
lemire commented 6 years ago

Your pre-screening test for pcg32_random_bounded_divisionless_flipped is wrong

Right. It still does not affect the running speed.

With cost of PRNG added, the effect of top edge vs bottom edge is just noise.

Makes sense!

achan001 commented 6 years ago

You might consider adding this top edge case to your unfinished draft.
It showed that the threshold rejection test is just picking the edge cases.

Which edge cases to pick does not matter (as long as it is the same side)

lemire commented 6 years ago

@achan001 Mathematically, this needs to be pointed out, but there is no reason to have two distinct implementations unless one has benefits, somehow.

lemire commented 6 years ago

Sorry for the delay, I posted the reference tonight:

achan001 commented 6 years ago

Thank you.