usnistgov / SP800-90B_EntropyAssessment

The SP800-90B_EntropyAssessment C++package implements the min-entropy assessment methods included in Special Publication 800-90B.
202 stars 87 forks source link

Use of rand within Fisher-Yates #54

Closed joshuaehill closed 5 years ago

joshuaehill commented 6 years ago

rand() can actually output RAND_MAX, so your code r = (rand() / (float)RAND_MAX) * (i + 1); Can output i+1 on all compilers. Further, casting in C++ (and in C) does not actually uniquely define the result of casting a float to an integer; For values where the exact value can't be represented, the compiler can provide the value on either side, so there may well be a large number of values that result in r being the value i+1. This isn't a valid result in this context (for example, this would result in a buffer overflow for the first swap).

Further, this approach isn't likely to actually yield the desired uniform distribution.

I suggest either using the easy to understand (but slow) double mod approach (wherein you first establish an upper bound based on a modular reduction of the output range size mod the desired bound, and then reduce the final random value mod the desired bound), or the very fast and pretty (but opaque) approach recently developed by Lemire in "Fast Random Integer Generation in an Interval": https://arxiv.org/abs/1805.10941

Finally, (and this is getting into the weeds) you may want to not use the built-in rand function, as that commonly resolves to a linear congruential generator, which isn't that great of an RNG. There are a number of excellent alternatives, including using a Mersenne Twister or xoshiro256**.

celic commented 6 years ago

I received comments on rand() and would agree it should be swapped out. MT would be a good approach.

joshuaehill commented 6 years ago

MT is non trivial to implement correctly, and I'm not sure what license requirements that NIST operates under. I'll submit a pull request that uses xoshiro256**, just because that's small, easy, quick, and in the public domain.

If you have access to a MT implementation that is appropriate for use in this context, that would also be a good (but slower) choice.

I'll also include an implementation of Lemire's generation approach.

joshuaehill commented 6 years ago

Incidentally, the following is a fairly good summary of a bunch of approaches, together with some notes on their performance: http://www.pcg-random.org/posts/bounded-rands.html