wax-office-of-inspector-general / wax-developer

WAX Developer Portal - Learn, Build, Operate & Create
https://developer.wax.io

proposed COMMENTS on changes to rng_sample.md warning about modulo bias #139

Closed · Mushini closed this 1 year ago

Mushini commented 1 year ago

Added comments about the following:

- max_value isn't actually the max value
- the code suffers heavily from modulo bias
- the article does not mention modulo bias

I'd suggest changing random_int to a uint64_t and assigning 8 bytes instead of 1; changing num1 to better reflect the notion of a "max_value"; and rewriting the article to warn against small datatypes and modulo biases (possibly with a link on the subject).
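For illustration, here's a rough sketch of the uint64_t approach (not the tutorial's actual code; it assumes the 32 random bytes from the oracle callback are already available as a std::array<uint8_t, 32>, and the names are made up):

```cpp
#include <array>
#include <cstdint>

// Sketch: build a 64-bit value from 8 of the random bytes instead of 1,
// then reduce modulo the number of possible outcomes. With 64 bits the
// residual modulo bias for a divisor like 100 is negligible (on the order
// of 100 / 2^64), though not strictly zero.
uint64_t random_in_range(const std::array<uint8_t, 32>& random_bytes,
                         uint64_t range /* number of outcomes, e.g. 100 */) {
    uint64_t random_int = 0;
    for (int i = 0; i < 8; ++i) {
        random_int = (random_int << 8) | random_bytes[i];
    }
    return random_int % range;  // 0 .. range-1, so the max value is range-1
}
```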

The WAX RNG Tutorial rng_sample gives an example of getting a random number (0-99) using a single byte, but it doesn't mention modulo bias anywhere. With a uint8_t, random_int % 100 gives a +50% bias to the first 56 numbers (0-55), because each of the values 0-55 maps to three raw byte values while each of 56-99 maps to only two; a byte isn't large enough to mitigate modulo bias in this case. Modulo bias is an important subject when it comes to RNG: it can be eliminated with rejection sampling or proper modulo boundaries, or it can be mitigated by making the storage space large relative to the modulus (e.g. if random_int's data type were uint64_t).
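A minimal sketch of the rejection-sampling alternative, again with made-up names and an 8-bit raw value just to keep the example small:

```cpp
#include <cstdint>
#include <optional>

// Rejection sampling: only accept raw values below the largest multiple of
// `range` that fits in the 0-255 domain; otherwise reject and draw a new byte.
// With range = 100, anything >= 200 is rejected, so 0-99 stay equally likely.
std::optional<uint8_t> try_reduce(uint8_t raw, unsigned range) {
    unsigned limit = (256u / range) * range;   // 200 when range == 100
    if (raw >= limit) {
        return std::nullopt;                   // caller should draw another byte
    }
    return static_cast<uint8_t>(raw % range);  // unbiased 0 .. range-1
}
```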

Using a single byte for this would be bad practice to follow and to teach, and I think it would be good for the tutorial to mention modulo bias.

Also, it says the max_value is 100, but the max value is actually 99 (the largest X % 100 can be is 99).

(Also, the Twitter link in the site's footer leads to the Discord, not Twitter.)

3dkrender commented 1 year ago

Yes, it is true that when using a number of only 8 bits, the rate of appearance of the numbers 0-55 is 33% higher than the rest. I've modified the code to extract a 16-bit portion and reduce the modulo bias. I have also changed the names of some variables to make the code easier to read and understand.

The new code can be seen here: https://github.com/3dkrender/WAX-RNG-Test/commit/1d1230f5fbd473c169251f620b264b95f1bcad78
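The commit has the real changes; purely as a self-contained illustration of the 16-bit extraction idea (variable names here are illustrative and may differ from the commit):

```cpp
#include <array>
#include <cstdint>

// Combine two of the random bytes into a 16-bit random_chunk before reducing.
// For divisor 100, the per-slot advantage drops from ~50% (8-bit) to ~0.15% (16-bit).
uint16_t roll_from_chunk(const std::array<uint8_t, 32>& random_bytes,
                         uint16_t divisor) {
    uint16_t random_chunk =
        static_cast<uint16_t>((random_bytes[0] << 8) | random_bytes[1]);
    return random_chunk % divisor;
}
```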

Mushini commented 1 year ago

Thanks 3dkrender. The bias/bonus % I gave is the difference between the biased slots (0-55) and the remaining slots (56-99); it is also the difference between min and max, and it's meant to reflect an advantage. With a 16-bit random_chunk (and divisor = 100), the first 36 numbers (0-35) each have a 0.15% higher appearance rate than the remaining numbers (36-99), i.e. low numbers have a 0.15% advantage. I have created a script that estimates the distribution given the max_size of random_chunk and a divisor, and that also simulates the rolls (using max_size and divisor).
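The script itself isn't attached here, but a rough self-contained sketch of the kind of estimate + simulation it does (not the actual script; the constants are just the 16-bit / mod-100 case) might look like:

```cpp
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const unsigned bits    = 16;          // width of random_chunk
    const uint64_t domain  = 1ull << bits;
    const uint64_t divisor = 100;

    // Estimate: the first (domain % divisor) slots receive one extra hit per
    // full sweep of the domain, so their relative advantage is 1 / full_cycles.
    uint64_t rollover    = domain % divisor;   // 36 for 16-bit, mod 100
    uint64_t full_cycles = domain / divisor;   // 655
    printf("rollover slots: %llu, per-slot advantage: %+.6f%%\n",
           (unsigned long long)rollover, 100.0 / full_cycles);

    // Simulation: roll many times and count hits per slot.
    std::mt19937_64 prng(12345);
    std::vector<uint64_t> hits(divisor, 0);
    for (uint64_t i = 0; i < 10'000'000; ++i) {
        uint64_t chunk = prng() % domain;      // emulate a 16-bit draw
        ++hits[chunk % divisor];
    }
    printf("slot 0: %llu hits, slot %llu: %llu hits\n",
           (unsigned long long)hits[0],
           (unsigned long long)(divisor - 1),
           (unsigned long long)hits[divisor - 1]);
    return 0;
}
```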

Note: these estimates are specific to modulo 100; the larger the divisor, the more bias might be introduced.

- Estimate(7, 100): 28.0000% bias distributed among the first 28 slots = +100.00% per slot (and 56.25% higher than they "should be"). That is, they are 100% more common than the remaining slots; they should have a 1% occurrence rate but have a 1.5625% occurrence rate.
- Estimate(8, 100): 28.0000% rollover distributed among the first 56 slots = +50.00% per slot.
- Estimate(9, 100): 2.4000% rollover distributed among the first 12 slots = +20.00% per slot.
- Estimate(16, 100): 0.054962% rollover distributed among the first 36 slots = +0.152672% per slot.
- ...

16-bit would give a 0.15% advantage to the first 36 (of 100) slots, according to the script's estimate.

A simulation (with a PRNG) agrees with the estimates. For the 8-bit case, the advantaged slots (the first 56) and the disadvantaged slots (the remaining 44) have counts in a 3:2 ratio:

[117629, 116908, 117805, 116994, 117325, 117222, 117301, 116825, 117209, 117226, 117509, 116902, 117180, 116959, 117054, 117258, 117271, 116855, 117045, 116822, 116827, 117337, 117681, 118018, 117899, 116841, 117433, 117274, 117466, 117203, 117528, 117369, 117353, 116943, 117720, 116872, 117489, 116859, 117366, 117220, 117222, 117677, 116890, 116833, 117050, 117089, 116987, 117223, 117669, 117172, 117473, 116939, 116496, 117622, 116594, 117042, 78012, 77736, 77997, 78200, 78368, 78115, 77603, 77756, 77750, 77893, 77895, 78194, 78055, 78281, 78257, 78616, 78178, 78032, 78068, 78557, 78258, 77800, 77931, 78083, 78091, 78243, 78054, 77622, 78130, 78559, 78123, 78122, 78042, 77818, 78200, 78415, 78465, 78070, 77537, 78166, 78239, 78059, 78024, 78411]

An example of breaking down the math for estimating the biased distribution: 128 raw values split between 28 "rollover" slots and 72 ordinary slots.

Biased version (128 mod 100 = 28):
- The first 100 raw values give each of the 100 slots 1 point: 28 points to the first 28 slots, 72 points to the last 72 slots.
- The final 28 raw values (the rollover, 128-100) give the first 28 slots 1 more point each. Normally/ideally these extra 28 points would have been distributed evenly across all 100 slots (+28/100 to each slot).
- 56/128 = 43.75% of raw values land in the first 28 slots (which now hold 56 points); 72/128 = 56.25% land in the last 72 slots (72 points).
- 43.75% / 28 = 1.5625% for each of the first 28 slots; 56.25% / 72 = 0.78125% for each of the remaining 72 slots.
- (Unbiased would be 1% per slot, for both the first 28 and the remaining 72 slots.)

Unbiased version (128 distributed evenly into 100 slots, i.e. 128/100):
- 128/100 = 1.28 points per slot; 1.28 out of 128 (1.28/128) = 1% per slot.
- 28% of 128 (35.84 points) goes to the first 28 slots; 72% of 128 (92.16 points) goes to the remaining 72 slots.
- 35.84 / 28 = 1.28 per slot; 1.28 / 128 = 0.01. Likewise 92.16 / 72 = 1.28 per slot; 1.28 / 128 = 0.01.
- Or: 35.84/128 = 0.28 and 92.16/128 = 0.72, etc.
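The same breakdown can be checked directly with a throwaway loop (just a verification sketch, not part of the proposed docs change):

```cpp
#include <cstdio>

int main() {
    const int domain = 128, slots = 100;
    int counts[slots] = {0};
    for (int raw = 0; raw < domain; ++raw) {
        ++counts[raw % slots];  // the first 28 slots end up with 2 points, the rest with 1
    }
    printf("slot 0:  %.5f%% of rolls\n", 100.0 * counts[0]  / domain);  // 1.56250
    printf("slot 99: %.5f%% of rolls\n", 100.0 * counts[99] / domain);  // 0.78125
    return 0;
}
```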

Even when using ORNG and the WAX RNG Service, modulo bias is something dapp developers should be aware of while developing with RNG. I suspect using 2 or more bytes would mitigate the issue in most cases, though there are still cases where a two-byte random_chunk could be problematic. Using a single byte with divisor = 100 is especially undesirable. If you wanted a range of 128 numbers (0-127, or 1-128, etc.), or 64 numbers, etc., I believe a single byte would be fine and free of bias. The problem is caused by misalignment and the resulting "slack" (basically the remainder): 128 is not a multiple of 100. It is mitigated by using a large datatype for random_chunk because it becomes rarer for random_chunk's value to fall in the incomplete final cycle (>= 65500 for a 16-bit uint with divisor 100). So the chance of a "bad roll" shrinks as more bits are available to random_chunk: 28/128, 56/256, 12/512, ..., 36/65536.
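To make the 28/128, 56/256, ... sequence concrete, a tiny sketch that prints the bad-roll fraction for each chunk width, assuming divisor 100:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t divisor = 100;
    for (unsigned bits = 7; bits <= 16; ++bits) {
        uint64_t domain = 1ull << bits;
        uint64_t bad    = domain % divisor;  // raw values in the incomplete final cycle
        printf("%2u bits: %llu / %llu raw values are rollover \"bad\" rolls (%.4f%%)\n",
               bits, (unsigned long long)bad, (unsigned long long)domain,
               100.0 * bad / domain);
    }
    return 0;
}
```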

Just to clarify, estimates following from this theory:

- (16 bits, mod 100) = ~549 out of 1_000_000 rolls are rollover "bad" rolls; 0.15% relative "advantage" to each rollover slot.
- (8 bits, mod 100) = 218_750 out of 1_000_000 rolls are rollover "bad" rolls; 50% advantage.
- (7 bits, mod 100) = 218_750 out of 1_000_000 rolls are rollover "bad" rolls; 100% advantage.
- (8 bits, mod 256) = 0 out of 1_000_000 rolls are rollover "bad" rolls; 0 advantage.
- (8 bits, mod 64) = 0 out of 1_000_000 rolls are rollover "bad" rolls; 0 advantage.
- (16 bits, mod 1000) = ~8_178 out of 1_000_000 rolls; 0.153846% advantage to each of the first 536 slots (out of 1000).
- (16 bits, mod 1024) = 0 out of 1_000_000 rolls.

rakeden commented 1 year ago

Please go ahead with this PR so that I can merge it into the docs @3dkrender @Mushini

rakeden commented 1 year ago

Thanks for this upgrade/addition! @Mushini