ircmaxell / RandomLib

A library for generating random numbers and strings
MIT License
843 stars 115 forks source link

7 character base62 collision #38

Closed kbond closed 9 years ago

kbond commented 9 years ago

Hello, I am using this library to generate unique 7 character base62 strings. My code is similar to the following:

$charset = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

return (new Factory())->getMediumStrengthGenerator()->generateString(7, $charset);

After seeing this issue: https://github.com/ramsey/uuid/issues/80 I was wondering if my system would have collisions as well. I tried a variation of this script and indeed, I had a collision after about 1.2 million generated strings. It is my understanding that there are over 2 trillion possibilities with my system.

I tried increasing the length to 10 (which would give like 400 quadrillion possibilities) and after 12 million keys there were no collisions.

I'm trying to understand why 7 isn't good enough.

Thanks.

ircmaxell commented 9 years ago

Because of the https://en.wikipedia.org/wiki/Birthday_problem

7 characters is approximately 42 bits. From that table you will have a 50% chance of collision after just 12000 generated.

Increasing to 10 characters increases entropy to 60 bits. Which increases that 50% number to around 3 million.

Longer is better in general...

Note that uuids are normally 128 bit, which is well outside of possibility of collision for any reasonable generated set.

Anthony On Sep 5, 2015 8:00 AM, "Kevin Bond" notifications@github.com wrote:

Hello, I am using this library to generate unique 7 character base62 strings. My code is similar to the following:

$charset = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';return (new Factory())->getMediumStrengthGenerator()->generateString(7, $charset);

After seeing this issue: ramsey/uuid#80 https://github.com/ramsey/uuid/issues/80 I was wondering if my system would have collisions as well. I tried a variation of this script https://github.com/ramsey/uuid/issues/80#issuecomment-135472796 and indeed, I had a collision after about 1.2 million generated strings. It is my understanding that there are over 2 trillion possibilities with my system.

I tried increasing the length to 10 (which would give like 400 quadrillion possibilities) and after 12 million keys there were no collisions.

I'm trying to understand why 7 isn't good enough.

Thanks.

— Reply to this email directly or view it on GitHub https://github.com/ircmaxell/RandomLib/issues/38.

kbond commented 9 years ago

Great, thanks for the detailed explanation.

kbond commented 9 years ago

Anthony, can you tell me how you make those calculations?

timoh6 commented 9 years ago

A practical way to look into this is to calculate how much resistance there is before finding a collision (for strong randomness).

This can be calculated as 2^(bits/2), ie. if you have 60 bits of entropy, you would see a collision after ~2^30 outputs.

As your charset length is 62 characters, the entropy it gives (per character) is log2(62) and thus the 10 character string would give 10 * log2(62) bits of entropy.

With this information you can calculate the "upper bound" of outputs before a collision approximately occurs:

2^(10 * log2(62) / 2) = 2^29.77098 = 916 131 846

So it's no wonder there were no collision in the set of 12 million keys.

But with the 7 character code the resulting number would be ~1 876 595, so you got a bit "lucky" by hitting a collision already in 1.2 millions tries (however, with big enough numbers luck doesn't practically matter anymore).

kbond commented 9 years ago

Awesome, thanks @timoh6!