Closed kbond closed 9 years ago
Because of the https://en.wikipedia.org/wiki/Birthday_problem
7 characters is approximately 42 bits. From that table you will have a 50% chance of collision after just 12000 generated.
Increasing to 10 characters increases entropy to 60 bits. Which increases that 50% number to around 3 million.
Longer is better in general...
Note that uuids are normally 128 bit, which is well outside of possibility of collision for any reasonable generated set.
Anthony On Sep 5, 2015 8:00 AM, "Kevin Bond" notifications@github.com wrote:
Hello, I am using this library to generate unique 7 character base62 strings. My code is similar to the following:
$charset = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';return (new Factory())->getMediumStrengthGenerator()->generateString(7, $charset);
After seeing this issue: ramsey/uuid#80 https://github.com/ramsey/uuid/issues/80 I was wondering if my system would have collisions as well. I tried a variation of this script https://github.com/ramsey/uuid/issues/80#issuecomment-135472796 and indeed, I had a collision after about 1.2 million generated strings. It is my understanding that there are over 2 trillion possibilities with my system.
I tried increasing the length to 10 (which would give like 400 quadrillion possibilities) and after 12 million keys there were no collisions.
I'm trying to understand why 7 isn't good enough.
Thanks.
— Reply to this email directly or view it on GitHub https://github.com/ircmaxell/RandomLib/issues/38.
Great, thanks for the detailed explanation.
Anthony, can you tell me how you make those calculations?
A practical way to look into this is to calculate how much resistance there is before finding a collision (for strong randomness).
This can be calculated as 2^(bits/2), ie. if you have 60 bits of entropy, you would see a collision after ~2^30 outputs.
As your charset length is 62 characters, the entropy it gives (per character) is log2(62) and thus the 10 character string would give 10 * log2(62) bits of entropy.
With this information you can calculate the "upper bound" of outputs before a collision approximately occurs:
2^(10 * log2(62) / 2) = 2^29.77098 = 916 131 846
So it's no wonder there were no collision in the set of 12 million keys.
But with the 7 character code the resulting number would be ~1 876 595, so you got a bit "lucky" by hitting a collision already in 1.2 millions tries (however, with big enough numbers luck doesn't practically matter anymore).
Awesome, thanks @timoh6!
Hello, I am using this library to generate unique 7 character base62 strings. My code is similar to the following:
After seeing this issue: https://github.com/ramsey/uuid/issues/80 I was wondering if my system would have collisions as well. I tried a variation of this script and indeed, I had a collision after about 1.2 million generated strings. It is my understanding that there are over 2 trillion possibilities with my system.
I tried increasing the length to 10 (which would give like 400 quadrillion possibilities) and after 12 million keys there were no collisions.
I'm trying to understand why 7 isn't good enough.
Thanks.