Add brief explainer of how unlikely collisions are

bcoe commented 5 years ago

In reading TIFU Math.random(), I came across this statistic for the approach being used by betable:

With 2¹³² possible values, if identifiers were randomly generated at the rate of one million per second for the next 300 years the chance of a collision would be roughly 1 in six billion.

If we could work out a similar statistic for UUID V4, generated using a CSPRNG, it might be a neat little note to add into the README; as a way of reassuring folks who are concerned about randomness, and wish there was a clock in the mix.

broofa commented 5 years ago

Found some useful stats and equations at https://en.wikipedia.org/w/index.php?title=Universally_unique_identifier&oldid=755882275#Random_UUID_probability_of_duplicates

broofa commented 5 years ago

Using the approximation from that article:

P(n) ≈ n²/(2 * 2^x)

For x = 122 (v4 UUIDs have 122 bits of entropy), and n = 9.46x10¹⁵ (1M uuids / second * 300 years, per the Betable example), P(n) = 8.4x10^-6... or one in 8.4 million chance of collision.

bcoe commented 5 years ago

@broofa neat :smiley: for me, the 100 year timeline (as used in the Wikipedia article) seems less arbitrary than 300 (what are the odds of a collision in the upper-bound of a human lifespan).

ctavan commented 5 years ago

I think this was mostly resolved with #17.

However during a discussion in https://github.com/microsoft/azure-pipelines-tasks/pull/11021#issuecomment-519059402 I realized that it might be worthwhile to actually add a comparison with regards to how unlikely a collision in v4 UUIDs are vs. how unlikely a collision in v1 UUIDs are? WDYT?

broofa commented 5 years ago

That would be an interesting comparison, but it's complicated by the fact that the [theoretical] odds of collision for v1 ids depend on some rather churlish aspects of the RFC.

For version 4, collision probability is pretty easy. The theoretical probability of two UUIDs colliding, P_c, is:

P_c = 1 / 2^{(# of bits of entropy)}

I.e., for v4, where there are 122 bits of entropy, P_c = ~1.8e-37

For version 1, however:

P_c = P_t X P_n

Where:

P_t: probability IDs are generated in the same 100-nanosecond time interval P_n: probability IDs share the same node

Unfortunately, neither P_t nor P_n are particularly easy to quantify. For P_t, the RFC does require that UUID services should generate an error if they expect to generate more than one ID in a given 100-nsec time interval, but realistically that's only enforceable within a single thread/process on a single host. Enforcing this across multiple processes / hosts requires non-trivial architectures that run counter to the main thesis the UUID spec: "One of the main reasons for using UUIDs is that no centralized authority is required to administer them".

So, in practice, P_t is non-zero. For example, in our README scenario where IDs are generated at a rate of 1M/second, P_t = 0.1 ... which ain't great, and causes us to turn the spotlight on P_n, the probability the IDs share the same node value.

Unfortunately the mechanism for generating node values preferred by the RFC is to use the host system's IEEE 802 MAC address. This made sense back when the RFC was authored and MAC addresses could reasonably be expected to be unique, but this is arguably no longer the case, not with the proliferation of virtual machines and containers where MAC addresses may not be unique by design.

Fortunately the RFC provides an escape hatch - the node can be randomly generated if desired. However, a random node only provides 48 bits of entropy, giving P_n = 3.6e-16.

If we extrapolate this onto our README scenario (1M UUIDs/second), with an expected P_c = 3.6e-17 (0.1 X 3.6e-16 ) for v1 uuids, we find that instead of it taking 327 years to get to a 1-in-a-million chance of collision, it only takes about 30 seconds. :-/

... And what we do with that knowledge I don't know. In hindsight, I don't really know if this is a useful comparison (and, btw, someone should probably check my logic and #'s), but this is about all the energy I have for thinking about this at the moment.

ctavan commented 5 years ago

Thanks for this excellent reasoning, @broofa!

I get a result off by one decimal for the random node with 48 bits of entropy:

> 1/Math.pow(2, 48)
3.6e-15

So P_c = 0.1 x 3.6e-15 = 3.6e-16.

Looking at the current readme again I'm wondering whether we're off by 10 there as well?

Generating 1M UUID/s for 327 years yields a collision probability according to the approximated equation:

> Math.pow((1e6 * 3600 * 24 * 365 * 327), 2)/(2 * Math.pow(2, 122))
0.000010000443315519015

Which would be 10 in a million, not 1 in a million.

If we extrapolate this onto our README scenario (1M UUIDs/second), with an expected Pc = 3.6e-17 (0.1 X 3.6e-16 ) for v1 uuids, we find that instead of it taking 327 years to get to a 1-in-a-million chance of collision, it only takes about 30 seconds. :-/

Could you briefly explain how you arrive at 30 seconds given P_c?

Where does all that takes us? Well, all I want to do is to counter the idea that v1 UUIDs are guaranteed to be collision-free. I want to make clear that given the fact that there's no practical way of ensuring node uniqueness, also for v1 there are chances of collisions in practice and that those are even higher with v1 UUIDs than with v4.

I'll propose a new FAQ item once we have our numbers straight.

broofa commented 5 years ago

Yeah, apparently I screwed up all my math. Not sure what happened. 'Guess I was more tired than I thought. 😦

So P_c = 0.1 x 3.6e-15 = 3.6e-16.

Right you are.

Math.pow((1e6 3600 24 365 327), 2)/(2 * Math.pow(2, 122)) 0.000010000443315519015

Which would be 10 in a million, not 1 in a million.

Yup, my bad.

In an attempt to not screw this up a second time, I've put together a birthday problem repl.it for us to reason with. It uses the taylor series approximation for the birthday problem.

It shows that P = .000001 after 104 years for v4 @ 1M uuids/second. It also shows that for our updated v1 probability, it takes all of 75 milliseconds (not 30 seconds!) to hit that same probability.

Could you briefly explain how you arrive at ~~30 seconds~~ 75 milliseconds given P_c?

First, yes, 75 msecs is not 30 seconds. 'Not sure how I got that 30 second number. However the basic rationale is this: Since P_c - the odds of two UUIDs colliding - is equal to 1 / (effective size of UUID space), we can just substitute 1/P_c in where we had Math.pow(2, 122) for v4.

... which is what I've done in the repl. Hopefully that's the right approach.

Stepping back, this feels weird. Like... really? If I spin up two independent v1 uuid services and start generating 1M ids/second, I don't actually expect to get duplicates right away. That's just silly. But the reason this feels unintuitive is because the odds of them having duplicate nodes are low (1 in 2⁴⁸). The problem is, that one time you're unlucky and end up with the same node on both services, 10% of the ids - 100K/second - are colliding, so you get a LOT of collisions.

I think this is why contrasting v4 and v1 collision probabilities is difficult. v4 has this miniscule probability of collision for each and every UUID produced. But for v1, the odds of collision are exactly zero... except for the very rare case when it's a near certainty! As I think I said before, they have very different natures.

Does this make sense? I'm just ruminating out loud here... trying to convince myself that this is valid comparison, I guess.

ctavan commented 5 years ago

@broofa I think your reasoning makes sense and I have tried to compress it into a new FAQ item.

tc39 / proposal-uuid

Add brief explainer of how unlikely collisions are #15