gofrs / uuid

A UUID package for Go
MIT License
1.58k stars 111 forks source link

V7 and project status #106

Closed bgadrian closed 1 year ago

bgadrian commented 1 year ago

Hello,

What is the status of this repo? I see some PRs unmerged 1y old. I am interested to update or write a new implementation for V6-7 UUIDs (v4 draft including the distributed requirements) and wondering if I can contribute to this project or create another one.

cameracker commented 1 year ago

Hi @bgadrian , the PRs that are open are either controversial or are waiting for the correct reviewers. We've been merging the v6 and v7 updates we've received from contributors.

cameracker commented 1 year ago

Closing issue because it's a more of a question

dylan-bourque commented 1 year ago

The repo is definitely active but the functionality is stable so there's not a lot of activity. There have been 2 PRs with implementations of the drafts for V6 and V7 UUIDs, #93 and #99. If there's been an update since those then a PR with relevant changes would be welcome.

bgadrian commented 1 year ago

Thanks for the quick answer.

I think the V7 especially from #99 does not fully implement even the v3 of the draft, more specifically

Additionally, care MUST be taken to ensure UUIDs generated in batches are also monotonic https://www.ietf.org/archive/id/draft-peabody-dispatch-new-uuid-format-03.html#monotonicity_counters

Meaning that only relying on the nanosecond from time may not be enough. Especially on commodity hardware in clouds, the nanoseconds increment are not guaranteed, and generating multiple UUIDs in the same MS could lead to collisions.

In Draft v4 there are more options defined in how to counteract this https://www.ietf.org/archive/id/draft-peabody-dispatch-new-uuid-format-04.html#monotonicity_counters and/or https://www.ietf.org/archive/id/draft-peabody-dispatch-new-uuid-format-04.html#distributed_shared_knowledge

I am afraid that trying to implement these features will break the current UUID structure simplicity, as it would require some sort of internal in-memory state (counter) and/or new parameters (nodeID), so I would want to get some opinions before opening a PR on one approach, and define a design agreed by the current project maintainers.

One valid way would be to keep it simple and say that this implementation will not support "Distributed Nodes" and only add support for single node (Monotonicity and Counters), or not ...

dylan-bourque commented 1 year ago

Meaning that only relying on the nanosecond from time may not be enough. Especially on commodity hardware in clouds, the nanoseconds increment are not guaranteed, and generating multiple UUIDs in the same MS could lead to collisions.

Those are fair concerns @bgadrian. I'd counter, though, with the fact that there is no API in this library for generating batches of UUIDs. Each call to NewVN() generates and returns a single value. And nanoseconds is the lowest granularity we can get in Go so I'm not sure what other options there are.

this implementation will not support "Distributed Nodes"

I'd definitely argue for this path. There's definitely nothing in this library today related to synchronized generation across multiple machines.

I am afraid that trying to implement these features will break the current UUID structure simplicity

IMO the simplicity is one of the advantages of this library so I would be very hesitant to start down a path that would lead to breaking that.

For what it's worth, I've used this library to generate sometimes millions of UUIDs per second and have never run into a collision.

bgadrian commented 1 year ago

I don't think batches are supported per se in any implementation, but rather generating a large amount of UUIDs in a small amount of time can be considered a batch (example an insert baching algorithm that shuffles the inserts each X seconds).

One simple implementation for monotonic UUIDs would be to have an internal counter that resets at each ms (lowest V7 granularity) and increments at each generated UUID. From the client point of view/the interface will remain the same (simple). I think this is explained here https://www.ietf.org/archive/id/draft-peabody-dispatch-new-uuid-format-04.html#section-6.2

I'd definitely argue for this path. There's definitely nothing in this library today related to synchronized generation across multiple machines.

Actually there is, the V3 HardwareAddress NewGenWithHWAF( that can be injected at build serve also this purpose, to avoid collision across multiple nodes. This scenario is also decribed in V7

Node IDs: With this method, a pseudo-random Node ID value is placed within the UUID layout. This identifier helps ensure the bit-space for a given node is unique, resulting in UUIDs that do not conflict with any other UUID created by another node with a different node id.

I think we can easily implement this by generating a random UUID at construct (or singleton per process) and inject it in each generated UUID V7.

These 2 simple additions can guarantee the monotonic and local uniqueness (and avoid collisions) properties while keeping the interface simple.

dylan-bourque commented 1 year ago

Actually there is, the V3 HardwareAddress NewGenWithHWAF( that can be injected at build serve also this purpose

While that's theoretically possible, injecting a function that does RPCs to other machines to do synchronization definitely goes against the spirit of that API.

Again, though, we'll definitely review a PR to close any gaps in V6 or V7 UUIDs.

bgadrian commented 1 year ago

I am not aware of any RPC or cross process communication and I was not suggesting one. I was only referring to the existing code NewGenWithHWAF https://github.com/gofrs/uuid/blob/master/generator.go#L149 . MacAddress exists in those UUIDs to eliminate the collisions between processes. Same principle was applied in new UUIDs but instead of Mac address (which is a security issue) suggest to generate a random set of bytes for each process. Sure, thanks! I'll try and play around

convto commented 1 year ago

I so glad to see this discussion!

This was also a point I was difficult to decide on a policy for implementation when I submitted PR #99. (I intentionally did not implement and discuss related to batch generation at that time in order to narrow the scope of the PR) I think there is a certain value in sharing here what I considered at the time.

First, the number of random bits in UUIDv7 without awareness of monotonic counter is 74 bits.

Next, when the number of random bits is N and the collision probability P=0.5, the number of trials required can be calculated as 2^(N/2). I have often seen this value used in discussions of UUIDs and other unit time generation capabilities. (For background on this calculation, the following article is helpful: https://auth0.com/blog/birthday-attacks-collisions-and-password-strength/ )

From these, UUIDv7, which is unaware of the monotonic counter, has a 74-bit random, so its generation capacity can be calculated to be 2^37 when P=0.5, that is 137.4 billion (Incidentally, the generation capacity of UUIDv4 at P=0.5 is 2^61, that is 2.3 quintillion).

I think that a generation capacity of 137.4 billion in msec is extremely unlikely to cause a collision in almost all use cases (even for batch applications).

In the Rev04 draft, the specification for monotonic counters has been changed to should https://www.ietf.org/archive/id/draft-peabody-dispatch-new-uuid-format-04.html#name-monotonicity-and-counters

I didn't implement it at the time for these reasons, but it's important for library users to have a lot of options, so if you could implement it, that would be great!