Thought-exercise / proposal: An 86-bit clock sequence

broofa commented 3 years ago

[Edit: 'did my math wrong. "90 bit" -> "86 bit" throughout]

Observation: Clock precision is not the same as clock accuracy. Even in systems that provide micro or nano-second precision timestamps, the actual time (when compared to "true" USNO naval time) is unlikely to be accurate to more than a few 10's of milliseconds. I.e. Where UUIDs are concerned subsecond timestamp information is arguably indistinguishable from a simple counter. This is evidenced by RFC4122's provision for low-precision system clocks:

A high resolution timestamp can be simulated by keeping a count of the number of UUIDs that have been generated with the same value of the system time, and using it to construct the low order bits of the timestamp.

Observation: In version 1, the main reason for having a distinction between node and clock_seq was because node was expected to be a MAC address. However, that's proven to be a poor design choice and we all agree that a random node is now preferred. As long as there's some reasonably stable, random bits somewhere in the UUID, there's not really much reason for an explicit node field.

Observation: v1 clock_seq is initialized randomly, and incremented as needed to insure uniqueness and monotonicity. It also has desirable behavior around system clock regressions.

With the above in mind, we can expand theclock_seq field to include all of the bits fromnode as well as the bits needed for subsecond timestamp precision:, giving us a UUID layout that looks something like this:

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            unixts                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|unixts |         clock_seq     |  ver  |        clock_seq      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|var |            clock_seq     |                clock_seq      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           clock_seq                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

ver and var fields are as we've seen previoiusly in RFC4122.

unixts field is the 36-bit unsigned integer, unix epoch, 1-second resolution, field proposed elsewhere. (Allows for dates from 1970 to the year 4147).

clock_seq is [effectively] an 86-bit, unsigned integer, encoded most-significant bits first, with ver and var fields injected in their usual positions.

clock_seq behaves as follows:

clock_seq bits are all initialized randomly, with the exception of the first (high-order) bit, which is set to 0.
clock_seq may be re-used for successive uuids, with the following provisions (it's worth noting that this behavior is consistent with how clock_seq works in v1 uuids):
- If unixts is unchanged (i.e. in the same clock cycle), clock_seq MUST be incrementied by one (1)
- If unixts has advanced (i.e. a new clock cycle) and clock_seq's high-order bit is zero, no change is needed
- If unixts has advanced (i.e. a new clock cycle) and clock_seq's high-order bit is one, reinitialize as described above
- If unixts regresses, reinitialize as described above

(Note; The high-order bit logic prevents clock_seq from rolling over mid clock-cycle, which might otherwise break monotonicity.)

Observations

UUIDs are guaranteed to be monotonic and sort in time-creation order (maximizes DB locality)

The high-order bits of clock_seq serve the roughly the same purpose as v1's node - they provide a uniqueness guarantee across different systems for uuids generated with the same timestamp. Which bits are considered "high-order" and "stable" vs "low-order" and "dynamic" depends on the uuid creation rate and what the definition of "stable" is. For example, this table shows the number of high order bits that will remain "stable" for 1 day and 1 year at the given creation rates:

uuid creation rate	# of stable bits (1 day)	# of stable bits (1 year)
> 1/sec	67	59
1,000 / sec (~1K)	57	49
1,000,000 / sec	47	39
1,000,000,000 / sec	37	29

edo1 commented 3 years ago

If unixts is unchanged (i.e. in the same clock cycle), clock_seq MUST be incrementied by one (1)

It will suffer from the same problem as the original ULID: the next/previous value is likely to be predicted from the known identifier. To avoid this, there is always a random part in my proposal #34;
The one-second resolution is not enough for ordering identifiers in the case of lock-free concurrent generation;
Most significant bit of clock_seq will almost always be zero, so only 85 bits of true randomness. IMHO this is too little for global uniqueness.

bradleypeabody commented 3 years ago

I think this boils down to the same thing, the arguments brought up are essentially application-specific points:

Where UUIDs are concerned subsecond timestamp information is arguably indistinguishable from a simple counter

Totally depends on the environment. If you're on an IoT device without a real-time clock, what you state is probably totally right (in fact it's probably even less accurate). But if you work in aerospace and are dealing with timestamps from GPS satellites and are taking into account transmission time and relativistic effects, we cannot make the assumptions you're stating. It depends on the application.

the next/previous value is likely to be predicted from the known identifier.

Some applications care about how predictable generated values are, others couldn't care less.

unix epoch, 1-second resolution

There is very little difference between this and and using a 1-second regular unix timestamp, multiplying it by 1 billion to convert to nanoseconds, and add a random number between 1 and a billion to it. And I think this is a totally fine and correct idea for some applications and should definitely be allowed. The difference is that if an application does in fact need nanosecond precision, or anywhere close to it, it is available. Totally agreed that the least significant parts can be filled in with random or clock sequence if it's not needed for a particular case.

These are all tradeoffs that I don't see a single answer to. Which is why I lean toward making each of these things possible to do, so applications can still "implement UUID correctly" and cater to application needs.

@broofa What you're describing, if you're willing to use the other UUIDv7 layout, is accomplish-able and a correct implementation. Essentially there is nothing that stops you from using the "random" part as a big clock sequence like you describe above - IF that's what you need for your application.

broofa commented 3 years ago

Where UUIDs are concerned subsecond timestamp information is arguably indistinguishable from a simple counter

Totally depends on the environment. If you're on an IoT device without a real-time clock, what you state is probably totally right (in fact it's probably even less accurate). But if you work in aerospace and are dealing with timestamps from GPS satellites and are taking into account transmission time and relativistic effects, we cannot make the assumptions you're stating. It depends on the application.

My concern is that the provisions in the spec that allow for filling timestamps with random or sequence bits to "simulate" high precision necessarily undermine the utility of the timestamp information. Upon receiving an id, you have no way of knowing how accurate it actually is. Your choices are to assume its low precision (i.e. don't depend on it being accurate), or bake the assumption that it's accurate into your system. But in the latter case, the stricter your assumption is, the more likely it is to be wrong.

Does that make sense? It's like... well... JavaScripts Math.random(), where there's no specification about the quality of the underlying RNG. You can use it, just don't expect it to work.

Basically, if you're going to argue that it's not necessary to specify cryptographic quality RNGs in UUIDs, you shouldn't be simultaneously arguing for high-precision timestamps.

edo1 commented 3 years ago

Upon receiving an id, you have no way of knowing how accurate it actually is

Is there a reason why a UUID receiver might need to know the accuracy of the clock?

nerg4l commented 2 years ago

This proposal is probably based on ULID or Push ID. I really like how it provides monotonicity in case of overflow but I also have concerns on the time precision.

@edo1

It will suffer from the same problem as the original ULID: the next/previous value is likely to be predicted from the known identifier.

Maybe I understand this part incorrectly of RFC 4122 but it says in Section 6 "Do not assume that UUIDs are hard to guess;".

Most significant bit of clock_seq will almost always be zero, so only 85 bits of true randomness. IMHO this is too little for global uniqueness.

How many bits of randomness is enough for "global uniqueness"? I could be wrong but according to my logic that 85 bits are there to make the UUID unique in a second and not there to make it "globally unique" because the time and random value together will make it "globally unique".

edo1 commented 2 years ago

Maybe I understand this part incorrectly of RFC 4122 but it says in Section 6 "Do not assume that UUIDs are hard to guess;"

If it is possible to create "hard to guess" UUID, we should. This does not make things worse if unpredictable UUIDs are not required.

How many bits of randomness is enough for "global uniqueness"?

Just added this proposal to my spreadsheet As a reminder, the annual collision probability is calculated when ~~a new UUID is generated every nanosecond~~ 10⁹ new UUIDs are generated every second. Probablilty is 33.5% for this proposal, 0.27%/2.11% for #34

bradleypeabody commented 2 years ago

@broofa Circling back around to this:

It's like... well... JavaScripts Math.random(), where there's no specification about the quality of the underlying RNG. You can use it, just don't expect it to work.

I think that argument has the most merit.

I don't think it particularly matters how accurate timestamps are from one system to another, simply because this is always going to be uncontrollable - clock skew, systems without or broken NTP, varying clock precisions - it's just not something that can realistically be controlled, at least not for most implementations. So no matter what the spec says about accuracy, it doesn't mean much if the physical system hosting the implementation doesn't match or comply with it.

That said, after looking through this more carefully, I think the argument that basically goes: "Don't pick a format that will usually have bad/wrong information on most systems" holds a lot of weight.

In doing some testing and research, it looks like at least some modern systems are only microsecond-precise (Mac laptop). And even when more precise numbers are available, unless the code implementing the timestamp uses certain hardware features very carefully it's easy to get numbers that are "nanosecond-precise" but not accurate. So the extra orders of magnitude between microseconds and nanoseconds would probably end up causing more implementation hassle than utility (implementations would normally be filling this in with random data, and almost never be using it communicate a more accurate time).

And so then it raises the question of how much precision is a good combination of useful and will normally be accurate on most systems. Which then raises another question about the exact format. I'm going to bridge that over to https://github.com/uuid6/uuid6-ietf-draft/issues/27

broofa commented 2 years ago

@bradleypeabody Thanks for following up. I appreciate the diligence you've shown in shepherding this spec.

So the extra orders of magnitude between microseconds and nanoseconds would probably end up causing more implementation hassle than utility

Yes, exactly! The hassle of having to figure out how to split the timestamp into msecs and nsecs, and manipulate those fields in a way that didn't inadvertently lose timestamp precision in the process, is why I shudder every time I see "nanosecond precision", or even "microsecond".

kyzer-davis commented 2 years ago

@broofa , since we moved this to #27 is this one okay to close?

uuid6 / uuid6-ietf-draft

Thought-exercise / proposal: An 86-bit clock sequence #35

Observations