Discussion: Redefine variant bit (111) definition

kyzer-davis commented 3 years ago

Continuation of separate #24 thread

Question: Should we redefine UUID variant bits 111 (E/F) which are currently "Reserved for future definition." as per the original RFC 4122 Section 4.1.1 source?

Proposal:

In Draft 02 set definition of any UUID Version (UUIDv1/2/3/4/5/6/7/8) + Variant E (111) as a method for signaling an alternative bit layout to any previously defined UUID.
With that precedent set UUIDv6 could be converted to UUIDv1 (0001) + Variant E/F (111) as a method to signal the alternative encoding for UUIDv1 labeled as UUIDv1ε.
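For context, here is a minimal sketch of how the variant is read today, following the bit patterns in RFC 4122 Section 4.1.1. The names `Variant` and `variantOf` are illustrative only and come from neither draft; the sketch assumes the UUID is held as 16 bytes with the variant in the top bits of octet 8 (zero-based).

```go
package main

// Variant identifies the top-level UUID layout family (RFC 4122, Section 4.1.1).
type Variant int

const (
	VariantNCS       Variant = iota // 0xx: NCS backward compatibility
	VariantRFC4122                  // 10x: the layout RFC 4122 specifies
	VariantMicrosoft                // 110: Microsoft backward compatibility
	VariantReserved                 // 111: reserved for future definition
)

// variantOf inspects the most significant bits of octet 8 (zero-based).
func variantOf(u [16]byte) Variant {
	b := u[8]
	switch {
	case b&0x80 == 0x00:
		return VariantNCS
	case b&0xC0 == 0x80:
		return VariantRFC4122
	case b&0xE0 == 0xC0:
		return VariantMicrosoft
	default:
		return VariantReserved
	}
}
```

Under this reading, any UUID whose octet 8 starts with hex digit E or F falls into the reserved 111 variant that this issue is about.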

Possible text changes depending on feedback from this issue and #24

As such, Draft 01 goes from the current three definitions:

Name	Version	Variant	Description
UUIDv6	0110	10x (8/9/A/B)	UUIDv1 with Re-ordered Gregorian timestamp, explicit start sequence counter, no MAC address
UUIDv7	0111	10x (8/9/A/B)	36-bit Unix epoch timestamp, variable subsecond encoding up to nanoseconds using floating point math and fractions (38 bits allocated to subsecond precision).
UUIDv8	1000	10x (8/9/A/B)	Relaxed implementation, any timestamp goes, future proof the specification, 122 bits to do as you desire with general guidelines of timestamp, sequence, random in that order

To the possible four definitions in Draft 02:

Name	Version	Variant	Description
UUIDv1ε	0001	111 (E/F)	UUIDv1 with Re-ordered Gregorian timestamp, explicit start sequence counter, no MAC address. (What is UUIDv6 in draft 01)
UUIDv6	0110	10x (8/9/A/B)	36-bit Unix epoch timestamp, variable subsecond encoding up to nanoseconds using floating point math and binary fractions (38 bits allocated to subsecond precision). (What is UUIDv7 in draft 01)
UUIDv6ε	0110	111 (E/F)	UUIDv6 with 36-bit Unix epoch timestamp, variable subsecond encoding up to nanoseconds using integers to represent total number of subseconds (30 bits allocated to subsecond precision). (Did not exist in draft 01)
UUIDv7	0111	10x (8/9/A/B)	Relaxed implementation, any timestamp goes, future proof the specification, 122 bits to do as you desire with general guidelines of timestamp, sequence, random in that order. (What was UUIDv8 in draft 01)
UUIDv8	1000	10x (8/9/A/B)	Goes away in draft 02 as it is no longer required.

nerg4l commented 3 years ago

Could you clarify for me what "variable subsecond encoding up to nanoseconds using floating point math and fractions" and "variable subsecond encoding up to nanoseconds using integers to represent total number of subseconds" mean, and what the difference between them is?

kyzer-davis commented 3 years ago

@nerg4l, the topic being discussed in #24 is whether we keep the current UUIDv7 and just rename it to UUIDv6 (since UUIDv6 becomes UUIDv1ε). Then we can also add an alternative encoding, UUIDv6ε, that uses the 30-bit integer form without any floating point math or binary fraction encoding. We get the best of both worlds. UUIDv7 becomes what was UUIDv8, and I drop UUIDv8 from the draft.

I edited the table to add more clarity.
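To make the difference concrete, here is a rough sketch of the two subsecond encodings as I read them from the table. It is not text from either draft. The function names are made up, and the 38-bit and 30-bit widths are taken from the table rows above.

```go
package main

import "math/bits"

const nanosPerSecond = 1_000_000_000

// fractionEncode maps nanoseconds within a second onto a 38-bit binary
// fraction: floor(nanos / 1e9 * 2^38). The 128-bit intermediate product
// avoids overflow. Callers are expected to pass nanos < 1e9.
func fractionEncode(nanos uint32) uint64 {
	hi, lo := bits.Mul64(uint64(nanos), 1<<38)
	q, _ := bits.Div64(hi, lo, nanosPerSecond)
	return q
}

// integerEncode stores the nanosecond count as-is. Any value below 1e9
// fits in 30 bits because 1e9 < 2^30.
func integerEncode(nanos uint32) uint64 {
	return uint64(nanos)
}
```

With the fraction form, half a second is exactly 1<<37 (binary 0.1). With the integer form it is simply 500000000. The integer form needs fewer bits for nanosecond precision and no scaling on decode.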

bradleypeabody commented 3 years ago

@kyzer-davis I was thinking, for simplicity, we would only define a meaning for the variants + versions that we actually want. Meaning basically that UUIDv6 I think stays as it is with the old variant, and UUIDv7 and 8 use the new 111 variant. Otherwise I think we have too much variation without any real benefit.

nerg4l commented 3 years ago

Thanks for the clarification.

I don't think it makes sense to create UUIDv6 and UUIDv6ε. Instead of pleasing everyone we should have a clear decision on which one to have. This would simplify implementations by having one less UUID to implement and would help keep the RFC less complex. Also UUIDv6ε would probably only apply to nanosecond precision.

nerg4l commented 3 years ago

I looked more into this and found two things which should be taken into consideration.

I'm not sure if it is relevant in case of extending the RFC but ITU also has a UUID definition in X.667. Which states the following:

11.2 All UUIDs conforming to this Recommendation | International Standard shall have variant bits with bit 7 of octet 7 set to 1 and bit 6 of octet 7 set to 0. Bit 5 of octet 7 is the most significant bit of the Clock Sequence and shall be set in accordance with 12.4.

NOTE – Bit 5 is listed here as a variant bit because its value distinguishes historical formats. Strictly speaking, it is not part of the variant value for this Recommendation | International Standard, which uses only two bits for the variant.

I also checked what a variant#0 (NCS) UUID looked like. It seems it does not have version bits. https://opensource.apple.com/source/CF/CF-299.35/Base.subproj/uuid.c.auto.html

 * Internal structure of variant #0 UUIDs
 *
 * The first 6 octets are the number of 4 usec units of time that have
 * passed since 1/1/80 0000 GMT.  The next 2 octets are reserved for
 * future use.  The next octet is an address family.  The next 7 octets
 * are a host ID in the form allowed by the specified address family.
 *
 * Note that while the family field (octet 8) was originally conceived
 * of as being able to hold values in the range [0..255], only [0..13]
 * were ever used.  Thus, the 2 MSB of this field are always 0 and are
 * used to distinguish old and current UUID forms.
 *
 * +--------------------------------------------------------------+
 * |                    high 32 bits of time                      |  0-3  .time_high
 * +-------------------------------+-------------------------------
 * |     low 16 bits of time       |  4-5               .time_low
 * +-------+-----------------------+
 * |         reserved              |  6-7               .reserved
 * +---------------+---------------+
 * |    family     |   8                                .family
 * +---------------+----------...-----+
 * |            node ID               |  9-16           .node
 * +--------------------------...-----+
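
The layout in that comment can be read back with a short sketch. This is only an illustration of the field positions described there. The names `ncsFields` and `parseNCS` are hypothetical, and big-endian byte order is my assumption, since the comment does not state it.

```go
package main

import (
	"encoding/binary"
	"time"
)

// ncsFields mirrors the variant #0 layout from the quoted comment.
type ncsFields struct {
	Time     time.Time // 48-bit count of 4 µs units since 1980-01-01 00:00 GMT
	Reserved uint16    // octets 6-7
	Family   uint8     // octet 8; top two bits are always 0 for this variant
	Node     [7]byte   // host ID, 7 octets starting at octet 9
}

// parseNCS splits a 16-byte value into the fields above.
// Assumption: multi-byte fields are stored big-endian.
func parseNCS(u [16]byte) ncsFields {
	ticks := uint64(binary.BigEndian.Uint32(u[0:4]))<<16 |
		uint64(binary.BigEndian.Uint16(u[4:6]))
	epoch := time.Date(1980, time.January, 1, 0, 0, 0, 0, time.UTC)

	var f ncsFields
	f.Time = epoch.Add(time.Duration(ticks) * 4 * time.Microsecond)
	f.Reserved = binary.BigEndian.Uint16(u[6:8])
	f.Family = u[8]
	copy(f.Node[:], u[9:16])
	return f
}
```

Nothing in this layout corresponds to a version field, which matches the observation above.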

Unfortunately, I could not find any specification of Microsoft's variant#2.

A lot of people refer to UUIDs defined by RFC 4122 as variant#1 version x UUIDs. RFC4122 also states the following about variant and version:

[...] The UUID format is 16 octets; some bits of the eight octet variant field specified below determine finer structure. [...]

[...] As such, it [variant] could more accurately be called a type field; we retain the original term for compatibility. [...]

[...] The version is more accurately a sub-type; again, we retain the term for compatibility. [...]

Therefore, I assume using variant#3 would allow redefining the structure entirely: moving or removing the version from the definition, for example. Current implementations of RFC4122 should probably be ignored, because keeping BC looks impossible.

edo1 commented 3 years ago

I assume using variant#3 would allow redefining the structure entirely. Moving or removing version from the definition for example

It is the last unused variant, so there is no room for error.

fabiolimace commented 3 years ago

I also think the version bits (subtype) are specific to the RFC4122 variant (type), which has many subtypes that must be separated from each other. Variant 3 (111) doesn't even have a structure yet.

This file appears to be the basis for Apple's implementation: https://github.com/BeyondTrust/pbis-open/blob/master/dcerpc/uuid/uuid.c

broofa commented 3 years ago

Using the variant field to signal different bit semantics within RFC 4122 versions is not appropriate. The variant field is the overarching field that dictates layout and semantics of all other bits in a UUID. RFC4122 is very deliberately scoped to just variant == 0b10x. Hell, version isn't even defined outside of that specific variant.

I believe a more correct approach would be to use a different version to distinguish between timestamp encodings, much the way v3 and v5 distinguish between namespace hash algorithms.

And, yes, this would effectively double the number of new UUID versions being proposed, which is not ideal. This is one reason I'd like to see this proposal culled back to just 1 new timestamp version, per #30.

edo1 commented 3 years ago

Using the variant field to signal different bit semantics within RFC 4122 versions is not appropriate.

What about using variant=0b111 in a new format (without version)? Expanding the variable part by three bits reduces the probability of collisions by almost an order of magnitude.
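
As a rough check on the "almost an order of magnitude" figure, the usual birthday approximation for n uniformly random b-bit values (valid while n is far below 2^(b/2)) gives the following. It assumes the three freed bits would be filled with random data, which is an assumption about the new format rather than something specified here.

```latex
p(n, b) \approx \frac{n^{2}}{2^{\,b+1}},
\qquad
\frac{p(n,\, b+3)}{p(n,\, b)} \approx \frac{1}{2^{3}} = \frac{1}{8}
```

By the same formula, each bit removed from the random part doubles the collision probability for a fixed number of generated IDs.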

broofa commented 3 years ago

What about using variant=0b111 in a new format (without version)? Expanding the variable part by three bits reduces the probability of collisions by almost an order of magnitude.

You mean the version part? I suppose you could do that. But imho, defining a new variant should be a Big Deal™. It should be motivated by the need for a whole new class of UUIDs, or by having exhausted the available version options, which we haven't done yet. E.g. if there was a need to move the version field to the end of the UUID (to improve db-locality?), or it needed to be 6 or 8-bits wide instead of 4.

I don't see such a need at this time.

edo1 commented 3 years ago

But imho, defining a new variant should be a Big Deal™

Agree. There were no sortable UUIDs in the standard. Is this a Big Deal™? Seriously though, I want the random part to be as large as possible.

bradleypeabody commented 3 years ago

I tried to lay it out here https://github.com/uuid6/uuid6-ietf-draft/blob/master/LATEST.md as best I could. But the idea is that if the variant field is set to 0b111, the version field fits in the bottom (least significant) bits of the 9th byte (so var and ver are in this one byte). This technically loses us only 1 bit, since variant was 3 bits and version was 4, but now we are using 8.

I agree that we should not try to make new variants of v1, v4, etc. or v6 for that matter (since its goal is to be easily adaptable from and as close to v1 as possible). But I think we can use it in v7 and v8 to simplify the bit layout so there's just one byte you have to worry about when determining version info for v7 and v8:

   0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | var |  ver    |                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

I'm hoping this can help move things toward greater simplicity in the spec/proposal.

broofa commented 3 years ago

I'm hoping this can help move things toward greater simplicity in the spec/proposal.

Before	After
RFC4122 is variant 0b10x	RFC4122 is variant 0b10x or 0b111
2¹²² UUIDs per version	2¹²² or 2¹²⁰ UUIDs per version
`version` in bits 48-51	`version` in bits 48-51 or bits 67-70

This is not "greater simplicity". And the rift this creates in how versioning works is going to be an ongoing pain in the ass to articulate and rationalize about.

"Look for versions 1-5 here if variant is 0b10x, and versions 6-31 there if variant is 0b111. What's that...? What if variant is 0b111, but the version field < 6? Uh... well... that's not really a thing. I mean, it's technically possible, but we didn't want to confuse people by having different versions with the same name."

Now that I think about it, that last part about 0b111 versions 1-5, is actually pretty awful. That those particular variant-version combinations are possible just seems ripe for confusion and abuse.

broofa commented 3 years ago

In Draft 02 set definition of any UUID Version (UUIDv1/2/3/4/5/6/7/8) + Variant E (111) as a method for signaling an alternative bit layout to any previously defined UUID.

This is clearly overkill. Version 1 is the only version where there is any value in rearranging bit layout.

bradleypeabody commented 3 years ago

the rift this creates in how versioning works is going to be an ongoing pain in the ass to articulate and rationalize about.

The logic I'm proposing in https://github.com/uuid6/uuid6-ietf-draft/blob/master/LATEST.md would be simply:

For UUID7 bytes[9] == 0xE7, for UUID8 bytes[9] == 0xE8.

And for UUID versions <= 6 do what RFC4122 indicates.

That's it.

It is a difference, but I don't think it's overly complicated and confusing.
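
For illustration, the rule above could be sketched as follows. This is not code from the draft. `versionOf` is a made-up name, and it assumes the combined octet is at zero-based index 8 (the 9th byte), with the RFC 4122 version in the high nibble of octet 6.

```go
package main

// versionOf returns the UUID version under the scheme sketched in this
// thread, and false when no version can be determined.
func versionOf(u [16]byte) (int, bool) {
	// Proposed rule: the combined variant+version octet identifies v7 and v8.
	switch u[8] {
	case 0xE7:
		return 7, true
	case 0xE8:
		return 8, true
	}
	// RFC 4122 rule: variant bits 10x in octet 8, version in the high
	// nibble of octet 6.
	if u[8]&0xC0 == 0x80 {
		return int(u[6] >> 4), true
	}
	// Everything else, including 0b111 with other low bits, is left undefined.
	return 0, false
}
```

Note that a value such as 0xE3 in octet 8 falls through to the undefined case, which is the combination discussed above as something to disallow.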

Now that I think about it, that last part about 0b111 versions 1-5, is actually pretty awful.

I agree, we should just disallow other uses of the 0b111 variant, since I don't think there's any real benefit to allowing it.

broofa commented 3 years ago

The logic I'm proposing... ... I don't think it's overly complicated and confusing

I understand what you're proposing. And I agree that, had this been the bit layout RFC4122 used from the beginning, it would be simpler than what we have now. But it's not. It's a new layout in addition to what the original RFC specifies, so the complexity it introduces is in addition to the complexity that's already there.

I agree, we should just disallow other uses of the 0b111 variant, since I don't think there's any real benefit to allowing it.

I think you're missing the point. It's not about what exactly we say around this issue (although we'll have to make some decisions there), it's that we have to say anything at all. This change causes a certain amount of cognitive dissonance that can be avoided altogether if we just stick with the current scheme.

bradleypeabody commented 3 years ago

This change causes a certain amount of cognitive dissonance

RFC4122 causes its own cognitive dissonance. The fact that one of the backward compatibility issues that comes up is implementations which check the version bits without checking variant - that's a great example of just strange stuff that should, IMO, never have been there in the first place, and should be deprecated. Unnecessary complexity that people already get wrong because nobody wants to sit down and read the RFC4122 spec, because it's long and unnecessarily complicated. (No offense to the original authors, I'm just saying that today, with the factors at hand, we can do better.)

Some people will choose to either implement only UUIDv7+8, or leave the existing implementations of e.g. v1 and v4 (the most commonly implemented versions) as-is and just write new code for UUIDv7+8. This new code can be simpler and easier to understand. This is a benefit that needs to be measured and compared against the factor of making it different from prior versions.

RFC4122 has many problems. One of the goals here is to "fix" them by introducing a new, simpler design that people can just move forward with and leave the old stuff behind. Once we have new UUIDs, people won't be obligated to continue to have to deal with RFC4122 - if newer versions solve real-world problems, then great, new versions can be implemented and that's it, done deal. I'd much rather focus on making this new draft/spec as simple as possible (while not being unrealistic about backward compatibility issues), than forcing some old stuff from RFC4122 that we don't need.

I will find some examples of implementation differences and post here and hopefully this can help provide a convincing argument of the value of this. But the basic issue I have is I don't think it's simpler to leave things as they were in RFC4122. RFC4122 is a mess, we should move away from it. And if we can do so with only a few manageable backward compatibility concerns, I think it's a workable approach and better in the long run. Again I'll post some code soon to help demonstrate this factor of code complexity.

bradleypeabody commented 3 years ago

If we merge variant and version fields into one byte and combine it with using a common time format, we get things like this:

https://play.golang.org/p/yWjgCNy_GQq

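    var v [16]byte
    // First 8 bytes: Unix time in nanoseconds, big-endian, so values sort by creation time.
    binary.BigEndian.PutUint64(v[:8], uint64(time.Now().UnixNano()))
    // Ninth byte: combined variant (0b111) and version (7).
    v[8] = 0xE7
    // Remaining 7 bytes: random data.
    rand.Read(v[9:])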

That is a correct and useful UUIDv7 implementation (per these notes, not the draft). It lacks a guarantee of monotonicity for values produced within the same clock tick (which IMO should be a recommendation not a requirement), but this can be added with just a few more lines of code.
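
As a sketch of the "few more lines" for monotonicity, one option is to remember the last timestamp issued and bump it by one nanosecond when the clock has not advanced. This is only one possible approach under the 0xE7 layout from these notes. `Generator`, `Next`, and `timestampOf` are illustrative names, and the use of crypto/rand is my choice.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"sync"
	"time"
)

// Generator hands out 0xE7-style UUIDs whose timestamps strictly increase.
type Generator struct {
	mu   sync.Mutex
	last uint64 // last timestamp issued, in Unix nanoseconds
}

// Next returns a new UUID. If the clock has not advanced since the previous
// call (or went backwards), the previous timestamp plus 1 ns is used instead.
func (g *Generator) Next() ([16]byte, error) {
	var v [16]byte
	now := uint64(time.Now().UnixNano())

	g.mu.Lock()
	if now <= g.last {
		now = g.last + 1
	}
	g.last = now
	g.mu.Unlock()

	binary.BigEndian.PutUint64(v[:8], now)
	v[8] = 0xE7
	if _, err := rand.Read(v[9:]); err != nil {
		return v, err
	}
	return v, nil
}

// timestampOf extracts the 64-bit timestamp prefix.
func timestampOf(v [16]byte) uint64 {
	return binary.BigEndian.Uint64(v[:8])
}
```

The guarantee only holds within one `Generator` value in one process. Coordinating across processes or hosts is a separate problem that this sketch does not address.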

Try finding an earlier UUID implementation that is anywhere near that simple to write and understand.

Simplicity in implementation and maintenance is a very real, tangible factor. It needs to be weighed carefully against the cost of changing things. And if we don't change old UUID version <= 5 values, I really don't see the problem.

Is there some specific real-world problem that this would cause that I'm missing? What specifically (which database program, library or software problem) would happen/break/be difficult/annoying/etc if we were to move forward with this? Maybe if I have an actual example of what you're worried about I could better think with it.

fabiolimace commented 3 years ago

I'm still not sure if using half of the future variant is a good idea. I prefer to be conservative in this case. It may be easier to approve the changes.

But I like the simplicity it makes possible. Much of the discussion here arises from the need to work around version bits.

If you really plan to use half of the '111' variant, why would you want to use a version number? Version numbers belong to the '10x' variant. The '111' variant is an uninhabited land. You have the opportunity to create an entirely new layout. There is no need to be stuck with the '10x' variant design, which depends on the version number to differentiate between UUID subtypes.

I think it's better to just define the E variant (enhanced, extended?) and forget about the E7 and E8 versions. Or you can use ONE bit of the E variant as a flag to differentiate between E-UUIDs that have timestamps and those that don't.

I am concerned about reducing the UUID size from 122 to 120 bits. Losing 2 bits can result in a significant increase in collision probability. If you don't use the version number the amount of free bits for entropy increases.

broofa commented 3 years ago

Is there some specific real-world problem that this would cause that I'm missing?

These sorts of questions are harder to answer:

"Is [some UUID] valid?"
"What version is [some UUID]?"
"How do I identify UUIDs in text?"
"I have a valid variant (0b111) and a valid version (3), so why is my UUID invalid?"

... and this sort of code becomes more complex:

bradleypeabody commented 3 years ago

Fair points. You're correct, it does add some more logic to these situations.

However, variable length will break the "find a UUID in text" anyway. So will Crockford Base32 encoding.

And extracting the version is an extra line or two of code to fix that code. (The version.js there btw is another example of broken code - it should be checking the variant bits.)

So I think it's a matter of comparing what happens when the points above are broken or made more difficult, vs the fact that all newer implementations (and some of which will only need to support e.g. UUIDv7 and v8) can be simpler.

UUIDs are supposed to be as opaque as possible. I would also wager that much of the code that is trying to extract version numbers and perform validation is probably doing something not terribly relevant to what most applications need anyway. Why are people checking the version? (Can't you just use the opaque value?) Why are they checking if a UUID is valid? (Are you sure you can't just compare to all zeros to determine if there's a UUID here or not?)

bradleypeabody commented 2 years ago

To follow up from earlier discussion and from #58, my current stance on this is that the simplicity combining the variant and version fields introduces is worth the downsides.

So far the downsides that have been brought up are, along with my rebuttal:

Added complexity/different from RFC4122

I understand the concern. The procedure for examining the version is explained in the new draft with two sentences:

extracting the version number can be done by examining the variant field at bits 64 and 65 for the values 1 and 0 respectively, and then extract the version from bits 48 through 51. UUID versions 7 and 8 can be identified by checking octet 9 for the values 0xE7 or 0xE8 respectively.

Yes, it is different. My opinion is that this does not present too much complexity. The first sentence is just reiterating what RFC4122 says much more verbosely, and the only thing being added is "UUID versions 7 and 8 can be identified by checking octet 9 for the values 0xE7 or 0xE8 respectively."

It reserves two extra bits.

Concerns over the loss of bits being problematic are application specific, and the introduction of variable length UUIDs IMO addresses this concern. A fixed 128-bit value is much more problematic when it comes to concerns about collision probability or unguessability. So I think having those two bits reserved for future use to make one whole byte be devoted to the version is an acceptable tradeoff in the interest of simplicity, considering you can add plenty more bytes to your UUID to further reduce collision probability if your application really needs it. No need to worry about 2 bits when you can add many more if you like.

If this ends up making it into an RFC, I suspect many new implementations will just implement UUID version 7 and/or 8 and not bother with the rest. IMO, making these implementations simpler should be a priority.

LiosK commented 2 years ago

I don't see a real reason to change the variant bits to develop a time-ordered UUID format. Implementing a UUIDv7 generator is an easy job that can be done by just 100 lines of code in many languages, even with the old, weird version/variant layout. The reorganized layout might reduce some lines of code, but I don't think that's worth sacrificing the future extendability of the UUID standard. It's possible in the future that another new UUID format really really needs to move the version bits, and then if no variant is left, the UUID standard will die.

That said, if the last variant should be consumed now, I think the new format should use a different name than version 7. Variant 10x Version 7 may be defined in the future, and such definition should be named as "UUIDv7" to keep consistency with UUIDv1-5. Therefore, Variant 111 Version 7 should be named differently, or Variant 111 series should be started from version 1 with a different naming convention.

kyzer-davis commented 2 years ago

@LiosK, with the placement of the variant+version in the same octet we actually extend variant 111 to be used by a future implementation if they desire. I detail this a bit more in the Draft 03 file found in PR #58 if you want to take a look at the proposed text.

Long story short: We set the 3 variant bits to 111 and dictate the next following bit is always a 0. Thus 1110 = E. This is followed by the four bit version in our new variant; but any future spec may specify that if they want to use 111 the next bit should be set to 1 making 1111 (F), and ultimately a new variant is born for whomever to do what they want. I wanted to ensure we did allow for future extensibility of the UUID spec even though there have been no new additions in ~16 years.
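
A small sketch of that split, as I understand the description above rather than the Draft 03 text itself. `classify` and its string labels are invented for illustration.

```go
package main

// classify interprets the variant octet (zero-based octet 8) under the
// reading described above. kind is one of "draft-E", "future-F",
// "rfc4122" or "other"; version is only meaningful for "draft-E".
func classify(octet byte) (kind string, version int) {
	switch {
	case octet>>4 == 0xE: // 1110, followed by the 4-bit version
		return "draft-E", int(octet & 0x0F)
	case octet>>4 == 0xF: // 1111, left open for a future specification
		return "future-F", 0
	case octet&0xC0 == 0x80: // 10x, version lives in octet 6 instead
		return "rfc4122", 0
	default:
		return "other", 0
	}
}
```

The point of the fourth bit is that an F-prefixed octet never collides with anything defined under E, so a later spec still has a way to mark itself as different.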

As for setting 1110 and starting with version 7 instead of starting over the version counting: This was the conversation between myself and Brad on the topic back in August of 2021:

Kyzer: There is no reason we need to start at version 7 since our bit space is all to ourselves now with this variant. Basically variant 111 + version 1 and version 2 don’t conflict with RFC4122s version 1 and 2.

Brad: I agree with this in principle, but it creates a new problem of explaining to people what the numbering system is. Just calling it "version 7" and saying "in version 7, byte 8 is set to 0xE7" is really simple to understand and follow. I'm open to a proposal of a different numbering system for this 0b111 variant, but I'm not sold enough on the benefits to originate it myself.

I could go either way. After all, UUIDv1ε vs UUIDv1 was my original thought on how to distinguish Variant 1110/E + Version 1 from RFC 4122 variant 10xx/89AB + Version 1.

LiosK commented 2 years ago

Makes sense. My concern is addressed. I am yet to be convinced that the variant bit change is necessary because the simplicity that will be achieved sounds trivial to me, but let's see how others think. Thank you for your clear explanation.

fabiolimace commented 2 years ago

How about defining parallel versions in two variants: v7, v8 and E7, E8? Is it overkill?

In v7 and v8, the version bits are kept in the same position as in the 10xx variant. These versions can be used by those who are conservative.

In E7 and E8, the version bits are placed side by side with the new 1110 variant bits.

The 48-bit timestamp fits both v7 and E7.

E7 and E8 can be expanded up to 64 bytes.

+--------+--------+------------------+
|     VERSION     |                  |
+--------+--------|   DESCRIPTION    |
|  10xx  |  1110  |                  |
+--------+--------+------------------+
|   v1   |   --   |  Time-based      |
|   v2   |   --   |  DCE-security    |
|   v3   |   --   |  Name-based MD5  |
|   v4   |   --   |  Random-based    |
|   v5   |   --   |  Name-based SHA1 |
|   v6   |   --   |  Time-ordered    |
|   v7   |   E7   |  K-sorted        |
|   v8   |   E8   |  Custom          |
+--------+--------+------------------+

v7:   |....time....|M...N...............|

E7:   |....time....|....NM..............|  (...)  ..............................|

v8:   |.............M...N...............|

E8:   |.................NM..............|  (...)  ..............................|

This comment is similar to @kyzer-davis's original proposal for Draft 2 :)

EDIT: added "E7 and E8 can be expanded up to 64 bytes".

broofa commented 2 years ago

any future spec may specify that if they want to use 111 the next bit should be set to 1 making 1111 (F)

@kyzer-davis this was the missing piece for me. Future-standards have to have a way of distinguishing themselves from existing standards.

Re: How about defining parallel versions in two variants

This is a classic "worst of two worlds" solution, IMHO. Pick one or the other... but let's please not be wishy-washy about how implementors indicate and detect versions. It will just lead to yet more confusion.

@bradleypeabody You've cited and rebutted the arguments against this proposal, but I have yet to hear a compelling argument in favor of it. Is there a benefit here beyond the aesthetics of how fields are laid out? While I understand the appeal, that's not solving any actual problems we have. It's just "nice", but that's not a sufficient argument.

The problem with establishing a new variant now, for no reason other than aesthetics, is that we are not in a good position to anticipate the needs of future spec authors (assuming there are any). They (our future selves?) may be able to put the 0b111 variant to better use, so why not leave that option open to them?

bradleypeabody commented 2 years ago

This is a classic "worst of two worlds" solution, IMHO. Pick one or the other... but let's please not be wishy-washy about how implementors indicate and detect versions. It will just lead to yet more confusion.

I wholeheartedly agree with this.

have yet to hear a compelling argument in favor of it. Is there a benefit here beyond the aesthetics of how fields are laid out?

I just want to throw this out there: I think I look at this entire subject a bit differently, and it might be part of the differences we have on this. From my perspective, RFC4122 has some things that are good but it also has a lot of things that are, with the benefit of hindsight, unnecessary. When we consider what aspects to keep and which to change in this new spec, I tend to hear arguments about not changing something from what it is in RFC4122. However, when I look at existing attributes of RFC4122 I ask "do we actually need to keep this?" and "is this good? do we really want this?" Some things we can't get rid of because it will break a lot of existing code. I think we all agree that we can't move the variant field because it will explode a bunch of existing implementations. Fine, that makes sense. But as we get into the other fields when we talk about compelling reasons or justifications for things, I don't see a strong justification for keeping old things the way they are just because someone wrote it in an earlier spec. If it breaks existing implementations, that's something to consider. But hauling around unnecessary complexity from RFC4122 because we don't want to change things too much - I just fundamentally don't think about it like that.

One of the main goals here is to make something useful so databases and other code that needs to make unique identifiers can easily and effectively do so. And I think each of these questions should be measured against that.

So to me this issue of moving the version number is more about answering "can we just make this simpler so new implementations can get rid of the old baggage?" I think once UUIDv6,7,8 are out and specified, a lot of new implementations won't bother implementing the earlier versions (you will notice that at least some existing UUID implementations do not implement all 5 UUID versions, they implement the ones the author understood and deemed useful). Maybe I'm wrong and maybe that comes across as hubris (it's not meant that way), but I really am trying to make it so when people reach for a spec to "make me an ID" they find something simple and easy, not RFC4122.

fabiolimace commented 2 years ago

One of the main goals here is to make something useful so databases

I think UUID v6 and v7 can also be useful for event-driven applications.

This specification suggests using UUID v6 for event IDs: HTTP Feeds.

broofa commented 2 years ago

@bradleypeabody: I recognize the value of a first-principles take on this, I just don't believe it's warranted at this time. If we were creating a radically new standard that deprecated 4122 then, yes, a new variant makes sense. I believe that's what the original RFC authors did. They took the OSF DCE spec for UUIDs, incorporated it as "version 2", and then promptly went on to define a superset that obviated the need to care about what OSF DCE uuids were.

We're not doing that. Or, at least, that's not the sense I get. For example, we're not proposing a replacement for version 4 or 5 (or 3). So the existing RFC will remain important and relevant for some time to come.

I don't see a strong justification for keeping old things the way they are just because someone wrote it in an earlier spec.

This presumes the onus is on me and others to justify why this change should not happen. I disagree with that presumption. The onus is on you to justify why it is needed. Hence, my question. In case it's not clear, my "bar" for justifying a new variant is simple: Create a new variant when the current variant fails to meet existing needs.

So what exactly about the new versions being proposed demands a new variant? The only thing I've seen might be variable length UUIDs. But that idea is not well fleshed out, nor is it essential.

can we just make this simpler so new implementations can get rid of the old baggage?

Nope. Not gonna happen. Whatever warts 4122 version 1 ids may have, version 4 is killing it. It's 80+% of use cases (probably more like 90% if we're being honest). The vast majority of people using UUIDs won't easily be convinced to migrate to a new standard anytime soon.

LiosK commented 2 years ago

Create a new variant when the current variant fails to meet existing needs.

is a good rule of thumb in my opinion to develop an effective spec. Textual encoding is a problem common to all UUID versions, and thus a new one can be applied equally to v1-v5 as well as v6-v8. The variable length idea can also be applied to v1-v5 if it is labeled as, say, "UUID with extended (entropy / shared knowledge) field". The use of new variant denies (or signals to deny) such an organic extension to the existing UUID standard.

BTW IMHO, though this is another issue (scope issue), the new textual encoding and the variable length specs should be removed from the draft because I do not see a clear need for standardization, nor do I believe they are the best solutions for the problems. Without a clear need, I am concerned about whether sufficient involvement of experts can be obtained to develop the best solutions, especially when a lot of developers are focused on the existing urgent need for lexicographically sortable UUIDs.

broofa commented 2 years ago

Worth noting: Laying claim to the 0b111 variant means that any/all future UUID specifications that need to distinguish themselves from existing UUIDs can only do so by setting all 8 bits of the var_ver field.

If we stick with the RFC field layout, future specs need only set the 3-bit variant field to indicate "this is not an RFC UUID".

kyzer-davis commented 2 years ago

@broofa

Laying claim to the 0b111 variant means that any/all future UUID specifications that need to distinguish themselves from existing UUIDs can only do so by setting all 8 bits of the var_ver field.

They, future specification writers, only need to signal the 4 bits (1111) at the start; the remaining 124 can be up to them. That is enough to ensure the bit-space does not overlap.

See the current Draft 03 text in 4.1. Variant and Version Fields and let me know if this is signaled properly with what I wrote.

broofa commented 2 years ago

They, future specification writers, only need to signal the 4 bits (1111) at the start; the remaining 124 can be up to them. That is enough to ensure the bit-space does not overlap.

Ah, okay. I somehow missed the part about variant being 4 bits instead of 3. That makes sense.

ben221199 commented 2 years ago

For me, it seems unnecessary to introduce a new variant. In RFC 4122, there are 3 variants described:

Variant #0 (Backward compatibility for Apollo NCS)
Variant #1 (DCE)
Variant #2 (Backward compatibility for Microsoft)

The first variant is the original variant that has the family field. This 8-bit family field did range from 0-255, but only 0-13 were used. That is why the first 4 bits can be used to define the UUID variant. For 0b0___, it is still variant 0, for 0b10__ it is variant 1 and for 0b110_, it is variant 2.
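A minimal sketch of that prefix rule, assuming the variant is read from the high nibble of octet 8 (the first hex digit of the fourth group in the text form). The function name `variantOf` is hypothetical, and the numeric return values simply follow the Variant #0/#1/#2 numbering used in this comment, with 3 standing in for the reserved 0b111 pattern.

```javascript
// Classify a UUID by the leading bits of octet 8, per the prefixes above.
// `variantOf` is an illustrative name, not part of any spec or library.
function variantOf(uuid) {
  // Index 19 is the first hex digit of the fourth group in the
  // 8-4-4-4-12 text form, i.e. the high nibble of octet 8.
  const nibble = parseInt(uuid[19], 16);
  if (Number.isNaN(nibble)) throw new Error("not a hex digit at index 19");
  if ((nibble & 0b1000) === 0b0000) return 0; // 0b0___ : Apollo NCS
  if ((nibble & 0b1100) === 0b1000) return 1; // 0b10__ : DCE / RFC 4122
  if ((nibble & 0b1110) === 0b1100) return 2; // 0b110_ : Microsoft
  return 3;                                   // 0b111_ : reserved
}

console.log(variantOf("017f4c74-bd17-74cb-9959-ae3b35f5f270")); // 1
```

Each test masks one more leading bit than the previous one, which is why the prefixes never overlap.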

Variant 0 didn't define a version field, but variant 1 and variant 2 did. Variant 2 uses the same versions as variant 1. This means that every new version defined for variant 1, can also be used for variant 2.

However, by using a new variant, let's call it variant 3, it seems only version 7 and 8 could be used inside this variant. I think this isn't how it should be. If you create a new variant, you should make a decision between 2 options:

Make a variant that can use all versions of variant 1 (like variant 2 can use all versions of variant 1)
Make a variant that describes a new versioning. So, the versions in variant 3 are not the same as the versions in variant 1. (But at least you start counting from v1 again)

kyzer-davis commented 2 years ago

@ben221199

only version 7 and 8 could be used inside this variant

There are still 4 bits for version, allowing 0 through F (0 through 15). We are only allocating 7 and 8 in this spec; as such, 0 through 6 and 9 through 15 remain available for variant 111+0.

Make a variant that describes a new versioning. So, the versions in variant 3 are not the same as the versions in variant 1. (But at least you start counting from v1 again)

As for deciding to go straight to v7 and v8: it was just a logical approach in the document to keep things flowing. See my comment above about distinguishing the two variants:

Copied:

[...] UUIDv1ε vs UUIDv1 was my thought originally on how to distinguish Variant 1110/E + Version 1 vs RFC 4122 variant 10xx/89AB + Version 1

My Suggested edits that will make this a bit more clear:

Add the Lower case epsilon (ε) nomenclature to the Draft 03 Variant and Version Fields if it helps drive the point home.
- Anywhere I reference UUIDv7/UUIDv8/Version 7/Version 8 change to UUIDv7ε/UUIDv8ε/Version 7ε/Version 8ε
I can also split Table 2 UUID versions defined by this specification into RFC 4122 Updated version reservations and Draft 03 Version reservations.
- Table A: Define RFC 4122 variant 10xx/89AB + Version 6 then fill out a table up to v15 with "Reserved for future definition"
- Table B: Define Variant 1110/E + Version 7ε and 8ε filling out the rest as "Reserved for future definition"

broofa commented 2 years ago

Add the Lower case epsilon (ε) nomenclature to the Draft 03 Variant and Version Fields if it helps drive the point home. Anywhere I reference UUIDv7/UUIDv8/Version 7/Version 8 change to UUIDv7ε/UUIDv8ε/Version 7ε/Version 8ε

This epsilon-ification of version numbers is just weird. I get that it's convenient for us authors/reviewers to use, but for casual readers it's going to be confusing. Exposing this lingo to users is just going to lead to conversations like this:

"Oh, I didn't know there was more than one version? So what are they?"

"Well, there's versions 1-5, and then there's versions 6ε-8ε"

"I'm sorry... epsi-wat?!?"

And while we're on the subject, I think it's worth pointing out how much larger the audience of casual readers of this specification is vs. people actually writing UUID implementations. (Witness the uuid JS module: 3 maintainers vs. 11M+ dependent projects.) That little bit of confusion is felt by 1,000s of people for every 1 person who bothers to write a UUID implementation.

I will admit that, as an implementor, this new variant is growing on me. However I still don't feel the scope of this work - at least in its current form - warrants it. I'll try not to rehash the "bit swizzling" side of things, though. I think we all know what is / is not involved there.

Regarding @bradleypeabody's argument that "RFC4122 is a mess, and we should move away from it", I have two concerns. The first is that if that's what we're actually doing, we're kind of throwing the baby out with the bathwater. The only portion of 4122 that needs revamping is version 1. Everything else still stands. Versions 4 and 5, in particular, will continue to be relevant.

Which brings me to my other concern: Unless we explicitly state that we're obsoleting 4122 (e.g. the way RFC9051 obsoletes RFC3501), I don't see how we can claim to be "moving away from it". We are, instead, just extending it, as evidenced by our choice of version #'s. Heck, we say as much in the opening paragraph (emphasis mine):

This document is a proposal to update [RFC4122]

Basically if we're adopting a new variant, we need to do a better job removing the need for people to concern themselves with the old one. Specifically...

[ ] Define 0b111 versions 0-5 as reserved for legacy use (as defined in RFC4122)
[ ] Specify that versions 2 and 3 are deprecated. (I assume? Version 3 certainly is. Version 2 isn't spec'ed in 4122 even, so... who knows?)
[ ] Provide complete definitions for versions 4 and 5, such that readers don't need to reference 4122.
[ ] Insert the requisite XML incantation for obsoleting RFC4122

That 3rd point is the awkward one, and the one that I guess has me struggling to believe we need a new variant. If all we're doing is copy/pasting the text from 4122 - which is probably all that we should be doing if we go this route given we haven't established any need to change either of those versions - .... can we really say we're "moving away from it"?

ben221199 commented 2 years ago

If you want a UUIDv7 and UUIDv8 on the variant 0b1110 (Variant#3), in my opinion you should also define UUIDv1 to UUIDv6 for this variant AND should also define UUIDv7 and UUIDv8 for Variant#2. In this case you use the same "versioning" for both variants. Else, just don't introduce a new variant.

kyzer-davis commented 2 years ago

I will admit that, as an implementor, this new variant is growing on me.

This is my current stance. In writing prototypes, testing and even writing the document this is much, much easier. Meanwhile, for Draft 03 Brad and I are going to at least keep this present so the IETF can give some weight on the topic. If their feedback is negative then I can roll it back in Draft 04. As such, I need to ensure that Version and Variant section of Draft 03 is as clear as possible. I think my suggested edits in my last comment will go a long way to improving the text.

@broofa, I agree, we cannot obsolete RFC4122. Else you get into the scenario you described where we need to define every one of RFC4122's UUIDs and where they stand now. It is a VERY big undertaking and IMO keeping this as an update to RFC4122 continues to be the best strategy.

Version parity in both variants: @fabiolimace, @ben221199, @broofa
I have noodled on this over the past few months, and here are my compiled thoughts:

For v7/v8:
- Upsides:
  - No more confusion about what v7/v8 is for.
  - This extends the "do what you want" version to include RFC4122 variant since I know many people already do this and then slap v4 on it.
- Downsides: We eat up 2 versions in 10xx variant which is now half-used leaving 7 more versions in that bit space.
  - Counterpoint: We were already going to eat up 2 versions in that variant so maybe this is no big deal.
- Editor Actions: If we do this I need to update v7/v8 sections (and appendix) with ASCII layouts and examples for both.
For v1 through v5:
- Upsides: No future confusion about what version 1-5 is with either variant.
- Downsides:
  1. 10xx variant is 122 bits while 1110 variant is 120 bits.
  2. We just reserved half of the new variant. Leaving them open for future specs is an effective method of utilizing only what we need and nothing more.
- Editor Actions: If I reserve them in the new variant I MUST put text about how to translate all of them to 120 bits. Hence my strategy of leaving it as v7/v8 only and reserving only what we need.
Version 0:
- In writing my comment yesterday this was the first time I realized there was an undefined version 0. Personally I am okay with reserving version 0 in both variants as free form "do what you want with all remaining non-version, non-variant bits" over calling it version 8.
- Either way we use a version; this is just semantics.
- Editor Actions:
  - Modify all UUIDv8/Version 8/v8 text to be UUIDv0/Version 0/v0.
  - Add ASCII/Appendix for both variant layouts.

Epsilon usage:

This was only brought up as a way to distinguish the variant E#, and the ε character seems to be a tidy way to quickly provide an at-a-glance reference in text. If I get the text in version/variant to convey the point properly we should be able to avoid the confusion about what that is used for.

Edit: Epsilon would not play ball with the IETF converter tool. Thus a capital "E" has been substituted instead.

Notes: Editor Actions are only for me to keep track of what I would need to modify in the draft if consensus was achieved. Reminder: Draft 03 HTML pre-IETF draft can be found here: https://uuid6.github.io/uuid6-ietf-draft/

ben221199 commented 2 years ago

I remember reading about version 0 somewhere, but I don't know where. I recall that it was reserved and meant only for invalid UUIDs. I also found this: https://github.com/r-lyeh-archived/sole

I see that there is a difference in available bits (120 vs 122) in the new variant. In that case the whole idea of a new variant seems unnecessary. If you still want to introduce a new variant, use the variant for just one format. Don't do any subversioning.
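The 120 vs 122 figure follows from simple subtraction. The sketch below assumes, as described earlier in this thread, that the 10xx variant fixes only its two leading bits while the proposed 1110 variant fixes all four; `payloadBits` is a hypothetical helper name.

```javascript
// Bits left for timestamp/counter/random data once the version and
// variant fields are fixed. Field widths are taken from this thread.
const TOTAL_BITS = 128;
const VERSION_BITS = 4;

function payloadBits(variantBits) {
  return TOTAL_BITS - VERSION_BITS - variantBits;
}

console.log(payloadBits(2)); // RFC 4122 variant 10xx -> 122
console.log(payloadBits(4)); // proposed variant 1110 -> 120
```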

broofa commented 2 years ago

Version 0

Yeah, kind of odd that this isn't addressed in the RFC and hasn't had much discussion here. The only argument against using it that I can see would be to avoid confusion with the Nil UUID. Not a very strong argument, however.

If we switch version 8 to be version 0, I would suggest we have it (version 0) use the 0b10x variant and not the new one. Otherwise explaining which versions correspond to which variants just gets that much more confusing. ("1-5 are 0b10x, versions 0 and 7 are 0b111").

But I think doing that weakens the case for 0b111. We'd be defining a new variant but only using it for a single version (version 7).

kyzer-davis commented 2 years ago

@ben221199 interesting you found at least one implementation already using version 0 for custom implementations.

Most places I checked reference Version 0 as Nil UUID just like @broofa pointed out but technically that isn't correct.
I poked through the RFC4122 errata to see if there was anything of value on this topic but there was not.
I also searched through the IETF dispatch mailer and only found a single reference from my colleague, but I think version 0 was used as a placeholder.

For the moment I will scratch v8 to v0 from the agenda, but when I submit draft 03 I will shoot an email to the dispatch mailer to see if anybody has other usages of v0 that I couldn't find.

fabiolimace commented 2 years ago

If the timestamp is 48 bits, I think it is no longer necessary to move the version number to another variant.

|000000000000|7|000|N|000000000000000|

|------------|M|---|N|---------------|
     time      random      random     
               counter                
               submsec                
M: 7
N: [89ab]

The code block below shows a simple function to generate UUIDv7 in JavaScript. It uses Math.random() for simplicity. It's not efficient, of course, but it gets the job done.


function hex(number, len) {
    return number.toString(16).padStart(len, '0');
}

function random(bits) {
    const max = Math.pow(2, bits);
    return Math.floor(Math.random() * max);
}

function uuid7() {

    let uuid = "";

    // get hexadecimal timestamp
    let ms = (new Date()).getTime();
    let timestamp = hex(ms, 12);

    // concat timestamp and random
    uuid += timestamp.substring(0, 8);
    uuid += "-";
    uuid += timestamp.substring(8, 12);
    uuid += "-";
    uuid += hex(random(16), 4);
    uuid += "-";
    uuid += hex(random(16), 4);
    uuid += "-";
    uuid += hex(random(48), 12);

    // put version and variant
    uuid = uuid.split('');
    uuid[14] = '7';
    uuid[19] = ['8', '9', 'a', 'b'][random(2)];
    uuid = uuid.join('');

    return uuid;
}

function main() {
    let total = 10;
    for (let i = 0; i < total; i++) {
        console.log(uuid7());
    }
}

main();

Output:

017f4c74-bd17-74cb-9959-ae3b35f5f270
017f4c74-bd1b-7971-a9cf-144713872aa1
017f4c74-bd1b-7f3e-b7e2-b57df2796602
017f4c74-bd1b-7179-b79c-dc3a706a2144
017f4c74-bd1b-7303-a248-0cf396e4559e
017f4c74-bd1c-7ea8-a5ef-c3ad91c1d413
017f4c74-bd1c-79dd-ab45-05469dbba543
017f4c74-bd1c-744d-bd2a-d1ffecb6d853
017f4c74-bd1c-716e-b577-62bf993de8fd
017f4c74-bd1c-7e09-8ce8-6d9e3efe57b1

EDIT: updated the function random() to receive a number of bits as argument.
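As a rough companion to the generator, the sketch below reads the fields back out of one of the strings above. It assumes the layout in the diagram (48-bit millisecond timestamp, version digit right after it, variant in the first digit of the fourth group); `parseUuid7` is a hypothetical helper name, not an existing API.

```javascript
// Read back the fields written by uuid7() above.
// `parseUuid7` is an illustrative name, not an existing API.
function parseUuid7(uuid) {
  const digits = uuid.replace(/-/g, "");
  if (!/^[0-9a-f]{32}$/i.test(digits)) {
    throw new Error("expected 32 hexadecimal digits");
  }
  return {
    // First 12 hex digits: 48-bit Unix timestamp in milliseconds.
    timestampMs: parseInt(digits.substring(0, 12), 16),
    // 13th hex digit: version.
    version: parseInt(digits[12], 16),
    // High two bits of the 17th hex digit: 0b10 for the RFC 4122 variant.
    variantBits: parseInt(digits[16], 16) >> 2,
  };
}

const parsed = parseUuid7("017f4c74-bd17-74cb-9959-ae3b35f5f270");
console.log(parsed.version, parsed.variantBits.toString(2));
console.log(new Date(parsed.timestampMs).toISOString());
```

A 48-bit value fits comfortably within the 53 bits a JavaScript number can represent exactly, so `parseInt` is safe for the timestamp here.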

kyzer-davis commented 2 years ago

@fabiolimace, @ben221199, @broofa @LiosK I gave the Draft 03 Version and Variant section another coat of paint in the latest PR #75. I think it is in a much better spot now but let me know what you think, specifically on the new text, over on #66

broofa commented 2 years ago

~~Google~~Code-golfing @fabiolimace's example using modern JS and BigInt, and including implementations for both forms of variant.

Source on CodePen.

function bigrand(bits, shift = 0n) {
  return BigInt(Math.floor(Math.random() * 2 ** bits)) << shift;
}

function toUUIDString(bignum) {
    const digits = bignum.toString(16).padStart(32, "0");
    return `${
      digits.substring(0, 8)
      }-${digits.substring(8, 12)
      }-${digits.substring(12, 16)
      }-${digits.substring(16, 20)
      }-${digits.substring(20, 32)
    }`;
}

// RFC variant
function uuid7() {
  return toUUIDString(
    (BigInt(Date.now()) << 80n) | // timestamp  
    (0x07n << 76n) | // version
    bigrand(12, 64n) |
    (0x8n << 60n) | // variant
    bigrand(14, 48n) |
    bigrand(48)
  );
}

// New variant
function uuid7e() {
  return toUUIDString(
    (BigInt(Date.now()) << 80n) | // timestamp
    (0x07en << 72n) | // version|variant
    bigrand(36, 36n) |
    bigrand(36)
  );
}

console.log(uuid7());
console.log(uuid7e());
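For comparison, here is a sketch of reading the fields back out of the 128-bit values, using the same bit offsets as the two generators above. The helper names `fromUUIDString`, `fieldsRfc`, and `fieldsNewVariant` are hypothetical, and the combined version|variant octet reflects the layout in this code rather than any finalized specification.

```javascript
// Extract fields using the bit offsets from uuid7() and uuid7e() above.
// Helper names are illustrative only.
function fromUUIDString(uuid) {
  return BigInt("0x" + uuid.replace(/-/g, ""));
}

// RFC variant: 4-bit version at bit 76, 2-bit variant at bit 62.
function fieldsRfc(bignum) {
  return {
    timestampMs: bignum >> 80n,
    version: (bignum >> 76n) & 0xfn,
    variant: (bignum >> 62n) & 0x3n,
  };
}

// New variant: version and variant share one octet at bit 72.
function fieldsNewVariant(bignum) {
  const verVar = (bignum >> 72n) & 0xffn;
  return {
    timestampMs: bignum >> 80n,
    version: verVar >> 4n,
    variant: verVar & 0xfn,
  };
}

// Hand-built sample: an arbitrary timestamp plus the 0x7e octet.
const sample = (1646000000000n << 80n) | (0x07en << 72n);
console.log(fieldsNewVariant(sample)); // version 7n, variant 14n
```

Note that nothing in the bits alone tells a reader which of the two functions to apply; that choice has to come from the spec text on how the variant is signaled.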

kyzer-davis commented 2 years ago

Update: I am going to shoot for the moon and write up v7/v8 + 8/9/A/B var sections so folks have something to compare against when I submit to IETF. Plus this will make it easier to roll back to old var layout in draft 04 if need be.

Formats

UUID Version 6
UUID Version 7
UUID Version 7E
UUID Version 8
UUID Version 8E
Max UUID

bradleypeabody commented 2 years ago

@kyzer-davis are you planning on putting these in the spec or as an appendix? I worry that regardless of which way this ends up going the spec should clearly propose one way and not both. So maybe write up the UUID 8 and 7 non-E versions as appendices and mention that these are alternates that could be used if the var+ver field idea is shot down. Does that effectively address both of our concerns?

ben221199 commented 2 years ago

I think I should make a complete format-list here:

UUID:
- Variant#0 (0xx) - The legacy UUID by Apollo Computer
  - *
    - 32 bits (time_high), 16 bits (time_low), 16 bits (reserved), 8 bits (family), 56 bits (node)
  - socket_$unspec (0x0)
  - socket_$unix (0x1)
  - socket_$internet (0x2)
  - socket_$implink (0x3)
  - socket_$pup (0x4)
  - socket_$chaos (0x5)
  - socket_$ns (0x6)
  - socket_$nbs (0x7)
  - socket_$ecma (0x8)
  - socket_$datakit (0x9)
  - socket_$ccitt (0xA)
  - socket_$sna (0xB)
  - socket_$unspec2 (0xC)
  - socket_$dds (0xD)
- Variant#1 (10x)
  - v1 (0x1) - See RFC 4122
  - v2 (0x2) - See RFC 4122
  - v3 (0x3) - See RFC 4122
  - v4 (0x4) - See RFC 4122
  - v5 (0x5) - See RFC 4122
  - v6 (0x6) - See https://github.com/uuid6/uuid6-ietf-draft
  - v7 (0x7) - See https://github.com/uuid6/uuid6-ietf-draft
  - v8 (0x8) - See https://github.com/uuid6/uuid6-ietf-draft
- Variant#2 (110)
  - Used in Microsoft DCOM as Interface ID; could not find any format description so far
- Variant#3 (111)
  - Unused

The family field is 8 bits, but because only values 0 (0b0000) to 13 (0b1101) are used, the first 4 bits can be used to indicate the variant. When the family field starts with a 0-bit, it means everything is still legacy UUID, so that means that values 0 (0b00000000) to 127 (0b01111111) can be used as family, where 0 to 13 are already allocated. When the family field starts with a 1-bit, it is definitely not legacy UUID.

Notice that every variant has its OWN subtyping. Variant#0 uses families. Variant#1 uses "versions". Think carefully before deciding to open up a new variant. If you open up a new variant, you have 4 options:

1. Use the subtyping of another variant.
   - BAD IDEA, because Variant#1 has 1 more bit available than Variant#3. Many "versions" are not compatible with Variant#3.
2. Introduce a new subtyping. Variant#0 has families, Variant#1 has versions, so Variant#3 could have "structures" for example.
   - BETTER IDEA, because you can create new "structures" and also start counting from 0 or 1 (like v1 in Variant#1).
3. Don't use subtyping at all. Your variant will only have one format; it seems that Variant#2 is like that.
   - HMMMM, you introduce a new variant, but use it for only 1 type. Seems a waste.
4. Don't open up a new variant. Just stay with the normal versioning in Variant#1.
   - BEST IDEA, because why do you want to open up a new variant? Variant#1 isn't even halfway full.

ben221199 commented 2 years ago

So what things are now available:

Variant#0:
- Family 14-127; for example for adding IPv6 (socket_$internet6)
Variant#1:
- Version 0
- Version 6, 7, 8 (but will be used after publication of this spec)
- Version 9-15
Variant#2:
- Unknown
Variant#3:
- This whole variant is reserved for later use. So everything could be done with it.

kyzer-davis commented 2 years ago

Group,

I made some large changes in #85 to introduce both v7/v8 and v7E/v8E as I stated in https://github.com/uuid6/uuid6-ietf-draft/issues/26#issuecomment-1061175609

Please give it a review and let's keep discussing that text here.


Formats