uuid6 / new-uuid-encoding-techniques-ietf-draft

New UUID Encoding Techniques

Discussion: Variable Length UUIDs | UUID Long #2

Closed kyzer-davis closed 2 months ago

kyzer-davis commented 2 years ago

All things Variable Length UUIDs!

Biggest Question up front: Should this be in scope for this RFC Draft?

If so as per @broofa

peterbourgon commented 2 years ago

I think the more common name for "variable-length UUID" is "string" :)

kyzer-davis commented 2 years ago

Throwing my 2 cents in:

Length

Q: Are we sticking with 128 bits or are we introducing variable length?

Sources:

broofa commented 2 years ago

Should this be in scope for this RFC Draft?

No. Not enough real-world use cases, evidence of need, or experience among contributors to merit including this.

bradleypeabody commented 2 years ago

I think the more common name for "variable-length UUID" is "string" :)

Agreed. But this is also kind of the whole point, see next.

Not enough real-world use cases

Let's break it down like this: We're making a new UUID spec. One of the whole points of this is to make it useful for database identifiers. I'm hoping that we will see things such as SQL statements like INSERT INTO ... VALUES ( NewUUID7(), ...) (or their crazy NoSQL or whatever language equivalent) - where databases will eventually add native support with functions etc. This was the original problem that drove the UUIDv6 idea in the first place.

So the question becomes: is there a use case for supporting NewUUID7(20) (instead of the default, which would use 16 bytes/128 bits)? To me, that's one of the main things we're trying to answer with this.
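To make the NewUUID7(20) idea concrete, here is a minimal sketch of such a generator. It assumes a UUIDv7-style layout for the first 16 bytes (48-bit Unix millisecond timestamp, version and variant bits, random bits elsewhere) and simply appends random bytes beyond that. The name `new_id` and its `length` parameter are hypothetical, not part of any draft.

```python
import os
import time


def new_id(length: int = 16) -> bytes:
    """Return a UUIDv7-style value, optionally followed by extra random bytes.

    Hypothetical helper: only the first 16 bytes follow the assumed UUIDv7 layout.
    """
    if length < 16:
        raise ValueError("this sketch only supports lengths of 16 bytes or more")

    ts_ms = (time.time_ns() // 1_000_000) & ((1 << 48) - 1)  # 48-bit Unix time in ms
    value = bytearray(ts_ms.to_bytes(6, "big") + os.urandom(10))
    value[6] = (value[6] & 0x0F) | 0x70  # version nibble = 7
    value[8] = (value[8] & 0x3F) | 0x80  # variant bits = 10
    return bytes(value) + os.urandom(length - 16)  # extra entropy, if requested


if __name__ == "__main__":
    print(new_id().hex())    # 16 bytes
    print(new_id(20).hex())  # 16 bytes plus 4 extra random bytes
```

With the default length the result is an ordinary 128-bit value; anything longer is the same prefix plus opaque entropy, which is exactly the case the thread is debating.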

Larger than 128 bits is not required from a collision avoidance perspective and adds extra unneeded data to the application/database/wire/etc

If this is really the case and it's not needed, then fine, maybe it's a moot point. I admittedly haven't studied the collision probabilities carefully enough to have a firm opinion on it. But I just don't see how we can say that one degree of collision resistance is "good enough for everyone", that's the point I'm stuck on. Maybe an idea here would be to get a list put together of collision probabilities at various lengths. I still don't know how to convert "1 in (some huge number)" into "good enough for most applications" - I'm open to ideas on how to evaluate this.
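One way to build such a list is the standard birthday-bound approximation, p ≈ 1 − exp(−n² / 2^(b+1)), for n IDs that each carry b random bits. The sketch below is illustrative only: the function name and the volume of one trillion IDs are arbitrary choices, and b counts random bits rather than total length (a v4 UUID has 122 random bits out of 128).

```python
import math


def collision_probability(n_ids: float, random_bits: int) -> float:
    """Birthday-bound approximation of the chance of at least one collision."""
    exponent = (n_ids ** 2) / 2.0 ** (random_bits + 1)
    return -math.expm1(-exponent)  # 1 - exp(-x), accurate for tiny x


if __name__ == "__main__":
    n = 1e12  # one trillion IDs, an arbitrary illustrative volume
    for bits in (64, 96, 122, 160, 256):
        p = collision_probability(n, bits)
        print(f"{bits:>3} random bits: p ~= {p:.3e}")
```

A table like this does not say what is "good enough", but it does let an application compare its expected ID volume against a tolerable collision probability.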

The other questions brought up, while valid, do have straightforward answers:

If a UUID is larger than 16 bytes, how is it padded?

It's not; the thing holding the value is responsible for knowing its length.

... what are the extra bits / bytes used for?

Entropy/collision resistance

... can a UUID have a fractional-byte length? (e.g. 173 bits) ... how would that be parsed?

No. Just like lengths in most systems are specified in bytes, we would limit to byte-aligned lengths.

What does the string-form of a non-standard length UUID look like? a.k.a "Where do the hyphens go?"

Add one more hyphen after the 16th byte and then no more in the ensuing digits.
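As a sketch of that rule: the first 16 bytes keep the usual 8-4-4-4-12 grouping, and any remaining bytes follow as one undivided hex run after a single extra hyphen. The function names `format_long` and `parse_long` are hypothetical.

```python
import uuid


def format_long(value: bytes) -> str:
    """8-4-4-4-12 for the first 16 bytes, then one hyphen and the remaining hex."""
    if len(value) < 16:
        raise ValueError("expected at least 16 bytes")
    text = str(uuid.UUID(bytes=value[:16]))
    extra = value[16:].hex()
    return f"{text}-{extra}" if extra else text


def parse_long(text: str) -> bytes:
    """Inverse of format_long: drop the hyphens and decode the hex digits."""
    return bytes.fromhex(text.replace("-", ""))


if __name__ == "__main__":
    print(format_long(bytes(range(20))))
```

Note that the parser learns the length only from the end of the string, so the container (a delimiter, a column type, a length field) still has to say where the value stops.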

What does it mean for a uuid to have length < 16 bytes?

It's just a shorter form with less entropy. I'm not stuck on this one, if we added the ability to make longer UUIDs but not shorter ones, that would at least handle the "is this enough entropy" concern, which is the primary one.

As an implementor, when scanning a UUID field, how do I know how many bytes/bits to read? I.e. how does a UUID indicate its length? This is important, if for no other reason than to know where the next field in the data starts.

It depends on the protocol using the UUID. If there is an existing protocol/format which expects to transmit a UUID as exactly 16 bytes, well, then yeah, that's not going to work with variable length. But a lot of transports already have a separate encoding mechanism for the length (i.e. if it's in a field in the database then the database already keeps track of how many bytes that field is, since it would be a string or a blob; JSON has delimiters, msgpack has a length number, and so on). So again, I understand why some applications wouldn't need or want variable length UUIDs and couldn't use them. But is there really a reason to explicitly make NewUUID(20) "invalid", even though databases could easily handle this because they have a means of storing the length outside of the actual value?

What does variable field length imply for other, existing RFC versions? "Why can't those now be longer?"

Not our problem. RFC4122 is fixed length and that's fine. UUIDv8 allows people to do whatever they want. I think there is enough flexibility here for people to do what they need. If someone needs some specific variable length UUID that isn't UUIDv7, they just use UUIDv8.

LiosK commented 2 years ago

+1 to the opinions that the variable length stuff should NOT be in scope for the new RFC. But if it were:

I don't think the shorter UUID is feasible. Short ID implementations usually utilize application-specific shared knowledge to ensure uniqueness, while the UUID specs are not designed that way. It is unlikely to be possible to ensure practical collision resistance by simply shortening the 128-bit UUID versions.

UUID is not an all-in-one ID standard but just a universally unique ID standard; it would just be confusing to include a spec that will never be universally unique.


IMO, the longer UUID spec is not necessary until the 128-bit length is proven to be insufficient and many applications seek longer IDs. At that time a new standard will be necessary to coordinate many applications and libraries, but in the meantime, an application, if concerned about collisions, can simply extend UUID on its own, for example, by utilizing the following struct:

struct MyUniqueId {
  uint8_t uuid[16];
  uint8_t ext_entropy[8];
  uint8_t machine_id[8];
};

I have never seen this kind of approach employed to append extra entropy, but I think it is often common to add a type or namespace tag at the application level to discriminate UUIDs (e.g. user/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx and item/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx). Therefore, I think the new RFC can discuss this kind of approach for those who need extra collision resistance, instead of specifying a variable length UUID standard.
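The application-level tagging described here can be sketched in a few lines. The `kind/uuid` string layout follows the user/... and item/... examples above; the helper names are hypothetical, and the UUID itself stays a standard 128-bit value.

```python
import uuid


def make_tagged_id(kind: str) -> str:
    """Prefix a standard v4 UUID with an application-level type tag."""
    if not kind or "/" in kind:
        raise ValueError("kind must be non-empty and must not contain '/'")
    return f"{kind}/{uuid.uuid4()}"


def parse_tagged_id(text: str) -> tuple[str, uuid.UUID]:
    """Split a tagged ID back into its tag and its UUID."""
    kind, sep, raw = text.partition("/")
    if not sep or not kind:
        raise ValueError("expected '<kind>/<uuid>'")
    return kind, uuid.UUID(raw)


if __name__ == "__main__":
    tagged = make_tagged_id("user")
    print(tagged, parse_tagged_id(tagged))
```

The tag lives outside the UUID, so any existing UUID library can still parse the identifier part unchanged.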

It would be valuable if the new RFC includes a section that identifies the source of collision resistance and its limitations and suggests the possible approaches to mitigate such limitations.

fabiolimace commented 2 years ago

I am of the opinion that the variable length UUID should be out of scope.

@LiosK, I couldn’t agree with you more.

bradleypeabody commented 2 years ago

Understood on the above points, and that a fair number of people dislike this idea.

How about this as an approach that would significantly simplify the whole thing: it would both remove variable length from the scope of UUID implementations and provide guidance to those applications that absolutely must have more entropy because 128 bits is not enough. How about text such as this for the spec:

UUID as a Source of Opaque Bytes or Strings

For many applications, UUIDs are treated as opaque (i.e. never parsed) sequences of bytes. Or the text form is used as an opaque string of characters. The purpose is solely to uniquely identify a resource, and its exact form is not otherwise relevant.

In these cases, if an application wishes to emit longer random byte sequences or strings in order to add more entropy, there is nothing stopping them from doing so. Such outputs are not covered by this specification, and as such, library and application implementors are not obligated to support them.

If used, applications should take care not to call such values "UUIDs" as they are not. E.g. the more generic term "ID" or simply "unique string" would be more appropriate.


In this way, nobody reading the spec and trying to implement a UUID library is obligated to do anything with this at all. UUIDs are not variable length. Parsers (which I believe should be minimal anyway) don't have to deal with it if they don't want to, and so on. However, those people who are like "No you don't understand, I absolutely must have more entropy for my application and I don't want to invent something new" can just turn the UUID generator into a generator for whatever length they want. This handles my concern and removes the complexity that would have been added if "variable length" were specified as a feature. It also means that a database implementor could make something like "NewID(20)" work and output something in the form of UUIDv7 plus more bytes, if they felt it necessary and they wouldn't have a "non-standard implementation", they are just doing what the above section says.

broofa commented 2 years ago

@bradleypeabody Such verbiage serves no purpose other than to confuse things. It's just saying that UUIDs can be embedded as part of a larger data structure and that such data structures should not be considered UUIDs.

This has always been the case. I think it's generally understood that all standards can be (ab)used in this way should implementors choose to do so. There's no need to encourage it by actually talking about it.

If we're going to say longer-form UUIDs are out of scope we should just omit all discussion of them. [Edit: ... other than to say, "Longer UUIDs were considered by the authors but were not deemed to be worth addressing at this time."]

bradleypeabody commented 2 years ago

Such verbiage serves no purpose other than to confuse things.

This question of "so I'm implementing a 'make me an ID' function, what should I do?" is quite literally the use case that drove this project in the first place.

And on this same line, "Should we allow longer IDs for applications that need more entropy/collision resistance" is a very relevant question that has come up many, many times. Here on GitHub and also in prior discussions on the IETF mailing list (if someone is interested I can try to go dig up the links to these things).

You will notice there is an overarching theme to the current draft where concerns that are only applicable in certain cases were removed from the actual specification and instead turned into a short section to discuss the topic, as guidance to the implementor. In my mind this falls into that category. Even if longer values are "not a UUID", it is still a relevant concern to people reading the document. Because the goal is not just to "make a new UUID format", but also to answer the question "how should I generate unique IDs in a distributed environment".

I doubt I will be able to convince you personally of the importance of this, but hopefully the above makes it clear that this subject is in fact relevant to other use cases and users. And considering it adds no implementation requirements, and clearly addresses a concern that I've been asked about easily a dozen times in the course of doing all this, I think it's relevant enough to include.

If you have ideas of how to improve the wording, suggestions are welcome.

ben221199 commented 2 years ago

Seems to me that if you talk about UUID, you talk about 128 bits. Imagine the work needed to update all the software that just made a 128-bit field. For me too, variable-length UUID is out of scope. If you want a UUID with variable length, it would be better to create a new type of identifier. For example, let's introduce the UOID (Universal Object ID). Everyone talking about UUID knows about 128 bits and everybody talking about UOID knows that it follows a different standard that could possibly have a variable-length version.

broofa commented 2 years ago

@bradleypeabody My concern is that we risk sending a mixed message. For example, in the proposed text you say:

applications should take care not to call such values "UUIDs" as they are not

... which I agree with 100%. But your explanatory text seems to suggest otherwise (emphasis mine):

It also means that a database implementor could make something like "NewID(20)" work and output something in the form of UUIDv7 plus more bytes, if they felt it necessary and they wouldn't have a "non-standard implementation", they are just doing what the above section says.

A UUID is 128 bits. If an ID is longer or shorter, or composed in a way that does not comply with the specification, it should not be considered standard, or even "not non-standard".

an overarching theme to the current draft where concerns ... turned into a short section to discuss the topic

I actually quite like this section. Nor am I averse to addressing the "should IDs be longer" topic there. As you say, they're a much-discussed topic. I just think it should be done in a way that is unambiguous about the fact longer ids are not part of this standard.

suggestions

"Since you asked..." 😄

Longer UUIDs, Composition With Other Data

There has been much debate on the utility of longer forms of ids that provide additional uniqueness guarantees, or that allow for encoding additional information. Such ids are deliberately not addressed by this specification, as it is felt the requirements are too specialized to be effectively addressed at this time.

That said, no prohibition is made that prevents applications from using a longer form of ID that combines a UUID with other data. Such constructs should be considered non-standard, however, and care should be taken not to refer to them as "UUIDs". In such cases, applications are encouraged to use a more generic term such as "ID" or "unique string", or invent a new term so as to avoid confusion with standard, 128-bit UUIDs.

sergeyprokhorenko commented 2 years ago

There has been much debate on the utility of longer forms of ids that provide additional uniqueness guarantees, or that allow for encoding additional information. Such ids are deliberately not addressed by this specification, as it is felt the requirements are too specialized to be effectively addressed at this time.

@broofa You have destroyed the main purpose of this specification: Introduce new UUIDs which make good database keys. Good database keys should be long enough to contain the metadata, but not at the expense of the random part. Otherwise, developers will have to accompany the UUID with additional fields, which is bad for the database architecture. A length of 160 bits is absolutely necessary. By the way, 160 is divisible by 5, which is convenient for Crockford base32.

Your fundamentalist position will result in just one more standard that many will have to give up. It seems that the bad example of UUID taught you nothing. ULID and similar identifiers appeared due to the fact that the authors of the standard did not pay attention to the needs of the database architecture. Competing standards will emerge that will bypass your artificial length limitation. Is this what you want?

By the way, I don't like the very term Variable Length UUID. It seems to imply that you can lengthen or shorten the UUID. But it's not. I prefer the more precise term UUID variants of various lengths.

broofa commented 2 years ago

You have destroyed the main purpose of this specification... taught you nothing

Personal attacks are not warranted. Nor do they help convince people of your argument.

A length of 160 bits is absolutely necessary.

I'm not convinced this has been established.

For example, a database with 3.3 quadrillion v4 UUIDs has a one-in-a-million chance of collision (P = 0.0001%). In real-world terms, such a dataset will be HUGE, even by modern standards. Assume a 50-200% overhead for an index and a very conservative 100 bytes per row of other data and we're talking about a 400+ petabyte database.
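The arithmetic behind this kind of estimate can be reproduced with the birthday-bound approximation solved for the number of IDs, n ≈ sqrt(−2^(b+1) · ln(1 − p)). The sketch below assumes b = 122 random bits for a v4 UUID and models storage as the ID, an index overhead on the ID, and 100 bytes of other data per row; the function names and the exact storage model are illustrative assumptions.

```python
import math

V4_RANDOM_BITS = 122  # a v4 UUID fixes 6 of its 128 bits (version and variant)


def ids_for_probability(p: float, random_bits: int = V4_RANDOM_BITS) -> float:
    """Birthday-bound estimate of how many random IDs give collision probability p."""
    return math.sqrt(-(2.0 ** (random_bits + 1)) * math.log1p(-p))


def table_size_bytes(rows: float, id_bytes: int = 16,
                     index_overhead: float = 0.5, other_bytes: int = 100) -> float:
    """Rough size: each row stores the ID, an index entry for it, and other data."""
    return rows * (id_bytes * (1.0 + index_overhead) + other_bytes)


if __name__ == "__main__":
    rows = ids_for_probability(1e-6)  # one-in-a-million collision chance
    print(f"rows: {rows:.2e}")
    print(f"size at 50% index overhead:  {table_size_bytes(rows) / 1e15:.0f} PB")
    print(f"size at 200% index overhead: {table_size_bytes(rows, index_overhead=2.0) / 1e15:.0f} PB")
```

Changing `random_bits` shows how quickly the row count grows with ID length, which is the trade-off the rest of this comment argues about.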

Now before anyone cries foul over how contrived such examples are... trust me, I know. "What if you need P=0.0000000001%?"... "Such databases exist!!"

My point is not that longer IDs aren't needed. It's (ironically) that we have to contrive use cases for them. We're not the only ones. To my mind, this is clear evidence that we don't understand the problem space well enough to be authoring a specification for them. For example, @sergeyprokhorenko, why are we so convinced 160 bits is sufficient? Why not 256? Or 1024?

In 2005, when the RFC was authored, the idea of a system capable of guessing hashes at a rate of 24 x 10^18 / second was just idle speculation. The stuff of science fiction. But Bitcoin was invented just 3 years later, and was requiring exactly that level of computation in 2018 when the article in that last link was written. And now, four years later, that hash rate has increased another order of magnitude.

This is why I've been so persistent in my resistance to changing the variant, by the way. I foresee much bigger, sweeping changes to what UUIDs look like in the future. As much as I'd like to contribute to authoring a spec that will last into the next century, I simply don't have the hubris to believe what we're doing here will come anywhere close to that.

To my mind, we should limit our efforts to the problem space we understand. Meaning, primarily, providing a form of ID that fits within the parameters within which most of the UUID-using community operates, with easily identifiable (and justifiable) improvements.

LiosK commented 2 years ago

I foresee much bigger, sweeping changes to what UUIDs look like in the future. As much as I'd like to contribute to authoring a spec that will last into the next century, I simply don't have the hubris to believe what we're doing here will come anywhere close to that.

Couldn't agree more.

sergeyprokhorenko commented 2 years ago

@broofa You are confusing database table keys with cryptographically strong access tokens. They have completely different purposes. Database developers don't care about key guessability at all, and in most cases they get by with auto-increment.

The length of 160 bits comes from real practical needs for metadata in keys. I am a systems analyst with many years of experience in many of the world's leading banks and companies, and unlike you, I don't have to invent these needs. I see them firsthand. And I would be sorry for my time wasted discussing a useless standard. If, as you yourself admit, you do not understand the problem, just trust a professional.

You are also trying to save on key lengths while degrading the database architecture. But a greedy man pays twice or even more.

You refer to the non-existent experience of using UUIDs as database keys in an abstract community. This argument is worthless. I actually have experience with UUIDs as database keys. It was very convenient, very slow, and there was a severe lack of metadata in keys.

sergeyprokhorenko commented 2 years ago

If you have ideas of how to improve the wording, suggestions are welcome.

@bradleypeabody I like your idea of describing prospect IDs for databases as a combination of UUID + other data in one field of a database table.

I can offer the following concise wording:

The surrogate key MAY be a concatenation of the UUID followed by an additional random part and metadata. By default, its length is 160 bits.
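A minimal sketch of such a 160-bit key, assuming a v4 UUID followed by 4 extra bytes (random unless the caller supplies metadata). It also shows the Crockford base32 point made earlier in the thread: 160 bits divide evenly into 32 five-bit characters. The helper names and the 4-byte tail are illustrative assumptions, not proposed wording.

```python
import os
import uuid

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # no I, L, O, U


def make_surrogate_key(extra: bytes = b"") -> bytes:
    """UUID (16 bytes) followed by a 4-byte tail: 160 bits in total."""
    tail = extra or os.urandom(4)
    if len(tail) != 4:
        raise ValueError("this sketch expects exactly 4 extra bytes")
    return uuid.uuid4().bytes + tail


def crockford_encode(data: bytes) -> str:
    """Encode bytes as Crockford base32, most significant 5 bits first."""
    n_chars = (len(data) * 8 + 4) // 5  # ceil(bits / 5)
    value = int.from_bytes(data, "big")
    return "".join(
        CROCKFORD[(value >> shift) & 31]
        for shift in range((n_chars - 1) * 5, -1, -5)
    )


if __name__ == "__main__":
    key = make_surrogate_key()
    print(len(key) * 8, "bits ->", crockford_encode(key))
```

A plain 128-bit value needs 26 characters in this encoding, with 2 unused bits in the leading character; the 160-bit key fills all 32 characters exactly.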

LiosK commented 2 years ago

@sergeyprokhorenko Interesting. The discussion so far has primarily focused on the extra entropy to guarantee collision resistance. Subsuming metadata under a surrogate key sounds like a very different story. What do the world's leading banks and companies exactly do with such a data structure? What kind of metadata do they embed? What kind of data items do such keys refer to? Why do they not use a composite key?

broofa commented 2 years ago

@broofa You are confusing database table keys with cryptographically strong access tokens.

Am I? From RFC4122, §6:

Do not assume that UUIDs are hard to guess; they should not be used as security capabilities (identifiers whose mere possession grants access),

sergeyprokhorenko commented 2 years ago

Interesting. The discussion so far has primarily focused on the extra entropy to guarantee collision resistance. Subsuming metadata under a surrogate key sounds like a very different story.

@LiosK It's not new in this discussion. See example

I would suggest multiple types of metadata:

What do the world's leading banks and companies exactly do with such a data structure? What kind of metadata do they embed? What kind of data items do such keys refer to?

They use auto-increment, or UUID v4, or non-standard time-based UUIDs like in Laravel + checksum, or surrogate keys that have the following structure: operation type code or other code + date + sequence

Why do they not use a composite key?

They use composite keys very extensively and suffer greatly from this.

peterbourgon commented 2 years ago

Database developers don't care about key guessability at all

I'm not sure you can make a generalized statement like this. Some database folks care about guessability, some don't.

the main purpose of this specification: Introduce new UUIDs which make good database keys.

I don't think that is the main purpose of this specification. If it is, where is it stated?

sergeyprokhorenko commented 2 years ago

the main purpose of this specification: Introduce new UUIDs which make good database keys.

I don't think that is the main purpose of this specification. If it is, where is it stated?

@peterbourgon Here is the proof

peterbourgon commented 2 years ago

That indicates that "good database keys" is a goal, but it doesn't indicate that it's the main purpose of the specification.

sergeyprokhorenko commented 2 years ago

Is there any other reason for generating monotonic UUIDs? If you look at the selected prototypes of this specification, then you will no longer have doubts.

peterbourgon commented 2 years ago

There are tons of reasons to generate [monotonic] UUIDs that have nothing to do with databases. I've used ULIDs in many situations where they were simply record locators in files, or file names.

broofa commented 2 years ago

record locators in files, or file names

How is monotonic behavior useful in such cases? I.e. as opposed to just using a v4 UUID (or ids of similar ilk.)

peterbourgon commented 2 years ago

The progenitor of oklog/ulid is oklog/oklog, which leverages ULIDs to assign (roughly) monotonic identifiers to each ingested log record. Those IDs establish a deterministic global order which is leveraged as an invariant at many points throughout the system, including as names for segment files (LO-HI.txt) which thereby become self-describing as to their contents, and naturally sortable.

kyzer-davis commented 2 years ago

It has been a minute since I checked this topic... I see a lot of good discourse has transpired since I re-opened the thread!

Putting my bullet of notes below. Let me know if I missed any key points that would influence text I write.


Also, from what I can see discussion has mostly been on longer UUIDs. I have been doing some supplemental research and want to provide some data on shorter UUIDs which I (and others) have deemed Locally Unique Identifiers (LUIDs).


Scope:

Locally Unique Identifiers (LUID)

UUID Long


Generic Format notes

- UUID Long:    8-4-4-4-12-any | Len: 128+
- UUID:         8-4-4-4-12     | Len: 128
- LUID Long:    10-4-4-4-10    | Len: 128
- LUID Short A: 10-4-4         | Len: 72 (When truncating LUID to variant byte.)
- LUID Short B: 8-4-4-2        | Len: 72 (When truncating UUID to variant byte.)

UUIDv8E Example as all types

- UUID-L: xxxxxxxx-xxxx-xxxx-E8xx-xxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
- UUID:   xxxxxxxx-xxxx-xxxx-E8xx-xxxxxxxxxxxx
- LUID:   xxxxxxxxxx-xxxx-xxE8-xxxx-xxxxxxxxxx
- LUID-A: xxxxxxxxxx-xxxx-xxE8
- LUID-B: xxxxxxxx-xxxx-xxxx-E8

UUIDv4 Example as all types

- UUID-L: xxxxxxxx-xxxx-4xxx-Axxx-xxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
- UUID:   xxxxxxxx-xxxx-4xxx-Axxx-xxxxxxxxxxxx
- LUID:   xxxxxxxxxx-xx4x-xxAx-xxxx-xxxxxxxxxx
- LUID-A: xxxxxxxxxx-xx4x-xxAx
- LUID-B: xxxxxxxx-xxxx-4xxx-Ax
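The formats above differ only in where the hyphens fall and how many bytes are kept, which a small sketch can make explicit. The `PATTERNS` table mirrors the group sizes listed in the format notes; the function name is hypothetical.

```python
PATTERNS = {
    "UUID": (8, 4, 4, 4, 12),
    "LUID Long": (10, 4, 4, 4, 10),
    "LUID Short A": (10, 4, 4),
    "LUID Short B": (8, 4, 4, 2),
}


def group_hex(value: bytes, pattern: tuple[int, ...]) -> str:
    """Split the hex digits of value into hyphen-separated groups of the given sizes."""
    digits = value.hex()
    if len(digits) != sum(pattern):
        raise ValueError("value length does not match the pattern")
    parts, pos = [], 0
    for size in pattern:
        parts.append(digits[pos:pos + size])
        pos += size
    return "-".join(parts)


if __name__ == "__main__":
    data = bytes(range(16))
    print(group_hex(data, PATTERNS["UUID"]))
    print(group_hex(data, PATTERNS["LUID Long"]))
    print(group_hex(data[:9], PATTERNS["LUID Short A"]))  # truncated after the variant byte
    print(group_hex(data[:9], PATTERNS["LUID Short B"]))
```

Both short forms keep the first 9 bytes (72 bits), so the version and variant digits survive truncation in either grouping.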

LUID Research


Edit 3/8/2022: LUID was something I came across and felt like documenting. I will work on UUID Long for this document however I may reach out to Tableau engineering and see if they are interested in a future LUID spec.

kyzer-davis commented 2 years ago

Group,

UUID Long has been implemented in uuid6/uuid6-ietf-draft#85. Please review and let me know what you think. For any feedback let's discuss here.

broofa commented 2 years ago

@kyzer-davis I missed your comment from a few days ago with the proposed text. My apologies for not paying attention.

I will admit this whole UUID Long section caught me by surprise. I felt (hoped) that we'd settled this as being out of scope. Be that as it may, we've officially crossed my threshold for what does / does not qualify as an extension to RFC4122.

Between "UUID Long" and the proposed alternate text encodings (#2), I can't support the spec in the proposed form. The number of possible permutations for UUID forms is too large to qualify as a "Standard". We're just enumerating how a variable number of bytes may be represented in a variety of encodings. This is not helpful. It will cause more problems than it solves.

For example, if a system receives a UUID Long encoded in an unknown base, how are the base and length determined? Is there even a canonical solution to that? (I believe @bradleypeabody has raised this concern previously.)

If we're going to continue down this path we need to simplify things to bring this back to the level of specificity that I believe a Standard demands. We should create a new RFC that drops the 8-4-12 notation, pick one (and one only!) new encoding, I guess adopt the new variant, and deprecate 4122 altogether.

sergeyprokhorenko commented 2 years ago

@broofa,

The complexity of the tool has to correspond to the actual complexity of the subject area. Developers cannot get by with a stone ax for all occasions. The variety of formats is caused by a variety of needs, and is not a thoughtless combination of possible components.

broofa commented 2 years ago

@sergeyprokhorenko: Please convert this UUID to its 8-4-4-4-12 hex form:

[Edit: screwed up the encoding 😦 ] NmExZDU3NGJiMzBhLTQyZjYtYjY1MS0wYTdhLWM4YjA0ZTZiNGJlOC1jMzkzYmViMmY=

nu1KF7ZXDin2HDc6lYOBn0ibj5B+fij=

This should be, must be, a trivial problem if the spec is well-formed. If it's non-trivial (which I obviously think it is or I wouldn't be posing this question) then I assert this effort has gone astray and we need to rethink what exactly it is we're trying to accomplish here.

The complexity of the tool has to correspond to the actual complexity of the subject area

... until it doesn't.

sergeyprokhorenko commented 2 years ago

Please convert this UUID to its 8-4-4-4-12 hex form:

It's not a UUID at all, because the encoding is not Crockford's base32. There are many transcoding libraries out there, so leave that up to the developers.

peterbourgon commented 2 years ago

Well, a UUID doesn't have to be encoded in Crockford base32 to be a UUID — any encoding is fine, as long as the (decoded) bytes satisfy the relevant specification. But, with that said, 8-4-4-4-12 hex is already an encoding, so if you base64 encode that I guess you've gone one encoding too many 😉

However I do agree with the underlying point made by @broofa. Variable width types are enormously less efficient to parse than fixed width types: parsing one requires two phases and at least one conditional, and parsing a sequence of them is O(n) rather than O(1).
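The difference in parsing cost can be sketched as follows. With fixed 16-byte values the n-th element sits at a computed offset; with variable-width values every earlier element has to be walked first. The 1-byte length prefix used here is an assumption made purely for illustration, since the UUID Long text defines no such header.

```python
def nth_fixed(buf: bytes, index: int, width: int = 16) -> bytes:
    """Fixed width: the offset is computed directly, with no scanning."""
    start = index * width
    if index < 0 or start + width > len(buf):
        raise IndexError("index out of range")
    return buf[start:start + width]


def nth_length_prefixed(buf: bytes, index: int) -> bytes:
    """Variable width (assumed 1-byte length prefix): walk every earlier value."""
    if index < 0:
        raise IndexError("index out of range")
    pos = 0
    for _ in range(index):
        if pos >= len(buf):
            raise IndexError("index out of range")
        pos += 1 + buf[pos]  # skip the length byte and the value it describes
    if pos >= len(buf):
        raise IndexError("index out of range")
    length = buf[pos]
    if pos + 1 + length > len(buf):
        raise ValueError("truncated value")
    return buf[pos + 1:pos + 1 + length]


if __name__ == "__main__":
    values = [b"a" * 16, b"b" * 20, b"c" * 26]
    packed = b"".join(bytes([len(v)]) + v for v in values)
    print(nth_length_prefixed(packed, 2))
```

The second function has the extra read and the extra branch per value that the comment describes, and random access into a sequence becomes a linear scan.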

sergeyprokhorenko commented 2 years ago

@peterbourgon Nobody talks about dynamically variable width, which is different in different records of the table. It's just that for different purposes there should be UUIDs of different widths.

peterbourgon commented 2 years ago

@sergeyprokhorenko

@peterbourgon Nobody talks about dynamically variable width, which is different in different records of the table. It's just that for different purposes there should be UUIDs of different widths.

So uuid6/uuid6-ietf-draft#85 defines UUID Long as

UUID Long: A generalized name for any variable length UUID longer than 128-bits.

`8-4-4-4-12-<any_number_of_hex_characters>`

Which means that parsing a UUID Long requires the consumer to know how many additional bytes have been appended by the producer. As far as I can see, that makes a UUID Long a type with variable width.

Typically the approach for types like this is to encode the width in a preamble or header on each value. But if that's not the case here, and the number of additional bytes is expected to be communicated out-of-band somehow, then UUID Long by itself isn't actually useful, because it doesn't contain enough information to enable end-to-end communication. You would need to declare that a field is of type "UUID Long 10", or "UUID Long 48", which is equally well expressed as "UUIDvN plus 10 bytes" or "UUIDvN plus 48 bytes" etc.

sergeyprokhorenko commented 2 years ago

@peterbourgon

parsing a UUID Long requires the consumer to know how many additional bytes have been appended by the producer.

It's right.

As far as I can see, that makes a UUID Long a type with variable width.

This is not true. All parties need to know the approved length of the UUID before use of the information system. The length should not change from time to time. There is no need to report the length of each UUID.

You would need to declare that a field is of type "UUID Long 10", or "UUID Long 48", which is equally well expressed as "UUIDvN plus 10 bytes" or "UUIDvN plus 48 bytes" etc.

Yes, that's right.

peterbourgon commented 2 years ago

If UUID Long must be combined with a width in order to be usable, then there's no reason for UUID Long to exist.

sergeyprokhorenko commented 2 years ago

If UUID Long must be combined with a width in order to be usable, then there's no reason for UUID Long to exist.

@peterbourgon, There is no logic in these words

broofa commented 2 years ago

@sergeyprokhorenko I think @peterbourgon's point is that if all participants have to agree on the width of a UUID in order for it to be usable, then they can just as easily agree on other, better ways to share whatever information is encoded in the extra bytes. (E.g. composite data structure or separate DB column).

UUIDs should be completely self-contained, with no need to consult any outside source (other than this spec) to understand how they are created, used, or parsed. This is what gives them their utility and uniqueness. As soon as you require applications to coordinate, to share other information about them, this specification quickly stops being useful.

[Edit to add: And if the extra bytes are strictly there to ensure uniqueness, which is the only actually-valid reason for extending a UUID imho, then we're just looping back to the debate about what should / should not be addressed in a new RFC.]

peterbourgon commented 2 years ago

If all participants have to agree on the width of a UUID in order for it to be usable, then they can just as easily agree on other, better ways to share whatever information is encoded in the extra bytes. (E.g. composite data structure or separate DB column).

Right. I can't say a column is of type UUID Long, I have to say it's UUID Long + 10 or UUID Long + 48 or whatever. But if I can define a type as UUID Long + 10 I can just as easily define it as UUIDv6 + 10 with totally equivalent results. So why define UUID Long at all? It doesn't accomplish anything.

sergeyprokhorenko commented 2 years ago

I suggest the following wording for the section "DBMS and Database Considerations", which will probably suit everyone. It replaces UUID Long.

The key field (surrogate key) of a database table, or a log message identifier, MAY contain a UUID followed on the right by a hyphen and an additional random segment and/or multiple metadata items, such as entity type or database table code, namespace, shard or partition code, data source code, operation/message type code, UUID/field/identifier checksum, and/or other application-specific items.

kyzer-davis commented 2 years ago

Group, great discussion, this is why I author the proposed RFC text!

Converting from GitHub threads to "RFC Speak" always drives great conversations and uncovers things that may not have been considered. PR uuid6/uuid6-ietf-draft#85's text has been written in a way that I can easily remove E Variant, Alt Encoding, and UUID Long or transpose that XML structure to an alternate Draft that focuses on these topics.

That being said, we have a few engineering challenges with these sections but I am confident this group will be able to derive a great solution! I reviewed the last comments and I will summarize a few of the topics as usual.


Signaling UUID Alt Encoding Method(s)

@broofa makes a good point on this topic: how do we determine the method to unpack this? I believe the point is also relevant for regular UUID + alt encoding. If System A creates a UUID / UUID Long, encodes it with a random method and sends it to System B as urn:uuid:<encoded_uuid>, how does System B determine how to decode that UUID?


Signaling UUID Long Length

I'll be honest, this one slipped my mind and should have been included with my discussions here. We absolutely need a method to signal the length of the UUID Long. I have two possible methods for solving this problem:


General URN Author note, I would need to do a deep dive on RFC8141 to ensure any potential URN proposals are valid syntax before authoring text.

Editor Note: Must include text about assuming urn:uuid: implicitly equals urn:uuid:base16hd:128: for backwards compatibility reasons. base16hd - Base16 UUID with Hex + Dashes.

sergeyprokhorenko commented 2 years ago

I wouldn't hardcode the signaling data identifiers into a UUID or into a message, but would instead use links, similar to links to imported libraries and functions.

In addition, signaling data may only be needed when forwarding messages (similar to checksums). They should not be stored in the database, as they greatly increase database lookup time, especially if the signal data precedes the UUID in the key.

A JSON-like envelope would be useful, in which the UUID will be sent. Inside this envelope will be checksums and information about the encoding. At the same time, some metadata must be stored in the database key along with the UUID, and cannot be assigned to separate JSON fields.

kyzer-davis commented 2 years ago

Announcement

I had a great discussion with @bradleypeabody and this topic has officially been marked out of scope for Draft 03 (and any future draft.) The XML text is retained and over the next few weeks I will author a separate Draft 00 which includes this topic specifically.

For now please focus on the technical challenges proposed by my previous comment: uuid6/new-uuid-encoding-techniques-ietf-draft#2

Edit: To further clarify, Draft 03 will cover UUIDv6 through v8 + Max UUID. The new Draft 00 will cover E Variant, Alternate Encoding and UUID Long. These are two drafts that cover different topics, so implementations may choose what they want to support, e.g. an implementation supports RFC8675309 for v7 but not RFC123456789 for alt encodings.

kyzer-davis commented 2 years ago

Group,

Following the previous announcement I have drafted up a new RFC Draft document to cover this topic.

Since the discussion threads are in this repo I decided to create a new folder for the topic. GitHub Pages picks up the folder nicely and thus Draft 00 of "New UUID Encoding Techniques" can be found here:

Additionally, I have authored an "Extended UUID URN namespace" for conveying encoding type and length of a UUID to other applications defined by this document. I still have more research on URNs to do but I feel confident enough that the proposed URN is backwards compatible with RFC4122 and also compatible with RFC8141.

martinheidegger commented 2 years ago

It is curious to me why you chose

urn:uuid:73E94FE0-E951-4153-AAF3-50E4E6089D9D:128:base16hd

over

urn:uuid:base16hd:128:73E94FE0-E951-4153-AAF3-50E4E6089D9D

in the 00 draft as it seems the latter one would be the better choice to me.

kyzer-davis commented 2 years ago

@martinheidegger,

I toyed with both for a while in draft 00. Personally I prefer the latter as well. However, for left-right parsing and compatibility reasons I felt keeping the format of urn:uuid:{uuid_value} at the start made the most sense. After all, we are extending what currently exists, thus a post-fix of :{uuid_length}:{uuid_encoding} seemed logical for that approach.

Final: urn:uuid:{uuid_value}:{uuid_length}:{uuid_encoding}

And for placing {uuid_length} before {uuid_encoding}: This stemmed from how I was describing the URN examples in the text, e.g. "..128 bit, base32 encoded UUID...". I have no reservation for either; I just want to ensure the text I author is consistent with the URN format to promote readability.

Lastly, this allows for future extension if need be. Say for example somebody wanted to extend further and describe :{uuid_variant}:{uuid_version} they can easily define a spec that post-fixes these without overly complicating the existing structures.

Edit: The variable length of the {uuid_value} isn't much of a problem if the URN is split on the colon character : since the doc says any colon in a UUID value should be percent encoded, which helps with this.
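A minimal sketch of that colon-split parsing, based only on the urn:uuid:{uuid_value}:{uuid_length}:{uuid_encoding} format described in this comment and not on the draft text itself. The names `ExtendedUrn`, `parse_extended_urn` and `format_extended_urn` are hypothetical. Treating a plain urn:uuid:{uuid_value} as 128-bit base16hd follows the backwards-compatibility editor note earlier in the thread, adapted to the post-fix order.

```python
from dataclasses import dataclass
from urllib.parse import quote, unquote

PREFIX = "urn:uuid:"
DEFAULT_LENGTH = 128           # bits, assumed when no post-fix is present
DEFAULT_ENCODING = "base16hd"  # Base16 UUID with hex + dashes


@dataclass(frozen=True)
class ExtendedUrn:
    value: str     # the encoded UUID text, percent-decoded
    length: int    # UUID length in bits
    encoding: str  # encoding label, e.g. "base16hd"


def parse_extended_urn(urn: str) -> ExtendedUrn:
    if urn[:len(PREFIX)].lower() != PREFIX:
        raise ValueError("not a urn:uuid: URN")
    # Colons inside the value are percent-encoded, so a plain split is safe.
    fields = urn[len(PREFIX):].split(":")
    if len(fields) == 1:  # legacy form without a post-fix
        return ExtendedUrn(unquote(fields[0]), DEFAULT_LENGTH, DEFAULT_ENCODING)
    if len(fields) == 3:
        value, length, encoding = fields
        return ExtendedUrn(unquote(value), int(length), encoding)
    raise ValueError("expected urn:uuid:{value}[:{length}:{encoding}]")


def format_extended_urn(value: str, length: int, encoding: str) -> str:
    # safe="" forces ":" in the value to be emitted as %3A.
    return f"{PREFIX}{quote(value, safe='')}:{length}:{encoding}"
```

With this shape, an existing urn:uuid: value and its explicit :128:base16hd form parse to the same result, which is the compatibility property the post-fix layout is meant to preserve.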