New UUID Encoding Techniques

Discussion: Alternate Text Encoding Methods (Crockford's Base32, etc) #3

Closed fabiolimace closed 3 weeks ago

fabiolimace commented 4 years ago

I think the 'base32hex' encoding, described in section 7 of RFC 4648, should be used instead of creating a new one. I know that the algorithm is the same with a different alphabet, but 'base32hex' already exists for the same purpose.
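To illustrate the point that the algorithm is the same and only the alphabet differs, here is a minimal sketch in JavaScript. It assumes RFC 4648 style 5-bit grouping with no "=" padding; `uuidToBytes` and `encodeBase32` are hypothetical helper names, not part of any draft or library.

```javascript
// RFC 4648 section 7 "base32hex" alphabet and Crockford's alphabet (no I, L, O, U).
const BASE32HEX = "0123456789ABCDEFGHIJKLMNOPQRSTUV";
const CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

// Hypothetical helper: parse an 8-4-4-4-12 string into 16 bytes.
function uuidToBytes(uuid) {
  const hex = uuid.replace(/-/g, "");
  if (!/^[0-9a-fA-F]{32}$/.test(hex)) throw new Error("not a UUID: " + uuid);
  return Uint8Array.from(hex.match(/../g), (pair) => parseInt(pair, 16));
}

// Take 5 bits at a time from the left, zero-fill the final group,
// and emit no "=" padding (an assumption for this sketch).
function encodeBase32(bytes, alphabet) {
  let bits = 0;     // bit accumulator
  let bitCount = 0; // number of unconsumed bits in the accumulator
  let out = "";
  for (const byte of bytes) {
    bits = (bits << 8) | byte;
    bitCount += 8;
    while (bitCount >= 5) {
      out += alphabet[(bits >>> (bitCount - 5)) & 31];
      bitCount -= 5;
    }
    bits &= (1 << bitCount) - 1; // keep only the unconsumed bits
  }
  if (bitCount > 0) out += alphabet[(bits << (5 - bitCount)) & 31];
  return out;
}

const sample = uuidToBytes("73E94FE0-E951-4153-AAF3-50E4E6089D9D");
console.log(encodeBase32(sample, BASE32HEX));
console.log(encodeBase32(sample, CROCKFORD));
```

Both calls produce 26 characters whose digit values are identical position by position; only the symbols drawn from the alphabet differ.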

bradleypeabody commented 2 years ago

So if we don't change the default format, what about something like this (rough paraphrase of text section):

Text Format

The text format (indicate 8-4-4-4-12 format from RFC4122) remains the default format. The default string conversion and parsing functionality for UUIDs remains as is.

Additionally, implementations MAY choose to implement a format which we'll call "compact" and is Crockford Base 32 (indicate alphabet and no padding). This format has properties which make it more compact than the default hex format, but also due to not containing punctuation or case sensitivity still makes it useful in a variety of situations such as file names, URLs, DNS records, and also sorts the same as the binary form. Since such values do not contain the "-" character they can be reliably discerned from the default hex format if needed.

Furthermore, it is pointed out that there is no restriction on text encodings used by applications which have other needs. For example an encoding of any base between 2 and (?) can be implemented using the algorithm shown in Appendix N(link).

If we did something like the above, there is zero obligation for library authors, we still introduce a recommendation for a "compact" format so people wanting this have guidance (instead of everyone having to guess which base might be good for them and reason about all of the tradeoffs described above), and we also point out that nobody is limited to this and even provide example code with maximum flexibility for those that need it.

Also, if people hate the name "compact" we could just refer to it as "Crockford Base 32", although something shorter would be useful for function names, e.g. maybe "b32c".
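The claim that the compact form "sorts the same as the binary form" depends on the alphabet: for fixed-length strings in a single letter case, plain string comparison matches numeric order when the digit characters appear in ascending code point order. A small sketch of that check, with `isAsciiSorted` as a hypothetical helper name:

```javascript
const ALPHABETS = {
  crockford: "0123456789ABCDEFGHJKMNPQRSTVWXYZ",
  base32hex: "0123456789ABCDEFGHIJKLMNOPQRSTUV", // RFC 4648 section 7
  base32: "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567",    // RFC 4648 section 6
};

// True when each digit character has a higher code point than the previous one.
function isAsciiSorted(alphabet) {
  for (let i = 1; i < alphabet.length; i++) {
    if (alphabet.charCodeAt(i - 1) >= alphabet.charCodeAt(i)) return false;
  }
  return true;
}

for (const [name, alphabet] of Object.entries(ALPHABETS)) {
  console.log(name, isAsciiSorted(alphabet));
}
// crockford true
// base32hex true
// base32 false
```

The standard RFC 4648 base32 alphabet fails the check because its digits 2-7 carry the highest values but the lowest code points.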

daegalus commented 2 years ago

I will chime in, as a library author, I would prefer to do a little extra work to bring UUID into a modern format as the one and only format, than backwards compatibility with an inefficient old format.

Now, I understand I am a unique case, as I have some unpopular opinions on versioning and support, so maybe we should hear others chime in. But, anyone adding support for UUID6/7/8 will already have to do work, adding an encoding method isn't that much extra work.

But anyways, just thought I would chime in and offer a view point of a UUID library maintainer.

LiosK commented 2 years ago

@bradleypeabody, I understand the temptation to get rid of the ugly old format, but changing the default format might be considerably difficult (perhaps much more difficult than introducing the new var) for existing libraries because it breaks almost all dependent code. But anyway this is a variation of broader "replacing or extending the original RFC" questions than just the text format issue.

Another topic I'd like to hear opinions on is the single vs. multiple alternative format(s) problem. I believe the spec should recommend a base-62ish format in addition to a base-36ish format because of the fact that a considerable number of UUID alternatives did choose base-62ish formats. I think by providing multiple alternatives the spec will be able to satisfy wider needs at a reasonable expense, because it is easy to implement base-62 if a library already implements base-36, and similarly it is easy to implement base-64 if a library already implements base-32. But I would like to hear honest opinions on this because many people here seem to be against this idea.

For the avoidance of doubt, I would not support an any-base any-alphabet approach. I believe the spec should recommend a couple of alternative formats with specific bases, alphabets and algorithms, so that libraries can work together using the specific formats.

kyzer-davis commented 2 years ago

Group,

I implemented the first pass of Alternate encoding section in uuid6/uuid6-ietf-draft#85. Please let me know via this thread if I missed a key point or if I need to further iterate on some text.

Also, from here on out please use the UUIDv4 example in the text for all base proof of concept work. This way we are consistently using the same UUID for conversion for illustrative purposes. UUIDv4 Base Example: 73E94FE0-E951-4153-AAF3-50E4E6089D9D

@LiosK, @fabiolimace please double check my b32 and b36 example included in the draft PR.

Note: I am not against including others like b58, b62, etc. Somebody give the conversion of that UUIDv4 and I will include it. We could even have a neat table in the appendix with that UUIDv4 converted into any ol' base so implementations can quickly verify their conversion/deconversion is working properly.
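A table like the one suggested above could be generated with a few lines of BigInt arithmetic. The sketch below treats the UUID as one 128-bit integer and writes it in each base; the digit sets (0-9A-Z for base 36, the Bitcoin-style set for base 58, 0-9A-Za-z for base 62) and the fixed widths are assumptions for illustration, since the draft would have to pin them down.

```javascript
// Digit sets below are assumptions for illustration; a spec would have to fix them.
const ALPHANUMERIC = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
const BASES = {
  base36: { alphabet: ALPHANUMERIC.slice(0, 36), width: 25 },
  base58: { alphabet: "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz", width: 22 },
  base62: { alphabet: ALPHANUMERIC, width: 22 },
};

// Repeated division by the base, most significant digit first,
// left-padded with the zero digit to a fixed width.
function encodeInteger(value, alphabet, width) {
  const base = BigInt(alphabet.length);
  let out = "";
  for (let rest = value; rest > 0n; rest /= base) {
    out = alphabet[Number(rest % base)] + out;
  }
  return out.padStart(width, alphabet[0]);
}

const EXAMPLE = "73E94FE0-E951-4153-AAF3-50E4E6089D9D";
const exampleValue = BigInt("0x" + EXAMPLE.replace(/-/g, ""));

const table = {};
for (const [name, { alphabet, width }] of Object.entries(BASES)) {
  table[name] = encodeInteger(exampleValue, alphabet, width);
}
console.log(table);
```

Running the same script in two independent implementations and comparing the printed rows would be one way to verify conversion and deconversion.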

broofa commented 2 years ago

@bradleypeabody Your proposed verbiage strikes a good balance.

That said, I would expect uuidjs to take a conservative approach here. One of the issues that project has is that it assumes the primary, preferred form of UUIDs is the 8-4-12 format. The only way you get a binary form is via the parse() method, which is a relatively new addition to the API. Adapting it to work with other encodings (while also dealing with tree-shaking concerns) will require some thought.

For example an encoding of any base between 2 and (?) can be implemented using the algorithm shown in Appendix N(link).

I would omit this. It's not relevant enough to warrant inclusion. At most, provide a reference implementation for the preferred Crockford-32 form.

@liosk writes:

I believe the spec should recommend a base-62ish format in addition to a base-36ish format because of the fact that a considerable number of UUID alternatives did choose base-62ish formats

This flies directly in the face of why Standards exist. A Standard should provide a canonical means by which systems interact. The one thing a Standard must do, and do well, is eliminate any ambiguity and uncertainty about how a system should behave. It's bad enough that we're adding one encoding (which I still think is a bad idea, btw). Adding two or more will muddy things to where we might as well not specify any at all.

broofa commented 2 years ago

@liosk writes:

But anyway this is a variation of broader "replacing or extending the original RFC" questions than just the text format issue.

This issue has had me pondering the whole english vs. metric units debate here in the US. Two competing formats, the inferior one widely adopted, the new one demonstrably better. Unfortunately the only thing worse than working in english units is working in english-and-metric combined. Switching requires navigating a valley of pain that, for the US, has proven too great.

'Kind of feels like we're tap-dancing around a similar problem here.

I'm actually sympathetic to @bradleypeabody's "Crockford32 all the things!" idea. I would prefer that over the current proposal where we support both (or, worse, multiple) formats.

But such an approach would have to be done as a new RFC that deprecates 4122. I don't see how else we could do it. So... is that what we should be doing? Is it time to just bite the bullet and acknowledge that the Right Solution here is an entirely new RFC?

(Note: I can't honestly say whether or not I'm saying this in earnest, or as a thinly veiled attempt to shut down this alternate encoding idea. 👿 I do suspect that future generations will look back on this as one of those, "If only Joseph Dombey had taken a different ship!" moments)

daegalus commented 2 years ago

Ok, I've been avoiding chiming in, because like I said previously I tend to hold unpopular opinions that I tend to try and ignore for the sake of playing nice with the general programming community and work. But I will try to be more detailed and explain some thoughts.

I think @broofa has a good point on what we are trying to achieve and what we are doing with this spec. Are we just extending the old RFC or trying to truly modernize UUID for the future?

If we are trying to just extend, we should honestly drop encodings, the extension stuff, the LUID stuff, etc. We should stick strictly to adding the new UUID versions and their benefits.

If we are truly trying to modernize UUID, we need to take a stronger approach to it and stop being wishy-washy on stuff and just treat this as a new spec that we call UUID Modern or UUID 2.0 or something. The reason UUIDs are falling out of favor is partly because of sorting and such, but also encoding and format. The 8-4-4-4-12 encoding is old, inefficient and messy, especially because the dashes make things much harder to parse and because of all the version/variant information that stays static. With a new encoding, that info is there, but encoded into a consistent string like everything else.

It's my opinion that if library authors choose not to implement it because they are conservative, new libraries will be made and gain popularity. Just because a library isn't supporting it doesn't mean there is no desire to use it by other people, and someone will come in and build it.

Here are my unpopular opinions, and then I will modify them to be more palatable for everyone:

Take a hard-line stance and make the changes we want to see, without worrying about backwards compatibility of encoding
We want to make an efficient, future proof RFC. While it's impossible to truly future proof, between UUIDv8 and supporting any base encoding, or multiple, we can make this last.
By that regard, we should require Base2-128 or even 256 support, not just make it optional. It's not hard to implement and there are fast, simple implementations that work well.

Now, I understand there are many conservative developers that will not like this. So maybe a balanced approach would be easier to stomach.

Split the RFC into 2. One includes UUID6,7,8 similar to the first few drafts. No new encodings, no 7E or LUID.
In the new modern UUID RFC, add extensions, LUID, force new encodings and jettison the old encoding.

Though this will cause confusion and issues. So maybe stick to just 1 RFC, make a hard-line decision on one new encoding or pair.

Choose 1 pair or 1 encoding, period. Make it mandatory but don't allow any others so that library developers have more streamlined work when updating their libraries to support the new encoding.

I dunno, I personally still think we should do only 1 RFC, with all the changes forced, toss any annoying backwards compatibility and go with it. I personally am ready to do the work to update my library. But again, I am a sea of unpopular opinions in development.

Also apologies for poor formatting, wrote this on my phone.

LiosK commented 2 years ago

Okay, the multiple alternatives idea is not popular. I don't really stick to this idea, but let me elaborate on it again as my point seems still misread. I would propose something like the following text:

Text Format

The canonical text representation of UUID is 8-4-4-4-12 and every library MUST support this format. Additionally, libraries SHOULD support the Base-36 format and Base-62 format (see Appendix XX for detailed algorithm and digit sets). The format used by a UUID string can be detected by the length of textual representation.

In this way the spec can define unambiguous data exchange formats (setting variable-length stuff aside) that all the implementations should follow. I have exactly the same opinion as the following @broofa's.

A Standard should provide a canonical means by which systems interact. The one thing a Standard must do, and do well, is eliminate any ambiguity and uncertainty about how a system should behave.

We're on the same page I believe.
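Length-based detection as described in the proposed text could look roughly like this. The sketch assumes fixed-width, zero-padded encodings (25 characters for base 36, 22 for base 62) and alphanumeric digit sets; `detectFormat` and the format names are hypothetical.

```javascript
// Lengths assume fixed-width, zero-padded encodings of a 128-bit value.
const FORMATS = [
  { name: "hex", length: 36, pattern: /^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$/i },
  { name: "base36", length: 25, pattern: /^[0-9a-z]{25}$/i },
  { name: "base62", length: 22, pattern: /^[0-9A-Za-z]{22}$/ },
];

// Pick the format by length first, then confirm the characters fit that format.
function detectFormat(text) {
  const format = FORMATS.find((f) => f.length === text.length);
  if (!format || !format.pattern.test(text)) {
    throw new SyntaxError("unrecognized UUID text: " + text);
  }
  return format.name;
}

console.log(detectFormat("73E94FE0-E951-4153-AAF3-50E4E6089D9D")); // hex
```

This only stays unambiguous while every supported format has a distinct length: an unpadded base-64 rendering of 128 bits is also 22 characters, so it could not be told apart from base 62 by length alone.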

bradleypeabody commented 2 years ago

Just one other idea of how we might approach this I wanted to throw out there:

What if we associated the new combined var-ver field (the 111b variant) with the Crockford base 32 text format? I.e. "The canonical text format for UUIDs with variant 10b (versions <= 6) remains 8-4-4-4-12 as per RFC4122. For the new variant 1110b specified in this document (versions 7 and 8) the canonical text format is Crockford base 32."

This would leave existing implementations as-is but also allow us to switch to Crockford base 32 format for v7 and v8, and have only one canonical format for a given version. Implementation burden associated with the new versions would be present but minimal.

(@LiosK I think it's fine to mention the idea of encodings in other bases - my opinion is that SHOULD is too strong and alternate encodings are a MAY, but regardless I think a core concept is that since UUIDs should be "as opaque as possible" it also stands to reason that plenty of applications will want to just generate UUIDs and treat them as regular strings from there on, in which case the encoding is solely a matter of application-specific requirements.)

martinheidegger commented 2 years ago

One thing about Crockford's format is that it is not an RFC and also has edge cases that are not well defined. (I had a conversation about this here, referencing a much longer article) It may be necessary to prepare a proper Crockford-compatible base32 RFC to reference.

daegalus commented 2 years ago

I think when we refer to Crockford's base32 we only mean his encoding alphabet and order, not the algorithm. I think we refer to the RFC 4648 algorithm but just use Crockford's alphabet choice. At least that's how I have it implemented in the Dart base32 library and many other base32 libraries I've seen.
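The distinction matters because 128 is not a multiple of 5, so two zero bits have to go somewhere. A minimal sketch of the two readings, with hypothetical function names: the RFC 4648 style reads the bits as a stream and zero-fills the last group, while the integer style (the way ULID-like identifiers are commonly encoded) treats the value as a number and pads at the front.

```javascript
const CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

// The UUID as a string of exactly 128 binary digits.
function to128Bits(uuid) {
  const hex = uuid.replace(/-/g, "");
  if (!/^[0-9a-fA-F]{32}$/.test(hex)) throw new Error("not a UUID: " + uuid);
  return BigInt("0x" + hex).toString(2).padStart(128, "0");
}

// Map each 5-bit group to one Crockford digit.
function encodeGroups(bits) {
  return bits.match(/.{5}/g).map((g) => CROCKFORD[parseInt(g, 2)]).join("");
}

// RFC 4648 style: groups start at the first bit; two zero bits fill the end.
function encodeStreamStyle(uuid) {
  return encodeGroups(to128Bits(uuid) + "00");
}

// Integer style: two zero bits are added at the front instead.
function encodeIntegerStyle(uuid) {
  return encodeGroups("00" + to128Bits(uuid));
}

const allOnes = "ffffffff-ffff-ffff-ffff-ffffffffffff";
console.log(encodeStreamStyle(allOnes));  // ZZZZZZZZZZZZZZZZZZZZZZZZZW
console.log(encodeIntegerStyle(allOnes)); // 7ZZZZZZZZZZZZZZZZZZZZZZZZZ
```

Both are 26 characters over the same alphabet, so a spec that only names the alphabet would leave implementations free to disagree.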

LiosK commented 2 years ago

@bradleypeabody,

my opinion is that SHOULD is too strong

You're right! It's my bad I implicitly meant "libraries SHOULD" by "implementations SHOULD". I adjusted the previous comment to clarify this. The updated text is a little bit awkward as a spec text, but it illustrates my intention well.

UUIDs should be "as opaque as possible"

I agree to this to some extent, but

plenty of applications will want to just generate UUIDs and treat them as regular strings from there on, in which case the encoding is solely a matter of application-specific requirements

I don't necessarily agree to this. A standard is helpful only when it coordinates multiple implementations to interact and work together. In this sense, the alternative encodings recommended in the spec are not application-specific things. If the spec clearly defines some encodings, then many libraries will implement the encodings in common. In this way, the spec can help multiple libraries and applications talk with each other using the new encodings.

ben221199 commented 2 years ago

Maybe we could reintroduce the legacy UUID format, next to the 8-4-4-4-12 format, so: 34dc23469000.0d.00.00.7c.5f.00.00.00 (legacy) and 8a885d04-1ceb-11c9-9fe8-08002b104860. This format is specific for Variant#0 and leaves out the reserved field.

Other encodings seem to be out of scope for me, because I think the decision to use other "encodings" is up to the developer. The only requirement should be that there is space for 128 bits.

broofa commented 2 years ago

What if we associated the new combined var-ver field (the 111b variant) with the Crockford base 32 text format?

@bradleypeabody Is the implication here that applications should not use 8-4-12 with the new versions? I don't see users responding well to that. Too many existing DB columns / function signatures / UI widgets are set up for 8-4-12.

broofa commented 2 years ago

Having fun on a weekend morning ....

From the PR4 draft (emphasis mine):

Where required, UUIDs defined by this specification and [RFC4122] MAY be encoded utilizing new techniques such as, but not limited to, Base32, Base36, or Base64. Applications MAY also utilize other encoding techniques such as modulo division or alternate alphabets such as Crockford's base32

Based on the above, I believe the following would all be valid encodings of 08a0c2eb-57c8-4bc5-ae66-1e7e39fd1d99:

⢙⠝⣽⠹⡾⠞⡦⢮⣅⡋⣈⡗⣫⣂⢠⠈
𓊉𓐋𓎕𓇅𓊄𓇌𓁊𓁪𓈧𓂷𓎆𓐙𓀄
🂹🃖🂱🃟🃙🃘🃇🂧🃆🃙🂺🃑🂫🃁🃜🂵🃋🂫🂬🃈🂨
̸̷̩̲̦͈̹̞̓̓͑̓ͮ̎͗͌̃́͡
𒐱𒐖𒐿𒐬𒐽𒑰𒑂𒑚𒐺𒑧𒑨𒑐𒐀𒐲𒑝𒐭𒑲𒑛

See https://codepen.io/broofa/pen/ZEJKWOQ?editors=0010 for details.
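To show how little the quoted wording constrains, here is a minimal sketch of a "base 256" encoding that maps each byte to one of the 256 Braille pattern characters (U+2800 to U+28FF). The byte order and the function names are assumptions made for this example and are not necessarily what the linked pen does.

```javascript
const BRAILLE_BASE = 0x2800; // first code point of the Braille Patterns block

// One Braille character per byte, first byte first (an assumption of this sketch).
function uuidToBraille(uuid) {
  const hex = uuid.replace(/-/g, "");
  if (!/^[0-9a-fA-F]{32}$/.test(hex)) throw new Error("not a UUID: " + uuid);
  return hex
    .match(/../g)
    .map((pair) => String.fromCodePoint(BRAILLE_BASE + parseInt(pair, 16)))
    .join("");
}

// Inverse mapping back to lowercase 8-4-4-4-12.
function brailleToUuid(text) {
  const chars = [...text];
  if (chars.length !== 16) throw new Error("expected 16 characters");
  const hex = chars
    .map((ch) => {
      const value = ch.codePointAt(0) - BRAILLE_BASE;
      if (value < 0 || value > 255) throw new Error("not a Braille pattern: " + ch);
      return value.toString(16).padStart(2, "0");
    })
    .join("");
  return `${hex.slice(0, 8)}-${hex.slice(8, 12)}-${hex.slice(12, 16)}-${hex.slice(16, 20)}-${hex.slice(20)}`;
}

const braille = uuidToBraille("08a0c2eb-57c8-4bc5-ae66-1e7e39fd1d99");
console.log(braille, brailleToUuid(braille));
```

The result round-trips losslessly in 16 characters, which is exactly why "any alphabet" wording gives libraries nothing concrete to interoperate on.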

bradleypeabody commented 2 years ago

In reply to a few recent posts:

@broofa

Too many existing DB columns / function signatures / UI widgets are set up for 8-4-12.

Unfortunately, you're probably very right about this. The question would then be if it's too much to ask for implementations doing the work to implement the new var-ver field to be updated. Probably it is, but I just wanted to throw it out there.

If this is generally the case, then my position would stay with the idea of keeping 8-4-4-4-12 as the standard, defining one additional format and call it "compact" or something like that, and then also just mention that people can do whatever they want for their own use cases, just don't expect it to be included in every UUID library.

We would basically be telling library authors that their toString() stays the same, we recommend adding compactString(), and if they do then also update parse() to accept either one, and if they don't like that then there's no law against using whatever encoding is convenient for their own application, it just won't be in the UUID library.
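A rough sketch of that library shape follows. The class name, the method names and the choice to encode the compact form as a 26-digit Crockford base32 number are all assumptions for illustration; the proposal itself only fixes the idea that `parse()` tells the two formats apart by the presence of "-".

```javascript
const CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

class Uuid {
  constructor(value) {
    if (value < 0n || value >= 1n << 128n) throw new RangeError("value out of range");
    this.value = value; // the 128 bits as a BigInt
  }

  // Default format: unchanged 8-4-4-4-12 hex.
  toString() {
    const h = this.value.toString(16).padStart(32, "0");
    return `${h.slice(0, 8)}-${h.slice(8, 12)}-${h.slice(12, 16)}-${h.slice(16, 20)}-${h.slice(20)}`;
  }

  // Assumed compact form: the value as a 26-digit Crockford base32 number.
  compactString() {
    let out = "";
    for (let v = this.value, i = 0; i < 26; i++, v >>= 5n) {
      out = CROCKFORD[Number(v & 31n)] + out;
    }
    return out;
  }

  // Accepts either format; the "-" character decides which parser runs.
  static parse(text) {
    if (text.includes("-")) {
      if (!/^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$/i.test(text)) {
        throw new SyntaxError("invalid hex UUID: " + text);
      }
      return new Uuid(BigInt("0x" + text.replace(/-/g, "")));
    }
    if (text.length !== 26) throw new SyntaxError("invalid compact UUID: " + text);
    let value = 0n;
    for (const ch of text.toUpperCase()) {
      const digit = CROCKFORD.indexOf(ch);
      if (digit < 0) throw new SyntaxError("invalid compact UUID: " + text);
      value = (value << 5n) | BigInt(digit);
    }
    return new Uuid(value);
  }
}

const id = Uuid.parse("73E94FE0-E951-4153-AAF3-50E4E6089D9D");
console.log(id.toString(), id.compactString());
```

Existing callers of `toString()` see no change, and code that receives a string from elsewhere can hand it to `parse()` without knowing which of the two formats was used.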

@LiosK

A standard is helpful only when it coordinates multiple implementations to interact and work together. In this sense, the alternative encodings recommended in the spec are not application-specific things.

I agree but I don't follow how having a bunch of variation in the encoding aligns with this idea (and reading this back again I realize I might be conflating your latest proposal with an earlier one). But overall, here's the thing I don't understand: Let's say we had a database that implemented several different text formats for UUID, including base32 and base64, and then you have an application which is using these generated IDs. Your application inserts a new record, which generates a new UUID (let's say for this example the database generated the UUID, although a similar problem exists regardless of where it is created). Now the app does "SELECT ID FROM ...". What format does the application expect to be returned? The only answers I can think of that make any sense are:

A single, specific, standard predefined format (e.g. 8-4-4-4-12).
It doesn't matter because the application will just use the opaque string.
Some predetermined format, e.g. specified as an option using a database feature designed for this. In this case, from my perspective, this is "application specific", because the only way to get this right is for the database and the application to agree ahead of time on which encoding was chosen.

Does it really help to list out a number of different possible formats? If I am a database vendor that is implementing this, how do I choose which of these various possible encodings I should actually spend time on and implement? And why is this the database vendor's problem, as opposed to just letting applications use their own language's encoding functions? Most languages already implement things like base32, base64 and others, so if the application already has to decide ahead of time which encoding to use, would they not be free to use whatever is convenient and available to them? (Like why is base36, mentioned in your text, somehow better than base64?) What if what is useful to me or available is not one of the ones that we thought to mention in the spec? Is that now "less valid" than choosing one of the options you've outlined? I just think there is way too much variation here to expect UUID library authors and databases, etc. to include all of this encoding in each implementation. And even if we just pick the two that you suggest - base36 and base62, in a way that could easily end up being worse because now everyone making a UUID implementation will feel compelled, but not required to implement these, and users will probably end up using base32 or base64 anyway, just because it's more convenient and familiar, and they won't know what encodings are available on the other side (whatever app we're worried about reading these UUIDs).

I think whatever recommendation is provided for this needs to help answer things like "I am a UUID library author, what encoding(s) should I implement vs leave to the user to figure out?" and the same thing for database vendors, etc.

And I don't understand how we can say that base36 and base62 are somehow a better answer than base32 and base64 or base32hex, or any number of other possible encodings. Is there a specific analysis here that makes base36 and base62 the optimum choice?

My goal with championing Crockford base32 has been to just introduce one format that is significantly more useful than 8-4-4-4-12, and (hopefully) to do it in a way that doesn't break existing implementations. I didn't pick Crockford base32 just because I liked it or for one or two specific reasons - my analysis as to why it has the highest utility is outlined above (works in many places - email, DNS, file names, URLs, case insensitive, plus "less swear words", and it can be reliably distinguished from 8-4-4-4-12 regardless of length, and probably a few points above I'm forgetting).

Anyway, sorry the above turned into a bit of rant. To reel it in, my specific feedback is, given this text:

The canonical text representation of UUID is 8-4-4-4-12 and every implementation MUST support this format. Additionally, implementations SHOULD support the Base-36 format and Base-62 format (see Appendix XX for detailed algorithm). The format used by a UUID string can be detected by the length of textual representation.

Why two formats, and why those two formats specifically? Every format we add is more variation and means people will have to figure out which format was chosen (and telling people to use the length I think is not a good idea, because we're leaving this open ended: what happens if someone uses base32 or base32hex? You can't tell the difference between those from the length). And these two base36 and base62 are not common formats, so every library implementor will have to figure out how to deal with this (whereas Crockford base32 is pretty common).

@Daegalus

Are we just extending the old RFC or trying to truly modernize UUID for the future?

I think we need to work this back and forth against the proposal instead of trying to answer this by itself. In an ideal world we would do both with one RFC. If this cannot be accomplished, then we can discuss other options.

ben221199 commented 2 years ago

I think we need to work this back and forth against the proposal instead of trying to answer this by itself. In an ideal world we would do both with one RFC. If this cannot be accomplished, then we can discuss other options.

This draft SHOULD "update" RFC 4122 by adding new versions, like UUIDv6, etc. I have started another draft apart from this repository that describes literally everything about UUID, also things that were not in RFC 4122, but SHOULD have been added back then. That draft will also describe UUIDv6 AFTER publication of this repository (uuid6-ietf-draft) and will "obsolete" RFC 4122. I think that is the best way. So, for this draft here, focus on the main purpose of this draft: new versions. Not the other "out-of-scope" shit. I will make an issue in the near future where I will sum up some things.

bradleypeabody commented 2 years ago

@ben221199 I think that regardless of how it is organized we should collaborate on the work. The current draft here definitely has some things that various people think is "out of scope" which I think is "necessary in order to actually make UUIDs useful for modern applications, so if we're not addressing it why are we doing all this work here". I look forward to your description separately of what you're referring to.

ben221199 commented 2 years ago

@bradleypeabody Yeah, collaboration is fine, but I think we should separate things that "update" the spec and things that "obsolete" the spec. I think I will write the first draft of what I have in mind and will show it then. I hope it will be somewhat clear then :)

LiosK commented 2 years ago

@bradleypeabody,

having a bunch of variation in the encoding

list out a number of different possible formats

I've never meant this. The any-base any-alphabet code I posted was just for illustration of the algorithm. At this stage, I would propose only two specific, concrete, unambiguous alternative formats: a case-insensitive but longer format and a shorter but case-sensitive format. Under this scenario, the spec can easily provide a concrete and unambiguous way to encode and parse multiple UUID formats. Applications that go for other application-specific encodings might face application-specific hassles but that's not what the standard should (or can) address.

Accordingly, most of the questions you raised are not really relevant to my point, and the questions that are relevant to my point are also relevant to your single Crockford32 alternative approach. What does a database return to SELECT ID FROM ... if the spec defines the canonical 8-4-4-4-12 and the alternative Crockford32? Currently, database vendors can only return 8-4-4-4-12 because there is no other consensus format to follow, but if the spec defines Crockford32 in addition, then the vendors can provide an option to return Crockford32 and can accept Crockford32 as a valid input in INSERT statements. Ultimately, the choice of format is application-specific, but the spec can help provide multiple options in making such a choice, and that is the only sensible way in which the spec can advocate Crockford32 to the real world.

To put it differently, if the spec adds one alternative Base-36 format, then UUID libraries will generally support a Base36String() method and as a result an application will be able to choose the Base-36 format for its application-specific ID. And if the spec adds one more alternative Base-62 format, then UUID libraries will generally support Base62String() and an application can choose this, too. Does the latter Base-62 do a lot of incremental harm? This is my point. Adding one alternative changes a lot of things already, but adding one more does not change as many things as the first alternative, while it can possibly double the number of customers served. IMO the cost-benefit profiles of "canonical 8-4-4-4-12 + one alternative" approach and "canonical 8-4-4-4-12 + two alternatives" approach are quite similar.

Also, the key problem I want to solve is that currently it is not a trivial task at all for an application to go for a compact UUID format. A modern application can be implemented using multiple languages (e.g. JS/Swift/Kotlin for frontend, Go for backend, SQL for database, Python for log analysis) and it is not an easy job to find right Base-X libraries for all of these languages involved. As you've observed, UUID is generally used as opaque string but I would suspect this practice is just a forced choice due to the difficulty to switch to a right encoding depending on contexts.

Is there a specific analysis here that makes base36 and base62 the optimum choice?

Actually, I don't have a strong opinion over specific bases or alphabets, so in the proposed text I would actually mean Base-36ish and Base-62ish, but a lot of people have misread this notation as the any-base any-alphabet approach, and thus I tried to be specific there.

That said, I personally believe that Base-36 is better than Base-32 from a library user's point of view just because Base-36 saves one more character, and Base-62 is better than Base-64 just because Base-62 uses alphanumeric characters only. These two benefits are sufficient reasons to push library authors to harder work, but I admit it's just a matter of view point.
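The length comparison can be checked directly by counting how many digits the largest 128-bit value needs in each base. A short sketch, with `digitsFor128Bits` as a hypothetical helper name:

```javascript
// Number of digits needed to write the largest 128-bit value in a given base.
function digitsFor128Bits(base) {
  let max = (1n << 128n) - 1n;
  let digits = 0;
  while (max > 0n) {
    max /= BigInt(base); // BigInt division truncates
    digits++;
  }
  return digits;
}

for (const base of [16, 32, 36, 58, 62, 64]) {
  console.log(`base ${base}: ${digitsFor128Bits(base)} characters`);
}
// base 16: 32 characters
// base 32: 26 characters
// base 36: 25 characters
// base 58: 22 characters
// base 62: 22 characters
// base 64: 22 characters
```

Base 36 is one character shorter than base 32, and base 62 is the same length as unpadded base 64 while using alphanumeric characters only.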

And these two base36 and base62 are not common formats, so every library implementor will have to figure out how to deal with this (whereas Crockford base32 is pretty common).

A little bit different topic, but I think it's an implicit assumption we've been making. I've observed that most ULID implementations implement their own Crockford32 encoder/decoder instead of relying on an external library. Based on this observation and from my experience implementing base32hex in many languages in my personal project, I'd speculate that Base-X encoding/decoding is such a small function that library authors feel hesitant to add another external dependency. Therefore, availability of existing implementations might not be an important factor in choosing a right encoding/decoding algorithm, because library authors are likely to implement it from scratch anyway.

@ben221199,

I think updating RFC 4122 doesn't mean that no addition of features is permitted. We can add a lot of things to RFC 4122 without obsoleting any of its components. Only a Sith deals in absolutes. We have a bunch of options in between.

bradleypeabody commented 2 years ago

@LiosK Okay and thanks for taking the time to write that up and clarify. I think I understand where you're coming from.

In terms of a specific proposal, it sounds like you're proposing two things: 1. the bases you've selected and 2. having two different alternate encodings (as opposed to just one). Just to state my concerns on these two:

Regarding base36 vs base32 crockford, my main concern is just what is required to implement and how familiar this will/won't be for developers. Crockford base32 is a lot more well-known and well supported, and can be adapted easily from any base32 encoder. I think this makes the bar lower in terms of what hoops developers will have to jump through to implement, and I think this is an important concern.

I'd speculate that Base-X encoding/decoding is such a small function that library authors feel hesitant to add another external dependency

Yeah it's hard to say and varies from language to language. In Go for example there is a base32 encoder/decoder which supports custom alphabets in the standard library, so there's essentially zero benefit to writing your own. I think the same is true of Python. But I realize the situation is different in JS and probably some other languages too.

Regarding adding two more encodings instead of one, my concern here is making it so library authors can still have a single Parse() function which can reliably deduce the format from the input. Yes, I know that you can use the length to do this for base36 vs base62, but it seems brittle in the face of UUID Long (yes I know a bunch of people hate this idea and it's a separate subject - please yell at me/others on the appropriate thread about that one). There is also a bit of a "slippery slope" aspect to each additional encoding that is added - "why not support base32 Crockford AND base36, since we can use the length to distinguish", and so on. After a decision is made here, it will have to be defended against future reviewers and the IETF.

If we pick just one additional format that is easy for people to implement, the whole thing becomes pretty simple. The String() method in people's libraries stays the same (8-4-12). A CompactString() or similar is added. And Parse() is updated to accommodate the output of CompactString() in addition to String(). (I feel like there's a better name than "compact"...). In the database example from earlier, the database vendor would need to provide an option to send the result back in "compact" format, but at least any Parse() function that is updated to match the spec can reliably receive either text format - I think this property goes a long way toward making this approach work.

If we could deprecate the old format, that in some ways would be ideal, but I do think it's too big of a change and could easily harm adoption more than it helped by declaring every existing implementation as deprecated - as opposed to an improvement that can be implemented by library authors as time allows. Deprecating things in RFC4122 I suspect will also reduce the odds of this new spec making it to an RFC.

ben221199 commented 2 years ago

@LiosK

I think updating RFC 4122 doesn't mean that no addition of features is permitted. We can add a lot of things to RFC 4122 without obsoleting any of its components. Only a Sith deals in absolutes. We have a bunch of options in between.

When updating an RFC, it adds things to the specification. Take a look at RFC 3501 for IMAP 4rev1. It describes the whole IMAP protocol. It is UPDATED by some other RFCs, that describe extensions, and it is OBSOLETED by one other RFC, that describes the whole IMAP protocol again, but IMAP 4rev2. I think we should do something similar here.

LiosK commented 2 years ago

@bradleypeabody,

I understand your point. I don't mean to stick to my idea and I'm willing to follow the consensus. I just believe that the two-alternatives approach is one viable option for the new RFC, and I wanted it to be on the table and correctly understood.

A couple of technical points:

it seems brittle in the face of UUID Long

I believe alternative formats can work with the variable-length strategy if designed carefully, but they definitely constrain each other. They are in a trade-off relationship; we might be forced to make an exclusive choice. We have to carefully choose which group of customers to serve, and as a result I am against the variable-length idea.

Crockford base32 is a lot more well-known and well supported

Base-62 is tricky, but JS/Java/Rust have native encoding/decoding support for Base-36 and Python has native decoding support. JS/Rust/Python do not even require an import statement. We can also find some hidden efforts by Swift developers. In JS, for example, Base-36 requires literally zero effort because the following one line works perfectly:

0x0622e7c2_d01a_7d65_8fe3_f57b68c36204n.toString(36).padStart(25, "0");

Base-36 tends to be implemented as part of BigInt operations and thus it's more widely available than you believe (than Crockford32 I guess). Base-36 sounds more prominent to me based on your argument.

Anyway, Base-X encoding/decoding needs only a hundred lines of code and I just feel like working for library users rather than authors.
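To illustrate the Base-X point, here is a minimal sketch of a generic encoder/decoder built on BigInt. The function names and the two alphabets are assumptions for illustration only; in particular, the Base-62 alphabet order (digits, uppercase, lowercase) is one possible choice, picked here because it follows ASCII order.

```javascript
const BASE36 = "0123456789abcdefghijklmnopqrstuvwxyz";
const BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

// Encode a non-negative BigInt with the given alphabet, left-padded to `width`.
function encodeBaseX(value, alphabet, width) {
  const base = BigInt(alphabet.length);
  let out = "";
  do {
    out = alphabet[Number(value % base)] + out;
    value /= base; // BigInt division truncates
  } while (value > 0n);
  return out.padStart(width, alphabet[0]);
}

// Decode a string written in the given alphabet back to a BigInt.
function decodeBaseX(text, alphabet) {
  const base = BigInt(alphabet.length);
  let value = 0n;
  for (const ch of text) {
    const digit = alphabet.indexOf(ch);
    if (digit < 0) throw new Error(`invalid character: ${ch}`);
    value = value * base + BigInt(digit);
  }
  return value;
}

const uuid = 0x0622e7c2_d01a_7d65_8fe3_f57b68c36204n;
console.log(encodeBaseX(uuid, BASE36, 25));
console.log(encodeBaseX(uuid, BASE62, 22));
```

With the Base-36 alphabet and a width of 25, the encoder should produce the same string as the `toString(36).padStart(25, "0")` one-liner above; the decoder is the part that BigInt does not provide out of the box in JS.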

@ben221199,

We can add new versions to RFC 4122 without obsoleting it, and similarly we can add new text formats to RFC 4122 without obsoleting it, can't we?

ben221199 commented 2 years ago

@LiosK

We can add new versions to RFC 4122 without obsoleting it, and similarly we can add new text formats to RFC 4122 without obsoleting it, can't we?

Yes, we can add new versions without obsoleting RFC 4122. So, we are "updating" it. The same should be possible for text formats, but of course we should think carefully about the purpose of any new text format.

The draft I'm writing in another repository at the moment is planned to "obsolete" RFC 4122, because the specification describes already existing variants and versions AND could/will (maybe) have new ones that are described in this one.

Here is a post that explains it a little bit and ironically points to RFCs defining the terms: https://stackoverflow.com/questions/32873577/whats-the-difference-between-obsoletes-and-updates-in-rfcs

kyzer-davis commented 2 years ago

Cross-posting my comment from the variable-length UUID thread for visibility.

Group, great discussion, this is why I author the proposed RFC text!

Converting from GitHub threads to "RFC Speak" always drives great conversations and uncovers things that may not have been considered. PR uuid6/uuid6-ietf-draft#85's text has been written in a way that I can easily remove E Variant, Alt Encoding, and UUID Long or transpose that XML structure to an alternate Draft that focuses on these topics.

That being said, we have a few engineering challenges with these sections but I am confident this group will be able to derive a great solution! I reviewed the last comments and I will summarize a few of the topics as usual.

Signaling UUID Alt Encoding Method(s)

[..snip..] If System A creates a UUID / UUID Long, encodes it with a random method and sends it to System B as urn:uuid:<encoded_uuid> how does system B determine how to decode that UUID?

Shared Knowledge Systems: We all know we are steering away from shared systems.
draft-multiformats-multibase: This is another Draft RFC. Until IANA adopts it I cannot in good faith adopt it into this draft.
Pick one base encoding: Everybody is on the same page and encode/decode works perfectly fine.
- Downsides: Everybody has an opinion on this matter and usually has a reason for why their base is the best base. Possible solution: Pick one and steer all others to v8/v8E?
Extend UUID's URN: Append the base along with urn:uuid: thus urn:uuid:<base>:<encoded_uuid> such as urn:uuid:base32:3JT57U1QAH859QLSQGSJJ0H7CT

[..other topics from comment truncated.. see the original for more...]

General URN Author note, I would need to do a deep dive on RFC8141 to ensure any potential URN proposals are valid syntax before authoring text.

Editor Note: Must include text about assuming urn:uuid: implicitly equals urn:uuid:base16hd:128: for backwards compatibility reasons. base16hd - Base16 UUID with Hex + Dashes.
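As a way to picture the URN option, the following is a loose sketch of how a receiver might split such a URN into encoding, bit length and value. It is not the syntax of any draft: the function name `parseUuidUrn`, the three accepted shapes, and the 128-bit default when no length is given are all assumptions, and validity against RFC8141 is not checked.

```javascript
// Hypothetical splitter for the URN shapes mentioned in this comment:
//   urn:uuid:<value>                 -> treated as base16hd, 128 bits
//   urn:uuid:<base>:<value>          -> 128 bits assumed
//   urn:uuid:<base>:<bits>:<value>
function parseUuidUrn(urn) {
  const parts = urn.split(":");
  if (
    parts.length < 3 ||
    parts[0].toLowerCase() !== "urn" ||
    parts[1].toLowerCase() !== "uuid"
  ) {
    throw new Error("not a urn:uuid: URN");
  }
  const rest = parts.slice(2);
  switch (rest.length) {
    case 1:
      // Backwards-compatible default: hex with dashes, 128 bits.
      return { base: "base16hd", bits: 128, value: rest[0] };
    case 2:
      return { base: rest[0], bits: 128, value: rest[1] };
    case 3: {
      const bits = Number(rest[1]);
      if (!Number.isInteger(bits) || bits <= 0) {
        throw new Error("invalid bit length");
      }
      return { base: rest[0], bits, value: rest[2] };
    }
    default:
      throw new Error("unexpected number of URN components");
  }
}

console.log(parseUuidUrn("urn:uuid:0622e7c2-d01a-7d65-8fe3-f57b68c36204"));
console.log(parseUuidUrn("urn:uuid:base32:3JT57U1QAH859QLSQGSJJ0H7CT"));
```

The receiver would still need a decoder for whatever `base` names end up being defined; this only shows that the encoding can be read from the URN itself instead of being guessed from the value.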

sergeyprokhorenko commented 2 years ago

I am convinced that JSON is a much more flexible and convenient message format than URN. See uuid6/new-uuid-encoding-techniques-ietf-draft#2

kyzer-davis commented 2 years ago

Announcement

I had a great discussion with @bradleypeabody and this topic has officially been marked out of scope for Draft 03 (and any future draft.) The XML text is retained and over the next few weeks I will author a separate Draft 00 which includes this topic specifically.

For now please focus on the technical challenges proposed by my previous comment: uuid6/new-uuid-encoding-techniques-ietf-draft#3

Edit: To further clarify, Draft 03 will cover UUIDv6 through v8 + Max UUID. The new Draft 00 will cover E Variant, Alternate Encoding and UUID Long. These are two drafts that cover different topics, so implementations may choose what they want to support, e.g. an implementation supports RFC8675309 for v7 but not RFC123456789 for alt encodings.

kyzer-davis commented 2 years ago

Group,

Following the previous announcement I have drafted up a new RFC Draft document to cover this topic.

Since the discussion threads are in this repo I decided to create a new folder for the topic. GitHub Pages picks up the folder nicely and thus Draft 00 of "New UUID Encoding Techniques" can be found here:

Additionally, I have authored an "Extended UUID URN namespace" for conveying the encoding type and length of a UUID to other applications defined by this document. I still have more research on URNs to do, but I feel confident enough that the proposed URN is backwards compatible with RFC4122 and also compatible with RFC8141.

Personal Note: I do think "inclusive" rather than "exclusive" is the best way to ensure that we benefit as many folks as possible. If we can solve the problem of conveying the encoding type (and length for UUID Long) then I believe we give implementers a tremendous amount of power they did not have before.

That being said, as always, I look forward to your responses bringing up issues, problems, caveats, and things to think about based on your point of view!

daegalus commented 2 years ago

Based on the most recent post about the draft on Hacker News, an alternate encoding is the most commonly mentioned desire. https://news.ycombinator.com/item?id=31715119

TXT: https://raw.githubusercontent.com/uuid6/uuid6-ietf-draft/master/new-encoding-techniques/draft-davis-peabody-dispatch-new-uuid-encoding-techniques-00.txt