uuid6 / new-uuid-encoding-techniques-ietf-draft

New UUID Encoding Techniques

Discussion: Alternate Text Encoding Methods (Crockford's Base32, etc) #3

Open fabiolimace opened 4 years ago

fabiolimace commented 4 years ago

I think the 'base32hex' encoding, described in Section 7 of RFC 4648, should be used instead of creating a new one. I know that the algorithm is the same with a different alphabet, but 'base32hex' already exists for exactly this purpose.

bradleypeabody commented 4 years ago

The reason I wanted to change the base32 alphabet is that it makes implementation (particularly as a DB primary key) simpler when UUIDs always sort exactly the same when treated as raw bytes. This holds true for the binary format I'm proposing, but I think it would be even better if it were also true for the text format(s). For some applications sequence is critical, and having the text format sort values in one order and the binary format in another has no upside, only downsides. In my opinion this outweighs the benefit of using the same alphabet that has been used in the past.

@douglascrockford 's base32 (https://www.crockford.com/base32.html) actually looks like a better alternative to what I've laid out in the draft and has the same properties. I'm considering updating the draft to use that.

fabiolimace commented 4 years ago

I understand why you wanted to change the 'base32' alphabet order from A-Z2-7 to 2-7A-Z. However, RFC 4648 also describes another alphabet called 'base32hex' with the character sequence 0-9A-V. According to Section 7 of the document, "one property with this alphabet, which the base64 and base32 alphabets lack, is that encoded data maintains its sort order when the encoded data is compared bit-wise".

I agree that Crockford base32 is much better than 'base32' and 'base32hex'. In addition to the property of keeping the sort order, it avoids confusion between similar characters and prevents accidental obscenity. I am not a native English speaker, but I know some obscene words that can be formed with the letter 'U'. There are other extremely obscene words with 'U' in Romance languages like Italian, Spanish, French and Portuguese. There is no need to mention them here.

Before opening this issue, I was thinking of suggesting Crockford encoding, but I was afraid it would not be approved, since 'base32hex' already exists for the sort order problem.

I vote for @douglascrockford 's base32 too.

P.S.: another point is that Crockford's encoding is adopted by other projects like the ULID specification.

bradleypeabody commented 4 years ago

Okay, great. Unless some better idea comes along I'll update the spec with Crockford's base32 version the next time I do a round of changes. (I was not aware of the difference between 'base32' and 'base32hex' - thanks for pointing that out.)

I think the key idea here is to balance and weigh the concept of "don't invent something new if the existing standard solves the problem" against "use the best tool for the job". I'm sure the question of why not use RFC 4648 will come up again during the standardization process, but as it stands now I think the argument in favor of using something that sorts correctly ("best tool for the job") is stronger than the "use the existing standard" argument.

fabiolimace commented 4 years ago

Great! I have no objections. This issue is solved for me.

fabiolimace commented 3 years ago

@bradleypeabody , have you seen this IETF draft that proposes a compact representation for UUIDs?

The draft aims to update the RFC-4122 to include a new textual representation for UUIDs encoded to base64url or base32. The new representation is temporarily called "UUID-NCName".

bradleypeabody commented 3 years ago

Interesting - I will definitely look into it further and see if I can connect up with the author to collaborate.

bradleypeabody commented 3 years ago

I wrote to Dorian Taylor (the draft author) about the possibility of collaborating. We'll see where that discussion goes.

fabiolimace commented 3 years ago

Great!

Should we open this issue again to show that there is work in progress? Or is it better to leave it closed?

bradleypeabody commented 3 years ago

Good point, reopening it.

daegalus commented 3 years ago

I just wanted to add my voice: we should definitely consider base32 encoding, or base58 encoding, which simply adds both cases of letters: 123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz (it doesn't include 0, O, I or l). And it would make UUIDs fairly short.

If we do use base32, we should require it to be lowercase; certain characters are easier to distinguish in most fonts when they are lowercase.

I also hope you merge with the NCName draft; the UUID format is a bit verbose and long for most use cases, so a different encoding would be great.

bradleypeabody commented 3 years ago

The current proposal does not include an alternate encoding, but let's see how the discussion goes with the IETF. Personally I am in favor of documenting Crockford's base32 (https://www.crockford.com/base32.html) as a valid alternative encoding.

fabiolimace commented 2 years ago

In this comment I will list my concerns about the base-32 format.

Which algorithm will be used: the one described in RFC 4648, or generic modulo division?

The RFC-4648 algorithm is more appropriate for a variable-length bit stream, for example, a variable-length UUID. But it can only be used with certain bases (radixes): 16, 32 and 64.

The generic modulo algorithm is useful for encoding integers such as int, long, i32, u32, i64, u64, etc. It can also be used for (almost) any base (radix): 2, 8, 10, 16, 32, 36, 58, 62, 64 and so on. There are some projects on GitHub that aim to be universal base-n encoders.

So if we choose RFC-4648, we can only encode to base-16, base-32 and base-64. But we can't encode to base-58 or base-62, for example.

If we choose modulo division, it will not be easy in some languages to encode UUIDs due to the maximum sizes of primitive types. The implementer will have to use big numbers, like BigInteger in Java, and make some optimizations if efficiency is needed.

The RFC-4648 algorithm is much faster than the modulo division algorithm, but it produces a completely different result. If you want to generate an RFC-4648 output using modulo division, you must pad the input with bits to complete a group of 40 bits.

How to represent UUID as a URN using base-32?

Can we just concatenate the prefix urn:uuid: with the base-32 string like this example?

urn:uuid:01BX5ZZKBKACTAV9WEVGEMMVRY

Is the alphabet used to encode a UUID UPPERCASE or lowercase?

The Crockford alphabet is UPPERCASE encoded, just like the alphabet in RFC-4648. Will base-32 UUIDs also be encoded in uppercase? I think it must be lowercase to be consistent with the current canonical string representation. Either way, this needs to be present in the draft. Although decoders are case insensitive, if a database field contains mixed cases, the sort order can be compromised: UPPERCASE values come before lowercase in the "C" collation.

Reminder: ULIDs are UPPERCASE.

Are there other reasons to use Crockford's alphabet?

The Crockford alphabet is very good for our eyes. But the 'flexibility' to decode 'I' and 'L' as '1', and 'O' as '0', can cause a small problem: it maps more than one string to a single UUID. For example, these 2 strings could exist as different primary keys in a database: 'OLBX5ZZKBKACTAV9WEVGEMMVRY' and '01BX5ZZKBKACTAV9WEVGEMMVRY'. That breaks the uniqueness and sort-order promise of UUIDs.
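A minimal sketch of the problem, using a hypothetical permissive decoder that follows Crockford's substitution rules:

```python
# Crockford's base32 alphabet (no I, L, O, U)
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
DECODE = {c: i for i, c in enumerate(ALPHABET)}
DECODE.update({c.lower(): i for i, c in enumerate(ALPHABET)})
# Crockford's permissive rules: I/i/L/l read as 1, O/o read as 0
for c in "IiLl":
    DECODE[c] = 1
for c in "Oo":
    DECODE[c] = 0

def decode(s: str) -> int:
    """Decode a Crockford base32 string to an integer."""
    n = 0
    for ch in s:
        n = n * 32 + DECODE[ch]
    return n

# Two different strings, one value -- a problem for primary keys:
assert decode("OLBX5ZZKBKACTAV9WEVGEMMVRY") == decode("01BX5ZZKBKACTAV9WEVGEMMVRY")
```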

Also, a UUID encoded with the Crockford base-32 alphabet can be confused with a ULID.

Calculate the size of variable-length UUIDs in base-32

The formula for calculating the size of a variable-length UUID in base-32 MUST be included in the draft. We know that a base-32-encoded UUID is 26 characters long (no padding), but we will have to calculate this for each UUID length other than the default of 128 bits.
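Assuming an unpadded encoding, the formula is just the bit length divided by five, rounded up:

```python
import math

def base32_chars(uuid_bits: int) -> int:
    """Characters needed to base-32 encode a UUID of the given bit length."""
    return math.ceil(uuid_bits / 5)

print(base32_chars(128))  # 26
print(base32_chars(160))  # 32
```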

About the optional check symbol

The check character at the end of a Crockford-encoded UUID could be a problem with the variable-length UUID. Is the last character part of the UUID or just a check symbol?

About NCName representation

I think it would be cool to join forces with the proposal of the NCName format. Its author did a great job in my opinion. The algorithm looks a bit complicated at first, but there are good reasons for it.

If it is not possible to work together, maybe it's better not to compete.

EDIT: the base-32 and base-64 alphabets ('A-Z2-7' and 'A-Za-z0-9', respectively) used by the NCName proposal do not produce lexicographically sortable strings. So it is not desirable to use them with UUIDv6 and UUIDv7. It breaks the promise of ascending order.

Finally...

If I could, I would vote to leave base-32 out of scope for the many questions that may arise.

kyzer-davis commented 2 years ago

Re-opened and linking thread from Draft 03 PR: https://github.com/uuid6/uuid6-ietf-draft/pull/58#discussion_r809197767

daegalus commented 2 years ago

I think we should stay modern and use lowercase, regardless of what is chosen.

Also for base32 we can always use our own alphabet. There are many alternatives, I wrote a base32 library for Dart and implemented multiple variants, including the option for a custom alphabet.

Example:

If you notice some of them are lowercase.

We could also consider Base58.

If you want, I can write up some quick code to encode using the various bases to post as examples to see what we feel we like best.

Since I have UUID6/7/8 (draft 1) implemented in my dart uuid library and a base32 library I created for an OTP library, I can spit out examples pretty quick. I also have a local Base58 library and base64 is built in.

fabiolimace commented 2 years ago

About base-32

As for the Crockford alphabet, to adopt a subset of it, we have to list the restrictions in the document: (1) all lowercase letters; (2) no = padding; (3) RFC-4648 algorithm (assuming variable-length UUID); (4) with bit pads to form groups of 40 bits (assuming RFC-4648 algorithm); (5) do not decode the letters 'Oo', 'Ii' and 'Ll'; (6) do not confuse with ULID (!).

If the base32hex alphabet defined in RFC-4648 is adopted (don't confuse it with base32), the only restrictions are: (1) all lowercase letters; (2) no = padding. Much easier to explain.

If the main reason for adopting the Crockford alphabet is human friendliness, then that goal can also be achieved with the Geohash alphabet, which is already lowercase and prohibits 'a', 'i', 'l' and 'o'.

The other base-32 alphabets are not lexicographically sortable.

About base-58 and base-62

I was also considering base-58 and base-62 as they are the same size as base-64 strings (no pads). But they MIX uppercase and lowercase letters. It's a problem because one of the reasons for this draft is the lexicographical order. A typical default collation is 'C', which sorts all uppercase letters BEFORE all lowercase letters. It works differently than other collations like 'en_US'. Below is an example using PostgreSQL. Note that the sort orders at the end of the code block are VERY different. The conclusion is that if you want a consistent sort order on the varchar key, don't mix uppercase and lowercase letters.

----------------------------------------------------
-- CREATING TABLES WITH DIFFERENT COLLATIONS
----------------------------------------------------

-- Standard Collation (ASCII order)
create temporary table tmp_collate_c (
    name character varying(255) COLLATE "C"
);

-- Collation for American English ('natural' order?)
create temporary table tmp_collate_us (
    name character varying(255) COLLATE "en_US"
);

----------------------------------------------------
-- INSERTING VALUES
----------------------------------------------------

insert into tmp_collate_c values ('AAAAAA');
insert into tmp_collate_c values ('BBBBBB');
insert into tmp_collate_c values ('CCCCCC');
insert into tmp_collate_c values ('aaaaaa');
insert into tmp_collate_c values ('bbbbbb');
insert into tmp_collate_c values ('cccccc');
insert into tmp_collate_c values ('AAAaaa');
insert into tmp_collate_c values ('BBBbbb');
insert into tmp_collate_c values ('CCCccc');
insert into tmp_collate_c values ('aaaAAA');
insert into tmp_collate_c values ('bbbBBB');
insert into tmp_collate_c values ('cccCCC');

insert into tmp_collate_us values ('AAAAAA');
insert into tmp_collate_us values ('BBBBBB');
insert into tmp_collate_us values ('CCCCCC');
insert into tmp_collate_us values ('aaaaaa');
insert into tmp_collate_us values ('bbbbbb');
insert into tmp_collate_us values ('cccccc');
insert into tmp_collate_us values ('AAAaaa');
insert into tmp_collate_us values ('BBBbbb');
insert into tmp_collate_us values ('CCCccc');
insert into tmp_collate_us values ('aaaAAA');
insert into tmp_collate_us values ('bbbBBB');
insert into tmp_collate_us values ('cccCCC');

----------------------------------------------------
-- THE RESULTS
----------------------------------------------------

select * from tmp_collate_c order by 1;

AAAAAA
AAAaaa
BBBBBB
BBBbbb
CCCCCC
CCCccc
aaaAAA
aaaaaa
bbbBBB
bbbbbb
cccCCC
cccccc

select * from tmp_collate_us order by 1;

aaaaaa
aaaAAA
AAAaaa
AAAAAA
bbbbbb
bbbBBB
BBBbbb
BBBBBB
cccccc
cccCCC
CCCccc
CCCCCC

bradleypeabody commented 2 years ago

The conclusion is that if you want a consistent sort order on the varchar key, don't mix uppercase and lowercase letters.

I agree. With how common case-insensitivity is in real-world applications, I don't think it would be wise to use a text format that relies on case sensitivity for proper ordering (one of the key things UUIDv6 and 7 address).

As for the Crockford alphabet, to adopt a subset of it, we have to list the restrictions in the document: (1) all lowercase letters; (2) no = padding; (3) RFC-4648 algorithm (assuming variable-length UUID); (4) with bit pads to form groups of 40 bits (assuming RFC-4648 algorithm); (5) do not decode the letters 'Oo', 'Ii' and 'Ll'; (6) do not confuse with ULID (!).

I agree but I think the simplicity is we just say something to the effect of:

"Encode every five bits with this alphabet: 0123456789abcdefghjkmnpqrstvwxyz If you don't have 5 bits for the last character then fill with zeros. There is no padding. When decoding discard final bits that don't form a complete byte. Encoding SHOULD emit lower case letters as per the alphabet shown. When decoding, implementations MUST accept either upper or lower case letters. Implementations MAY choose to interpret additional characters per "Crockford base32" or otherwise, as long as it doesn't conflict with the above alphabet."

It's terse and I'm sure will need to be expanded a bit, but I believe it does cover everything and makes it so applications can just use an existing Crockford base32 library for the encoding or just make something with that specific alphabet.

(6) do not confuse with ULID (!).

You will notice that if we use 48 bits for the timestamp and move to the var-ver field then ULID is actually compatible both in text and binary format with UUIDv7, with the exception of the requirement of setting var-ver at byte 9 to 0xE7. While this is not vital to the success of the spec, I don't think it's a bad thing either.

bradleypeabody commented 2 years ago

And I'll also add here that I do think we should just pick one additional text format that is as useful as possible (specifically: as compact as possible while being case insensitive, and when treated as raw ASCII bytes sort the same as the raw UUID bytes, and it should be easy and reliable to distinguish from the old hex+dashes format). We don't want a proliferation of various possible formats, because we would like whatever ParseUUIDFromString() functions get written to be simple and unambiguous.

Keeping in mind one of the key goals here: If you were using this as an ID for a database record, and you did SELECT id FROM x;, what would you want to see come back for that ID column? The existing hex format with dashes is just unnecessarily long - that's really the key thing we'd be addressing with this.

Also in some cases there may not be great binary encoding/decoding support built into the database and someone might end up just storing UUIDs in text format as a string in the id field, in which case the length of the text format has a direct impact on size and performance, so it "matters" in that case and is not just about aesthetics and preference.

fabiolimace commented 2 years ago

You will notice that if we use 48 bits for the timestamp and move to the var-ver field then ULID is actually compatible both in text and binary format with UUIDv7

I disagree on this point. Even with the same timestamp size and the same charset, the base 32 strings will be different because the ULID specification uses modulo operations to encode the 128 bits to 26 characters. See the table below:

| algorithm | 01234567-0123-0123-0123-012345678abc | ffffffff-ffff-ffff-ffff-ffffffffffff |
|-----------|--------------------------------------|--------------------------------------|
| RFC-4648  | 04hmasr14c0j609304hmaswaqg           | zzzzzzzzzzzzzzzzzzzzzzzzzw           |
| Modulo    | 014d2pe09304hg28r14d2pf2nw           | 7zzzzzzzzzzzzzzzzzzzzzzzzz           |
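Both rows of the table can be reproduced with a short sketch (Crockford's alphabet in lowercase; the helper names below are mine, not from either spec):

```python
import uuid

# Crockford's base32 alphabet, lowercase
ALPHABET = "0123456789abcdefghjkmnpqrstvwxyz"

def encode_bitstream(u: uuid.UUID) -> str:
    """RFC-4648 style: 5-bit groups from the top, final group zero-filled."""
    n = u.int << 2  # 128 bits -> 130 bits (26 * 5)
    return "".join(ALPHABET[(n >> (5 * i)) & 31] for i in range(25, -1, -1))

def encode_modulo(u: uuid.UUID) -> str:
    """ULID style: repeated modulo division of the whole 128-bit integer."""
    n, out = u.int, []
    for _ in range(26):
        n, r = divmod(n, 32)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

u = uuid.UUID("01234567-0123-0123-0123-012345678abc")
print(encode_bitstream(u))  # 04hmasr14c0j609304hmaswaqg
print(encode_modulo(u))     # 014d2pe09304hg28r14d2pf2nw
```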

bradleypeabody commented 2 years ago

@fabiolimace Wow, I totally missed that. You're right. That is really strange - it seems like they prioritized breaking the string representation up into a specific timestamp part and random part, which seems like an odd choice over the simplicity of just assembling the raw bytes and then encoding the whole thing. They also waste a couple of bits (10 characters is 50 bits, not 48).

Okay, yeah well forget ULID then. I do think the rest of the approach is sound though.

fabiolimace commented 2 years ago

They had to generate the random component char by char to support browsers that only provide Math.random() as a random source. The random component is 80 bits long, i.e., 2 groups of 40 bits, which is the perfect "cycle" for encoding to base-32 (the same size as the groups of bits in RFC-4648). That's why the approach that encodes the time component separately from the random component works. Splitting is a workaround.

LiosK commented 2 years ago

Would add one thing re base32: base32hex allows implementers to leverage standard language features in some languages. For example, in JavaScript, the following code produces base32hex text representation:

// Assume h40 = higher 40 bits, m40 = middle 40 bits, and l48 = lower 48 bits of
// UUID stored as IEEE 754 number values
const txt =
  h40.toString(32).padStart(8, "0") +
  m40.toString(32).padStart(8, "0") +
  (l48 * 4).toString(32).padStart(10, "0");

Translating one set of alphabets to another is a trivial job, so the above point shouldn't be the priority; complex alphabet sets are justifiable if sufficient benefits exist. That said, using standardized widely available encoding will contribute to simplicity.


BTW, is the base58 algorithm employed by Bitcoin the generic modulo algorithm? It reads the input byte stream from the most significant byte to the least, and given this, any base-N encoding can be applied to variable-length UUIDs in a similar way, right?

Note: the base58 algorithm described in this draft RFC is completely broken.

fabiolimace commented 2 years ago

For example, in JavaScript, the following code produces base32hex text representation:

Your snippet is perfect.

Translating one set of alphabets to another is a trivial job

Yes. Any utility like tr or translate() could be utilized to translate a base32hex string to Crockford charset (or any base-32 charset).
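For instance, a sketch of the translate() approach (table names are mine), mapping the base32hex alphabet onto Crockford's position by position:

```python
# Translate a base32hex string to the Crockford alphabet, position by position
BASE32HEX = "0123456789abcdefghijklmnopqrstuv"
CROCKFORD = "0123456789abcdefghjkmnpqrstvwxyz"
table = str.maketrans(BASE32HEX, CROCKFORD)

print("04hkapo14c0i609304hkapsang".translate(table))
# 04hmasr14c0j609304hmaswaqg
```

The same mapping works with the `tr` utility, since both alphabets are plain character ranges.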

BTW, is the base58 algorithm employed by bitcoint the generic modulo algorithm?

I did not know that. Part of what I said above is false. Thanks for the correction.

LiosK commented 2 years ago

Part of what I said above is false.

Wait. Perhaps, I am wrong! On second thought, the modulo algorithm generally reads from the most significant byte if implemented using a byte array. And you are correct that the modulo algorithm does not produce lexicographically ordered results if extra bytes are appended at the bottom. For example:

0xffff_ffff_ffff_ffff_ffff_ffff_ffff_ffffn.toString(36);      // 'f5lxx1zz5pnorynqglhzmsp33'
0xffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_ffffn.toString(36); // 'laebsi7rpgfoz4oncaub1gcx0flr'

But at the same time I am wondering if there is a technique to utilize the base36 encoding in a dictionary-ordered manner, such as something like the following (more or less broken) example where a 128-bit UUID is encoded as a larger 144-bit integer and truncated to the 25 character representation:

0xffff_ffff_ffff_ffff_ffff_ffff_ffff_fffe_0000n.toString(36).substring(0, 25);  // 'laebsi7rpgfoz4oncaub1gcwx', 128-bit UUID
0xffff_ffff_ffff_ffff_ffff_ffff_ffff_fffe_0001n.toString(36);                   // 'laebsi7rpgfoz4oncaub1gcwxmgx', UUID with 16-bit extension
0xffff_ffff_ffff_ffff_ffff_ffff_ffff_fffe_00ffn.toString(36);                   // 'laebsi7rpgfoz4oncaub1gcwxmnz', UUID with 16-bit extension
0xffff_ffff_ffff_ffff_ffff_ffff_ffff_fffe_ffffn.toString(36);                   // 'laebsi7rpgfoz4oncaub1gcwz11b', UUID with 16-bit extension
0xffff_ffff_ffff_ffff_ffff_ffff_ffff_ffff_0000n.toString(36).substring(0, 25);  // 'laebsi7rpgfoz4oncaub1gcwz', 128-bit UUID

While base32 schemes are easy options, I am also interested in bases 36, 58, and 62 because they produce a more compact format. I've been thinking that if a new text format is to be defined in the new spec, we should survey several options and pick the best one, rather than choosing an easy option.

daegalus commented 2 years ago

While base32 schemes are easy options, I am also interested in bases 36, 58, and 62 because they produce a more compact format. I've been thinking that if a new text format is to be defined in the new spec, we should survey several options and pick the best one, rather than choosing an easy option.

Earlier in this discussion we found that going above base32 forces us to include symbols or capital letters. Mixing upper and lowercase letters can cause problems in database sorting and retrieval.

Base32 is the safest. We could push to base 36, as that will just include all letters and numbers, but then there could be confusion between i l 1 0 O for human readability and friendliness.

Right now I am personally leaning toward the Geohash character set, without padding, as the ideal base32 alphabet for UUIDs. (By padding I mean ===-style padding on the encoded string, not padding of the UUID bytes.)

I feel it's more human-friendly than Crockford's, with all the base32 benefits.

LiosK commented 2 years ago

Mixing upper and lowercase letters can cause problems in database sorting and retrieval.

I was aware of this discussion. I was just concerned that we'd missed more important needs. For example, the most compact form would be preferred when a UUID is encoded in JSON and transferred over the network, and perhaps that is one of the most frequent and large-scale use cases of textual UUIDs nowadays, because JSON has no way other than relying on text, while an RDBMS can fall back on binary or change the collation settings. Although UUIDv6 and v7 are focused on convenience as a database key, the text format is applicable to all versions; therefore, the priority here isn't immediately obvious.

Secondly, I don't really see the value of allegedly human-friendly alphabets over the standardized base32hex when we already have the very human-friendly 8-4-4-4-12 format. Crockford's, etc. add some complexity to the implementations and the specification while the value proposition is unclear IMHO. Or, alternatively, if we depart from base32hex, why don't we go further to more sophisticated approaches like NCName and base 36 to pursue better features?

P.S. Personally, I like base32 schemes because they are simple and easy. I am just trying to challenge them here as an input to the discussion, because we need strong arguments to choose one over another. Plus, the JSON use case doesn't seem negligible.

daegalus commented 2 years ago

You know, the more I think about it, the more I think human readability shouldn't limit us so much. 90% of the use cases for UUIDs are unique user IDs, which mostly live in databases, URLs, cookies, etc., and a user rarely needs to read or dictate them.

As for developers, in those situations they will be copying and pasting UUIDs, not dictating or transcribing them.

With that thought, we should maybe focus on base 58 or base 62 (which is just base64 without the 2 symbols), without padding ideally. These will make shorter UUIDs and still be sortable.

As for databases, I agree that you can change the collation settings. Also most usage of this will be in new systems, or new development, so they will just either be using a more modern collation default, or know to set it to something else.

bradleypeabody commented 2 years ago

Crockford's, etc. add some complexity to the implementations and the specification while the value proposition is unclear IMHO

I agree that human-readability is not necessarily at the top of the priority list. But, my logic on it was essentially:

So it seemed a good middle ground.

the most compact form would be preferred when a UUID is encoded in JSON and transferred over the network, and perhaps that is one of the most frequent and large-scale use cases of textual UUIDs nowadays, because JSON has no way other than relying on text

This is a really good argument. My only objection is that I'm not sure it outweighs the bullets above.

Also keep in mind that per the JSON spec, binary data is encoded as base64, so technically the raw binary bytes could be transmitted in JSON as base64 and that's not our doing, just part of how JSON says to encode binary data. That would be pretty confusing though, one text representation for JSON and then another everywhere else. But maybe this cuts to the core issue: we basically have to decide which is more important:

If we did provide e.g. a base64 text format and a base32 text format, then length could be used as an indicator of which one was chosen, but it could also be brittle and will tick off all of the "I want to add more bytes to end of my UUIDs because it's not enough entropy" people and likely break their custom solutions.

Personally I think the entire concept of "parsing" a UUID is unnecessary complication, but that opinion is very limited to the scope of my own personal experience, and there are a lot of people who are used to the prior UUID versions and will probably want to do this no matter what I say. My point being: If people just treated UUIDs as opaque strings, then it wouldn't matter what the encoding format was, and we could safely provide multiple options.

LiosK commented 2 years ago

we basically have to decide which is more important:

Agreed.

Perhaps, there are a considerable number of use cases that need case-insensitive encodings and a considerable number of use cases that need compact encodings, respectively. If the standard misses one set of needs, many application-specific encodings might emerge to meet the needs. Then, it would be helpful if the RFC specifies both a case-insensitive format and a shorter format to ensure some levels of interoperability across many UUID libraries.

In this scenario, Base 36 can be attractive as a case-insensitive format if Base 58/62 is employed for the shorter encoding because they are essentially the same technique. If we go for Base 32 for the case-insensitive side, then Base 64 is a more natural and consistent choice for the short form. Users of UUID libraries will be much happier with the former pair, while library authors will be happier with the latter pair. Judgement is needed.

"parsing" a UUID is unnecessary complication

IMO "parsing" is a must if there are multiple encodings including 8-4-4-4-12 because without parsing we can't even canonicalize UUIDs in different encodings automatically!

bradleypeabody commented 2 years ago

without parsing we can't even canonicalize UUIDs in different encodings automatically

This brings up an interesting point, but is along the same vein: The way I generally think about unique identifiers is that there should, ideally, be one canonical form that you use throughout a given application. E.g. if you use base64, then that ID should be base64 everywhere, and you essentially use the textual format for everything. This generally leads to everything being a lot simpler. But, I realize it has downsides as discussed above in that one could conceivably want to use a different format for a different transport, because they do have different tradeoffs - smaller vs easier to read and possibly others.

I'll think about this more and see if I can post something more helpful as a followup. But it definitely occurs to me that from a practical standpoint I have on a number of projects just completely ditched the binary form of the UUID and used a textual string exclusively and without any encoding whatsoever. This was less space-efficient because everything is ASCII and not binary, but it was also much easier to deal solely with IDs as strings and not even try to parse them, even if it wastes a few bits during storage and transmission. I'm not saying this is ideal for all cases, but I'm just trying to see if there is some basic core concept we're missing here about this correlation between the raw bytes, the textual format(s), and when we should be recommending each is used, and the fact that people can also encode the raw bytes however they want - for better or for worse.

fabiolimace commented 2 years ago

My personal choice is base-32 with base32hex charset. It's short, simple, and sortable.

If something shorter than that is needed, but the sort order is not important, base64url is my preferred encoding. Both are from RFC-4648.

I don't want to mess up this long thread any more. But if human friendliness is the reason for preferring the Crockford and base-58 charsets, I venture to propose a base-52 or base-48 alphabet.

Chances are I will fail, but I will try. Tell me what you think.

Base-52

Charset: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz.

These are the features of base-52:

Base-48

Charset: ABCDEFGH JKLMNOPQRST VWXYZabcdefghijk mnopqrst vwxyz.

If human friendliness is an important thing, we could remove 4 letters from base-52 to form a base-48 charset. The letters to be prohibited are: I, l, U, u.

The new base-48 has the same features of base-52 above, with 2 additions:

But it drops the beauty of 'A-Za-z'. I can live without it :)

ben221199 commented 2 years ago

I have some feeling that other encoding types could be better described in another spec than this one.

bradleypeabody commented 2 years ago

I have some feeling that other encoding types could be better described in another spec than this one.

And just to mention my position on this: this document has been a work in progress for approaching three years now. The initial work started a couple of years before that. One of the primary goals is to make a spec useful for databases and for people writing applications which store UUIDs in databases. If after all this work we don't cover what the text format(s) is/are, I think that greatly reduces the usefulness of the whole project. One way or another, I would really like to see this sorted out so that someone reading the doc and thinking "okay, I'm going to implement this new UUID format; what do I do when I implement the String() method?" has a good answer. (And yes, "just use the old hex format" is an answer, but I don't believe it is the optimal one.)

cbandy commented 2 years ago

@bradleypeabody wrote:

Also keep in mind that per the JSON spec, binary data is encoded as base64, so technically the raw binary bytes could be transmitted in JSON as base64 and that's not our doing, just part of how JSON says to encode binary data.

I don't see this; can you provide a link?

LiosK commented 2 years ago

Base-52

Base-52 is actually an interesting option to me because I am slightly concerned about leading zeros that will always appear in UUIDv7 strings for several centuries. Some applications do not handle zero-starting numerals very well (notably, Excel's default CSV handling). It shouldn't matter a lot because a UUID string is rarely composed of 0-9 only, but base-52 and base-58 schemes completely address the potential leading-zero issue.
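To put a rough number on that horizon (my own back-of-the-envelope check, assuming a ULID-style 26-character base32 layout, where the leading character carries only the top 3 bits of the 48-bit Unix-millisecond timestamp):

```javascript
// With 26 base32 chars for 128 bits, the first char encodes 128 - 25*5 = 3 bits.
// For UUIDv7 these are the top 3 bits of the 48-bit ms timestamp, which remain
// zero until the timestamp reaches 2^45 milliseconds.
console.log(new Date(2 ** 45).getUTCFullYear()); // → 3084
```

So under that layout the leading base32 character stays '0' for roughly a millennium.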

I have some feeling that other encoding types could be better described in another spec than this one.

I used to have the same feeling but now I'm more or less convinced that the text format is an urgent issue worth being addressed in the standard. I don't believe the new RFC should aim to deprecate or supersede the 8-4-4-4-12 format, because it is very widely used right now and is so distinctive and human-friendly that everyone immediately recognizes it as a UUID string. But imagine a world where UUID libraries generally support multiple encodings in common and applications can choose the ones that fit their needs. It would be so nice! The new RFC can make this happen.

This project has been primarily advertised to fix the sort order issue and thus many people believe the text format is out of scope. Perhaps, some sort of "rebranding" is necessary to draw attention of broader audience to text formats.

daegalus commented 2 years ago

Base-52

Base-48

Oh, I like both of these. I like 52 but can see human readability being a problem if that is a priority. But I rarely find U u V v to be confusing. So maybe even a Base50? Throw in 2-9 and get a base 58 variant.

But I do like the cleanliness of not dealing with numbers.

broofa commented 2 years ago

And yes, "just use the old hex format" is an answer, but I don't believe it is the optimal one.

I'm more or less convinced that the text format is an urgent issue

@bradleypeabody @LiosK Can you elaborate on these comments? I get that the current hex form is not optimal, but I'm not convinced a good enough case has been made for additional formats to overcome the advantage of having a single, canonical text format (albeit an imperfect one.)

A single format makes it unambiguous how UUIDs should be transported. That's a really nice trait to have when designing a system or API that interfaces with unknown systems. (It also keeps things simple for library authors, like myself.) As soon as we add a new text format - any format - we complicate matters in that regard significantly.

[Edit to add: I guess this boils down to much the same question we've had elsewhere about whether we're replacing or extending the existing RFC. If we're replacing the RFC and could effectively deprecate the legacy text format, I'd be fine with choosing a different, more optimal format. But if we're not then maybe we should avoid introducing complexity unless it's clearly warranted?]

bradleypeabody commented 2 years ago

@cbandy My bad, this approach of encoding binary data as base64 appears to just be a convention of some json encoders, not a standard.

@broofa I hear you on the simplicity of just having one text format. To understand the pain I am trying to address, let's look at an example:

The above general scenario, while some of the elements may be slightly contrived, more or less echoes the story of thousands of startups and other growing businesses. As you can see, if you have a good ID generation pattern, these IDs can end up in all sorts of places and scenarios.

Here's my analysis of what happens to the above with scenarios of different text formats:

Base32

Base64

Base52

Hex

Allow Multiple Text Formats

Because of all of these tradeoffs, it's easy to say let's allow multiple formats. However, this creates a new set of problems: any place we actually need to parse the text format will also need to know which format is intended. Since one important theme from the above is space efficiency, and most applications will want to pick one format and stick with it if possible, carrying the format info around (e.g. with another field indicating the text format sent along with the data, or some sort of prefix, e.g. "ub64:UUIDB64ENCODEDHERE_-") doesn't make sense if we're trying to keep IDs short.

So in practical terms, I think we either need to:

  1. have one text format, or
  2. if we allow multiple, you have to be able to reliably tell the difference between them while parsing (and unfortunately you can't tell the difference between base32 and base64 by looking at them - except by length, which raises a different question about how brittle that might be, or what happens if someone decides they want to add more bytes for entropy - yes, not part of the spec, but also a case that has been mentioned many times)

Recommendation

The above analysis is why my recommendation is still: use Crockford base32 as the default for new UUIDs (the old hex format is still allowed), and then make Parse() implementations look for a "-" and parse as hex if found, or as Crockford base32 if not. This effectively addresses every concern listed above at the expense of some space (in practical terms, it means UUIDs in text form will be 26 bytes long instead of 20 for base64, but at least it's not 36).

And, by ensuring we have a reliable path to convert between binary and text, we make it so if we really, really care about size (e.g. wire format over expensive transmission methods, or the cumulative effect of size efficiency while inserting billions of UUIDs into a database over time) - for these cases we're ensuring that you can use the binary form easily because your UUID.Parse() method just works as you would expect instead of having ParseBase32 and ParseBase64 and no easy way to know in your application when you should be using which one.
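A sketch of that dispatch (my own illustration, not text from the draft; input validation and Crockford's I/L/O character aliasing are omitted for brevity):

```javascript
// Sketch: Parse() looks for "-" and treats the input as legacy 8-4-4-4-12 hex
// if found, or as a 26-character Crockford base32 string if not.
const CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

function parseUUID(text) {
  if (text.includes("-")) {
    // Legacy hex format: strip dashes, decode byte pairs.
    const hex = text.replace(/-/g, "");
    return Uint8Array.from(hex.match(/../g), (pair) => parseInt(pair, 16));
  }
  // Accumulate the 26 base32 digits into a 128-bit BigInt.
  let n = 0n;
  for (const c of text.toUpperCase()) {
    n = n * 32n + BigInt(CROCKFORD.indexOf(c));
  }
  // Unpack into 16 bytes, most significant first.
  const bytes = new Uint8Array(16);
  for (let i = 15; i >= 0; i--) {
    bytes[i] = Number(n & 0xffn);
    n >>= 8n;
  }
  return bytes;
}
```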

It's also worth noting that a UUID is, at the end of the day, just some bytes. So if someone really disagrees with the above and decides that base64 or base52 is really the thing for them, there is nothing stopping them from just doing Base52Encode(NewUUID()). All we're really trying to decide here is what String() and Parse() should do in a UUID library that supports this new spec.

P.S. Sorry that was so long, but I think it's an important analysis that hadn't been clearly stated before.

fabiolimace commented 2 years ago

Increased implementation complexity. (I'm still not clear on how you implement an encoding like this which is not bit-aligned - does this mean one needs to treat all 16 bytes as a 128-bit integer and do a modulo on the whole thing? Can you even do such a thing in languages like JS? How difficult is it?)

The only way I know of is to treat all 16 bytes as a 128-bit integer and do successive modulo operations. I have no idea how to encode other bases than 16, 32, and 64 without modulo. And I tried.

In Javascript it is necessary to use BigInt. If anyone is interested, an implementation in Python and JS can be found here. JS code can be tested online here.

P.S. Does anyone know how to do this without modulo/remainder?
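A minimal sketch of this BigInt approach (base-36 chosen arbitrarily; digits are peeled off with successive modulo/division, and ceil(128 / log2(36)) = 25 gives the fixed output width):

```javascript
// Sketch: treat the 16 bytes as one 128-bit BigInt and take successive
// remainders to produce 25 fixed-width base-36 digits.
const DIGITS36 = "0123456789abcdefghijklmnopqrstuvwxyz";

function encodeBase36(bytes) {
  let n = bytes.reduce((acc, b) => (acc << 8n) | BigInt(b), 0n);
  let out = "";
  for (let i = 0; i < 25; i++) {
    out = DIGITS36[Number(n % 36n)] + out;
    n /= 36n;
  }
  return out;
}
```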

LiosK commented 2 years ago

@fabiolimace, please take a look at my quick code that supports base-2 through base-128 modulo encoding/decoding of byte arrays of any length. This algorithm works in any language that supports arrays and numbers. This code can be simpler if it supports a single digit set and 128-bit input only; I will paste a simpler extract later.

Edit: Added a simple base36 implementation in JS. IMO the modulo encoding isn't that complex when written in source code. It's very heavy computationally for a large input byte array, but that's not a big deal for 128-bit arrays. The number of loops can easily be reduced to 1/4 or 1/5 in the following code by reading multiple bytes/digits at a time; see my up-to-date quick code.

const DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz";

/** O(1) map from ASCII code points to digit values. */
const DECODE_MAP = new Uint8Array(128).fill(0xff);
for (let i = 0, uc = DIGITS.toUpperCase(); i < DIGITS.length; i++) {
  DECODE_MAP[DIGITS.charCodeAt(i)] = i;
  DECODE_MAP[uc.charCodeAt(i)] = i; // omit this to make decoder case-sensitive
}

/** Encodes a byte array to text. */
export const encode = (bytes) => {
  const dst = convertRadix(bytes, 256, DIGITS.length);
  let text = "";
  for (let i = 0; i < dst.length; i++) {
    text += DIGITS[dst[i]];
  }
  return text;
};

/** Decodes text to a byte array. */
export const decode = (text) => {
  const src = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) {
    const c = DECODE_MAP[text.charCodeAt(i)] ?? 0xff;
    if (c >= DIGITS.length) {
      throw new SyntaxError("invalid character");
    }
    src[i] = c;
  }
  return convertRadix(src, DIGITS.length, 256);
};

/** Converts a digit value array in `srcRadix` to that in `dstRadix`. */
const convertRadix = (src, srcRadix, dstRadix) => {
  const dstSize = Math.ceil(
    src.length * (Math.log2(srcRadix) / Math.log2(dstRadix))
  );
  const dst = new Uint8Array(dstSize);
  for (let i = 0; i < src.length; i++) {
    let carry = src[i];
    for (let j = dst.length - 1; j >= 0; j--) {
      carry += dst[j] * srcRadix;
      dst[j] = carry % dstRadix;
      carry = Math.trunc(carry / dstRadix);
    }
    // assert(carry === 0);
  }
  return dst;
};

Edit: Added a simple base32 implementation based on RFC 4648 algorithm. It's twice as fast as base36 modulo algorithm but the source code is as complex as base36 with slightly more lines of code.

const DIGITS = "0123456789abcdefghijklmnopqrstuv";

/** O(1) map from ASCII code points to digit values. */
const DECODE_MAP = ... // same as base36

/** Encodes a byte array to text. */
export const encode = (bytes) => {
  ... // same as base36
};

/** Decodes text to a byte array. */
export const decode = (text) => {
  ... // same as base36
};

/** Converts a digit value array in `srcRadix` to that in `dstRadix`. */
const convertRadix = (src, srcRadix, dstRadix) => {
  const srcRadixBitLen = Math.log2(srcRadix);
  const dstRadixBitLen = Math.log2(dstRadix);
  const dstSize = Math.ceil(src.length * (srcRadixBitLen / dstRadixBitLen));
  const dst = new Uint8Array(dstSize);

  let carryBitLen = 0;
  let carry = 0;
  let dstCursor = 0;
  for (let i = 0; i < src.length; i++) {
    carryBitLen += srcRadixBitLen;
    carry = (carry << srcRadixBitLen) | src[i];
    while (carryBitLen >= dstRadixBitLen) {
      carryBitLen -= dstRadixBitLen;
      dst[dstCursor++] = carry >>> carryBitLen;
      carry &= (1 << carryBitLen) - 1;
    }
  }
  dst[dstCursor] = carry << (dstRadixBitLen - carryBitLen);
  // assert(dstCursor === dst.length - 1);
  return dst;
};

LiosK commented 2 years ago

@broofa,

Text format is definitely a problem that has been damaging to the usefulness of UUID. Most of the recently emerged UUID alternatives utilize shorter formats than 8-4-4-4-12. This fact exemplifies the pain application developers feel about UUID. Without a better text format, more and more users will run away from UUID.

That said, I have a very different view on this topic than @bradleypeabody's. I believe the new RFC should extend the old one and should not (or doesn't need to) deprecate or substitute the original RFC. Accordingly, I believe the new RFC should be clear that 8-4-4-4-12 is still a single canonical text format that every implementation MUST support. 8-4-4-4-12 has been and will be widely used without any problem in many applications. Such applications include very conservative use cases that need decades to switch from one format to another. For example, I don't believe UUID strings that show up on UEFI's GPT config screen should be changed to Crockford's or something. The new RFC should not confuse such users or leave them behind; otherwise, the new RFC might be seen as a folk project that effectively divides the existing UUID community and ecosystem.

But at the same time it would be very valuable if the spec recommended several text format options that implementations SHOULD support. Currently, application designers have to invent application-specific text formats by themselves if they don't like 8-4-4-4-12. That's actually a very difficult task for modern applications because they employ multiple languages (e.g. JS/Swift/Kotlin for frontend, Go for backend, SQL for database) and have to find an interoperable library for each of those languages. If the UUID spec specifies some alternative formats, then UUID library authors can work on this task on behalf of application developers. UUID libraries will gradually implement features to generate/parse/convert multiple UUID formats, and at some point in the future, when UUID libraries generally provide multi-format functionality, application designers will be able to choose text formats that fit their needs.

This definitely complicates UUID libraries but simplifies applications, and since few people actually write UUID libraries while tons of people rely on them, the overall positive impact will be enormous. Plus, library authors are generally supposed to be subject-matter experts in the area of algorithms and bit/byte handling, so they will do a good job. That's why I am leaning towards providing multiple (say, base 36 and 62) feature-rich, if complex, encoding schemes.

fabiolimace commented 2 years ago

@LiosK excellent! You did it!

We now have an encoder for multiple bases, multiple character sets, and multiple data lengths.

If an alternative base is chosen, like base-36, a short section should be added to the specification with a C version of the code.

fabiolimace commented 2 years ago

I was curious what the probability is that a four-letter word occurs in a base-n UUID series. So I created a loop to count the matches. If you are also curious, check out this Gist.

This is related to the prohibition of the letter 'U' in Crockford encoding. More matches could be interpreted as more accidental bad words.

This could be done by calculating the probability for each encoding, but I'm lazy :)

PS1: I don't think it is important. But it might be for some people.

PS2: Base-36 wins this game.

kyzer-davis commented 2 years ago

Fantastic discussion group!

Let me summarize key bullets I pulled out of the thread so I can merge some text for review:


Did I miss anything?

bradleypeabody commented 2 years ago

@kyzer-davis Here's my summary on this:

That's my analysis. If there are specific challenges to the logic above, then I'd like to get those specifically listed out here so we can nail this down.

LiosK commented 2 years ago

I agree that the spec should recommend some specific encodings and alphabets so UUID library authors can implement such formats in an interoperable way. Whether to recommend a single alternative format or multiple alternative formats is in question. While ULID utilizes Base-32 and is quite influential, many UUID alternatives employ the shorter Base-62 and Base-64. For this reason, I am not really convinced that a single alternative text format can sufficiently address the issues that have driven many people toward UUID alternatives. Also, from an implementer's point of view, supporting multiple formats doesn't seem to be a big deal as long as the multiple formats use the same algorithm (e.g. base-32/64 pair, base-36/62 pair).

bradleypeabody commented 2 years ago

@LiosK Keep in mind that if we did select just one text format to be indicated in the specification, this does not mean that other formats cannot be used by applications for their own purposes. After all, for most applications it's just an ID and there's nothing inherently wrong with someone using some other text format if that's what's best for their application. Also realize that even if the spec did say you can use either base64 or base32, there would still be cases where people want more density, or don't like having digits in the alphabet, etc, etc. - it's simply not possible to please everyone and there will be tradeoffs. I would also be fine with some text in the spec which describes the tradeoffs here and suggests base32 crockford as the default and clearly says that there may be other cases for specific applications where some other encoding is warranted (e.g. base64) and there is nothing that prevents them from doing this.

That said, please help me understand this point:

from an implementer's point of view, supporting multiple formats doesn't seem to be a big deal as long as the multiple formats use the same algorithm (e.g. base-32/64 pair, base-36/62 pair).

How would the implementor of a library update functions like String() and Parse() to accommodate both formats? Please describe the algorithm you suggest for this.

fabiolimace commented 2 years ago

How would the implementor of a library update functions like String() and Parse() to accommodate both formats? Please describe the algorithm you suggest for this.

The specification could define two things:

These standard methods could be defined by the specification:

These alternative methods could be defined by the specification:

broofa commented 2 years ago

The implementation question gets tricky, especially if the spec is defining (or "suggesting") the API. For example, in JavaScript, passing a format arg rather than having a separate method will prevent bundlers from doing tree-shaking.

This means all bundled code will have to include the base-x encode and decode implementations, even if it’s never used.

LiosK commented 2 years ago

@bradleypeabody,

My point in the previous two comments is that the standard can help library authors write libraries that work together. If the spec defines one alternative format, then libraries will generally support one alternative and dependents can choose from one. If the spec defines two alternative formats, then libraries will generally support two alternatives and dependents can choose from two.

Yes, applications can do whatever they want, but if they find UUID doesn't meet their needs, they might simply go away from it and UUID will lose customers. "Who is the customer" might be a considerably arguable question, but I think it's sufficiently sensible to see those who need a compact format as customers because of the fact that many UUID alternatives did choose that way.

Perhaps we are aiming at slightly different goals, @bradleypeabody. You actually do not mean alternative; you want Crockford's to be the single new canonical encoding of UUID (or something close to that), don't you? I wouldn't really agree with that idea. IMO the new RFC should not replace or deprecate the original RFC. In this sense, String() must keep returning 8-4-4-4-12, but the new RFC can still help libraries implement BaseXXString() methods in addition.

Parsing may be an issue, but detection by length works. This strategy is also viable with variable-length IDs if designed properly.
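For concreteness, here is the kind of length table such a detection strategy might use (illustrative values only, assuming fixed-width, zero-padded encodings of exactly 128 bits):

```javascript
// Illustrative only: guess the encoding of a UUID string from its length.
function detectFormat(text) {
  switch (text.length) {
    case 36: return "hex-8-4-4-4-12"; // legacy canonical format with dashes
    case 32: return "hex-no-dashes";
    case 26: return "base32";         // ceil(128 / 5)  = 26
    case 25: return "base36";         // ceil(128 / log2(36)) = 25
    case 22: return "base62-or-64";   // ceil(128 / log2(62)) = ceil(128 / 6) = 22
    default: throw new Error("unrecognized UUID string length");
  }
}
```

Note that base-62 and base-64 collide at 22 characters, so length alone cannot distinguish every pair of formats; that is one reason to keep the recommended set small.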

As for the implementation, the following simple C function supports conversion from/to any bases from base-2 to base-256 (aka byte array) using the modulo algorithm. Adding encode/decode functions that map numbers from/to characters is trivial. You can find a complete base36 example that also illustrates how the following code can be optimized.

/** Converts a digit value array in `src_radix` to that in `dst_radix`. */
int convert_radix(uint8_t *dst, const uint8_t *src,
                  int dst_radix, int src_radix,
                  int dst_size, int src_size) {
  for (int i = 0; i < dst_size; i++) {
    dst[i] = 0;
  }

  for (int i = 0; i < src_size; i++) {
    uint_fast32_t carry = src[i];
    for (int j = dst_size - 1; j >= 0; j--) {
      carry += dst[j] * src_radix;
      dst[j] = carry % dst_radix;
      carry = carry / dst_radix;
    }
    if (carry != 0) {
      return -1; // dst_size too small
    }
  }
  return 0;
}

Base-32/64 family can be implemented similarly by defining another convert_radix() function. You can find a JS example in a previous comment. So providing base-36/62 pair or base-32/64 pair is trivial, but mixing the two (e.g. base32/58) requires two convert_radix() functions and thus simply doubles the work.

bradleypeabody commented 2 years ago

@LiosK

You actually do not mean alternative; you want Crockford's to be the single new canonical encoding of UUID (or something close to that), don't you?

I think this question cuts to the core issue here. Yes, I was thinking that we would generally be asking library authors to change the implementation of functions like String() for a UUID to return a Crockford base32 value. But, looking this over, it's possible this is too big of a change to ask people to do.

My general thought on it is that this 8-4-4-4-12 format is just, generally speaking, an inefficient and not-useful format. But I readily admit this is just an opinion and I'm sure there will be plenty of people that disagree, citing for example the readability of hex as an advantage (even though I don't get why this is important for a UUID, but still - it's all a matter of view point).

If there were strong support from library authors for making this change, I think it could work. But so far I suspect it would just be a bunch of people yelling at me about why this change was made, and I don't personally have the time or energy to contact the various library authors and try to gain support for the idea. So then the question is what other options are on the table aside from changing the default format.

And as we look at that, @broofa brought up a really good point on this as well:

This means all bundled code will have to include the base-x encode and decode implementations, even if it’s never used.

I think this is the main disadvantage of the "let people use whatever base they want" approach - it makes every implementation more complicated when only a relatively few will want it.

I'll post my idea on an alternative here next as a separate entry.


And on a separate but related note, I'm also seriously considering writing a set of C function declarations to include as an appendix, which would essentially be an example of what an API could look like, and which functions would make sense to combine vs keep separate, etc. I feel that, while still being only a recommendation, this could at least provide library authors with a quick way to understand how all of this is intended to be used. I wouldn't want people to take it as some sort of rule about what code they could or couldn't write, but I would like to provide a quick answer to "so what the heck are you actually expecting me to do with all this?" - it could be really helpful. Such functions could be clearly labelled as either core-spec functionality ("MUST" items) or additional nice-to-haves ("SHOULD", "MAY" and other "not part of the spec but you can do whatever you want here" things). I'll try to hammer this out over the next few days and make a separate issue.