Additional Update Motivations

LiosK commented 1 year ago

For example:

The fact that UUIDs can be used to create unique, reasonably short values in distributed systems without requiring coordination makes them a good alternative, but UUID versions 1-5, which were defined by {{RFC4122}}, lack certain other desirable characteristics:

It's an obvious fact for us, but that might not be the case decades later....

EDIT: I've been wondering whether a concise text stating that the new RFC adds v6/v7/v8/Max and inherits the other versions from RFC 4122 in a compatible manner would help both current and future readers understand the changes made as well as the historical context. Do you think we can sneak one in somewhere in the draft?

sergeyprokhorenko commented 1 year ago

If something changes decades later, our successors will make changes to the RFC. The section describes the motivation for the new version of the RFC; it is not meant to stand for centuries.

LiosK commented 1 year ago

That's not the point. It'll be difficult for future readers to understand the section without the knowledge that v6-v8 didn't exist at the time of writing.

sergeyprokhorenko commented 1 year ago

@LiosK

It'll be difficult for future readers to understand the section without the knowledge that v6-v8 didn't exist at the time of writing.

We can add a sentence saying that versions 6 and 7 were added to eliminate significant disadvantages of the previous versions 1-5, and that version 8 was added so as not to limit the creative freedom of implementers.

LiosK commented 1 year ago

Again, you don't need to be mean to the old versions. The problems of the old versions are perfectly expressed by the current draft.

kyzer-davis commented 1 year ago

Updated Title for the changelog. I added this + the other items I wanted to get into update motivations: https://github.com/ietf-wg-uuidrev/rfc4122bis/pull/152/commits/f85016fc2c030377620b739b984cf698bd57b12d

Basically "why did we replace 4122 instead of just update it" in six succinct bullets.

sergeyprokhorenko commented 1 year ago

@kyzer-davis

Updated Title for the changelog. I added this + the other items I wanted to get into update motivations: f85016f

Basically "why did we replace 4122 instead of just update it" in six succinct bullets.

From the rationale, it seems that the huge project to replace the RFC was carried out due to some vague minor reasons. The real reasons (see ULID) are hidden:

UUID can be suboptimal for many use-cases because:

It isn't the most character efficient way of encoding 128 bits of randomness (this hasn't been fixed yet, see this)

UUID v1/v2 is impractical in many environments, as it requires access to a unique, stable MAC address

UUID v3/v5 requires a unique seed and produces randomly distributed IDs, which can cause fragmentation in many data structures

UUID v4 provides no other information than randomness which can cause fragmentation in many data structures

It’s not good to mislead people that everything is OK with versions 1-5, and the problems are only in minor typos in the RFC text.

I want to remind you how this project began. The new RFC will become popular only thanks to its immediate predecessor and inspiration: ULID with sequence (i.e. counter).

kyzer-davis commented 1 year ago

@sergeyprokhorenko, To be clear: I did not remove the original update motivations. I simply added additional bullets from an earlier interim meeting that discussed the reasons behind the IETF's decision to create the 4122bis item, which replaces 4122, instead of just adding v6-8 in an RFC that updates it.

The items discussing 1-5 as suboptimal and introduction of v6/v7 still exist as leading text in the update motivations section.

sergeyprokhorenko commented 1 year ago

@kyzer-davis

Thank you for the clarification!

@sergeyprokhorenko, To be clear: I did not remove the original update motivations. I simply added additional bullets from an earlier interim meeting that discussed the reasons behind the IETF's decision to create the 4122bis item, which replaces 4122, instead of just adding v6-8 in an RFC that updates it.

The items discussing 1-5 as suboptimal and introduction of v6/v7 still exist as leading text in the update motivations section.

But of the significant disadvantages of versions 1-5 listed by ULID, the following disadvantage has not yet been included in the RFC:

UUID v3/v5 requires a unique seed and produces randomly distributed IDs, which can cause fragmentation in many data structures

I think by seed they mean the hash function argument, which is unexpectedly changeable in 99% of cases. The second part of this disadvantage is equivalent to the version 4 disadvantage listed in the RFC.

kyzer-davis commented 1 year ago

NP, and thanks for the last comment. I fully understand why one would want to cover that in a bullet, which is also the original ask in #155 (I just caught up on it a moment ago).

Let me discuss real fast with Brad on some text that could be added to update motivations to specifically call out how hash-based UUIDs don't work nicely for databases so we can bookend that item (and the rest of my PR for draft-12) and move onto the IANA/SHA256 branches with #152 as my base.

bradleypeabody commented 1 year ago

UUID v3/v5 requires a unique seed and produces randomly distributed IDs, which can cause fragmentation in many data structures

I think this would be covered by just tweaking the first bullet point to explain that UUIDs involving hashes have the same problem.

Current text:

Non-time-ordered UUID versions such as UUIDv4 (described in {{uuidv4}}) have poor database index locality. This means that new values created in succession are not close to each other in the index and thus require inserts to be performed at random locations. The resulting negative performance effects on common structures used for this (B-tree and its variants) can be dramatic.

How about this instead (only the first sentence changed):

Non-time-ordered UUID versions such as UUIDv4 (described in {{uuidv4}}), as well as UUID versions 3 ({{uuidv3}}) and 5 ({{uuidv5}}) which use hash functions with even bit distribution, have poor database index locality. This means that new values created in succession are not close to each other in the index and thus require inserts to be performed at random locations. The resulting negative performance effects on common structures used for this (B-tree and its variants) can be dramatic.

This is the very first bullet point in the Update Motivations section, prominent. I'm not stuck on that being the exact change, but it seems like it covers it to me.
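To make the index locality point concrete, here is a minimal sketch that inserts keys into a sorted list (a stand-in for a B-tree index) and counts how many inserts land at the end. The `time_prefixed_id` helper is hypothetical: it only mimics a time-ordered layout and is not a compliant UUIDv7, and the one-ID-per-millisecond clock is simulated.

```python
import bisect
import os
import uuid


def time_prefixed_id(ms: int) -> bytes:
    """Hypothetical time-ordered ID: 48-bit millisecond timestamp + 80 random bits.

    This only mimics the general idea of a time-ordered layout; it is not a
    compliant UUIDv7 (no version or variant bits are set).
    """
    return ms.to_bytes(6, "big") + os.urandom(10)


def count_tail_inserts(keys) -> int:
    """Insert keys into a sorted list and count how many landed at the end."""
    index = []
    tail = 0
    for key in keys:
        pos = bisect.bisect_right(index, key)
        if pos == len(index):
            tail += 1
        index.insert(pos, key)
    return tail


N = 1000
random_keys = [uuid.uuid4().bytes for _ in range(N)]
# Simulated clock: one ID per millisecond, so timestamps strictly increase.
ordered_keys = [time_prefixed_id(1_700_000_000_000 + i) for i in range(N)]

v4_tail = count_tail_inserts(random_keys)
ordered_tail = count_tail_inserts(ordered_keys)
print(f"uuid4 appends at end: {v4_tail}/{N}")
print(f"time-prefixed appends at end: {ordered_tail}/{N}")
```

With the simulated clock every time-prefixed key is appended at the end, while only a handful of the UUIDv4 keys are; the rest go to arbitrary positions inside the index.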

sergeyprokhorenko commented 1 year ago

Great! But the problem that

UUID v3/v5 requires a unique seed

is not mentioned. I would add a new bullet point:

UUID versions 3 ({{uuidv3}}) and 5 ({{uuidv5}}), which use hash functions, are at risk of their hash function argument changing unexpectedly.

because hash-based versions are not recommended for use cases where the input may change, and the risk may be underestimated.

danielmarschall commented 1 year ago

Name-based UUIDs do not require a unique seed.

Again, it depends on the use case.

If the use case is to have a unique ID for a database, yes, then you need a unique seed. But then UUIDv3(getRandomBytes()) gives the same contents as UUIDv4() (except for the version bits). So there is no need to use a name-based UUID in that case. Use UUIDv7 instead.
If the use case is to obtain a UUID representation of a URL/OID/etc., then UUIDv3 and UUIDv5 are the way to go, e.g. UUIDv3(NS_OID, '2.999'). This is NOT a "Unique Identifier" (so the name UUID can be confusing); it is a name-based representation.

I think by seed they mean the hash function argument, which is unexpectedly changeable in 99% of cases

I do not know what the 99% are. Can you give me a real-world example?

An OID 2.999 stays OID 2.999 to all eternity, and so does its UUID representation.

URL www.example.com stays URL www.example.com for all eternity, and so does its UUID representation.
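A minimal sketch of both use cases with Python's standard `uuid` module. Pairing the literal string `www.example.com` with `uuid.NAMESPACE_URL` and using a random hex string as the "random seed" are assumptions made for illustration only.

```python
import os
import uuid

# Name-based representation: the same namespace and name always map to the same UUID.
oid_a = uuid.uuid3(uuid.NAMESPACE_OID, "2.999")
oid_b = uuid.uuid3(uuid.NAMESPACE_OID, "2.999")
url_a = uuid.uuid5(uuid.NAMESPACE_URL, "www.example.com")
url_b = uuid.uuid5(uuid.NAMESPACE_URL, "www.example.com")
print(oid_a == oid_b, url_a == url_b)

# A random "name" yields a different UUIDv3 each time, which behaves like a
# random UUID apart from the version bits.
rand_a = uuid.uuid3(uuid.NAMESPACE_OID, os.urandom(16).hex())
rand_b = uuid.uuid3(uuid.NAMESPACE_OID, os.urandom(16).hex())
print(rand_a != rand_b, rand_a.version)
```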

bradleypeabody commented 1 year ago

@sergeyprokhorenko Sorry, I'm not trying to drag this on unnecessarily, but can you elaborate a bit more on what it is that you're concerned someone might mess up about this? (I don't fully understand the concern because a hash function by definition gives the same output with the same input, but maybe if I understand what you think people might get wrong about this, that might clarify.)

fabiolimace commented 1 year ago

and the risk may be underestimated

I understand your concern, but it's not necessary. The risk is intrinsic. If you change 1 bit of the input of a cryptographic hash function, the output will be different with a very high probability. This is how hash functions work. It is more accurate to call this a guarantee rather than a risk.

However, if you really want this to be in the new document, how about this paragraph:

Considering that one of the properties of identifiers is immutability (during the lifetime of the data to which they are attached) and that hash functions produce different results for each input, it is not recommended to employ name-based UUIDs generated with arguments that may change in the future.

For example, we know that people's names can change when they get married, so it is not recommended to use people's names to generate name-based UUIDs. I don't believe anyone would actually do this; it's just a basic example.

LiosK commented 1 year ago

I think by seed they mean the hash function argument, which is unexpectedly changeable in 99% of cases

This is not necessary as I discussed in #155. Please don't cross-post the same argument in every thread.

as well as UUID versions 3 ({{uuidv3}}) and 5 ({{uuidv5}}) which use hash functions with even bit distribution,

I don't think this is necessary either. This insertion blurs the point of the statement. v3/v5 have very different properties and use cases than v1/v4, so while it is intuitive and likely that v7 replaces v1/v4 just because of the index locality issue, the same thing isn't likely to happen for v3/v5, because choosing between hash-based and time-ordered identifiers requires a serious design decision.

kyzer-davis commented 1 year ago

https://github.com/ietf-wg-uuidrev/rfc4122bis/commit/529860f21c3b4da22210ebd35a083c7c27e4225a has been posted. I did not use the word seed but rather input "name" data to tie it back to the rest of the document's verbiage.

I think this covers the bullet sufficiently and draws attention to what we all know: v3/v5 with input data that changes will basically be the same as v4 (in the grand scheme of a database identifier).

I don't want to over iterate on this. If the text conveys this point please thumbs up so I can merge draft 12 down and move onto larger items with the remaining time I have this week.

LiosK commented 1 year ago

I don't think https://github.com/ietf-wg-uuidrev/rfc4122bis/commit/529860f21c3b4da22210ebd35a083c7c27e4225a is necessary. The whole point is perfectly clear with the original concise text, and this commit inserts some irrelevant and confusing concepts. In particular, it's quite odd to discuss v3/v5 in the sortability context, because the hash-based properties of v3/v5 require a tailored identifier design, which might not be compatible with index locality (for example, think of Git and Bitcoin, very successful applications with hash-based IDs, though not UUIDs).

sergeyprokhorenko commented 1 year ago

I don't think 529860f is necessary. The whole point is perfectly clear with the original concise text, and this commit inserts some irrelevant and confusing concepts. In particular, it's quite odd to discuss v3/v5 in the sortability context, because the hash-based properties of v3/v5 require a tailored identifier design, which might not be compatible with index locality (for example, think of Git and Bitcoin, very successful applications with hash-based IDs, though not UUIDs).

Bitcoin is widely known for its enormous transaction confirmation times. This is a great argument for UUIDv7.

sergeyprokhorenko commented 1 year ago

@sergeyprokhorenko Sorry, I'm not trying to drag this on unnecessarily, but can you elaborate a bit more on what it is that you're concerned someone might mess up about this? (I don't fully understand the concern because a hash function by definition gives the same output with the same input, but maybe if I understand what you think people might get wrong about this, that might clarify.)

@bradleypeabody

I'll give you an example. Let's assume that an ordinary systems analyst decided to identify regional offices by hash-based UUIDs. He decided to concatenate the country code and the postal code as the hash function argument. This decision seemed reasonable to him (because he didn't want to make his job more difficult). For three years everything was OK, and this systems analyst even managed to quit and get a more attractive position at another company. But suddenly the postal code of one of the regional offices changed. And now there are hundreds of thousands of records in the database with two different hash-based UUIDs for the same regional office. Now the new systems analyst tasked with fixing the defect curses his predecessor.

Unfortunately, any argument of a hash function can change unexpectedly in the future. This is well known to database designers, who try to use surrogate keys instead of natural (or business) keys. Changes to natural keys are specifically recorded with timestamps in temporal databases, including DWH, for communication with the outside world. But DWH tables are linked by surrogate keys only.
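A minimal sketch of the regional office scenario, assuming UUIDv5 from Python's standard `uuid` module. The namespace name `offices.example.com`, the `office_id` helper, the `country:postal` name format, and the postal codes are all hypothetical and exist only for this illustration.

```python
import uuid

# Hypothetical application namespace, used only in this illustration.
OFFICE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "offices.example.com")


def office_id(country_code: str, postal_code: str) -> uuid.UUID:
    """Derive a UUIDv5 from a natural key (country code + postal code)."""
    return uuid.uuid5(OFFICE_NAMESPACE, f"{country_code}:{postal_code}")


before = office_id("US", "10001")
after = office_id("US", "10002")  # same office, postal code reassigned

# The natural key changed, so the derived identifier changed with it.
print(before == after)

# A surrogate key is generated once and stored; it does not depend on any
# attribute of the office, so attribute changes cannot alter it.
surrogate = uuid.uuid4()
print(surrogate.version)
```

Records written before and after the postal code change would carry `before` and `after` respectively, which is the split described above; a stored surrogate key avoids it.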

bradleypeabody commented 1 year ago

@sergeyprokhorenko Thanks for that explanation, totally makes sense.

In line with your points above, I think those concerns are implied when using a hash function, but you're absolutely right that historically people have screwed this up by not fully thinking through the implications and picking the wrong solution for a particular use case.

If you feel strongly that this is worth making an edit for, perhaps a simple sentence or two with a reference to a third party source that talks about the concerns in more detail would work. E.g. adding a line like this in the v3 and v5 sections, and if appropriate v8 hash example:

Implementors using UUIDs that utilize hash functions as above, especially when used as a database key, should carefully consider the implications of the possibility of the hash function input changing in the future. Hash functions as used here imply a "natural key" (see https://en.wikipedia.org/wiki/Natural_key), whereas database maintenance often benefits from the use of "surrogate keys" (see https://en.wikipedia.org/wiki/Surrogate_key).

Or it's possible there might be a better place in the document for this warning, such as the Name-based UUID Generation section.

kyzer-davis commented 1 year ago

FYI, I backed out https://github.com/ietf-wg-uuidrev/rfc4122bis/commit/529860f21c3b4da22210ebd35a083c7c27e4225a so I can get draft-12 base PR merged down and start work on the other two branches I need to get done ASAP. Let's take this v3/v5 DB discussion back to #155 as I am going to close this issue in the PR.

ietf-wg-uuidrev / rfc4122bis

Additional Update Motivations #157