ietf-wg-uuidrev / rfc4122bis

revision to RFC4122
Other
57 stars 11 forks source link

Discuss why v3/v5 are bad for Database Usage #155

Closed sergeyprokhorenko closed 1 year ago

sergeyprokhorenko commented 1 year ago

I propose to add this paragraph to the Section 2.1. Update Motivation:

  1. Hash-based UUID versions such as UUIDv3 (described in Section 5.3) and UUIDv5 (described in Section 5.5) can change their value over time as the hash function argument can unexpectedly change. This makes it difficult to use them as primary keys, including in the Data Vault modeling method.
sergeyprokhorenko commented 1 year ago

FYI, Daniel Linstedt, author of the Data Vault modeling method, is considering replacing hash keys with UUIDv7 if the following conditions are met:

  1. UUIDv7 is available cross-platform in the database management systems that people use around the world
  2. Technical spec sheets are provided
  3. Large-scale performance testing done
  4. The results are published on the forums (https://github.com/uuid6/uuid6-ietf-draft/issues/128)
  5. There must be open list as to which platforms support this, and which don't (https://github.com/uuid6/uuid6-ietf-draft/issues/129)
  6. There are installation and configuration instructions on how to add the functionality to a specific platform
kyzer-davis commented 1 year ago

I was mauling over if it was worth discussing hash-based in update motivation at all.

To be honest that section is from when we were trying to add v6, v7, v8. There are many items I could add as it pertains to rewriting 4122 in the update motivations (bit layouts for all, implementation best practices, hashing info, new security guidelines, better formatting, etc, etc.) but IMO I don't think they are needed now. The document has been updated and the point is somewhat moot.

sergeyprokhorenko commented 1 year ago

@kyzer-davis

It seems to me that it is still worth adding the proposed paragraph or alternative. This is not a matter of controversy, which you would reasonably want to avoid. The fact is that the main area of application of UUID is data warehouse (DWH), especially in banks. The main method for modeling DWH is Data Vault. Data Vault recommends using hash keys generated from composite business keys as primary keys. Many large banks use Data Vault, but use integer keys (auto-increment) instead of hash keys, because hash keys are not really suitable as primary keys. This is a defect in the DWH modeling method. The Data Vault author cannot afford to recommend UUIDv7 until UUIDv7 is widely adopted. The proposed paragraph provides guidance in choosing a UUIDv7 version for DWH.

As an alternative, I might suggest adding to the Section 6.12. DBMS and Database Considerations new last paragraph:

In data warehouses, it is recommended to use UUIDv7 generated by the DBMS whenever possible. Hash-based UUIDs are not recommended because they change their value when the hash function argument changes unexpectedly.

LiosK commented 1 year ago

v3 and v5 are stable. The new RFC won't change that fact. We shouldn't add a paragraph that puts it in doubt.

sergeyprokhorenko commented 1 year ago

v3 and v5 are stable. The new RFC won't change that fact. We shouldn't add a paragraph that puts it in doubt.

Use v3 and v5 outside data warehouses. We can add a phrase that previously generated UUIDs of any version do not need to be recoded to UUIDv7.

The vast majority of developers put any UUID in doubts. They use auto-increment (integers). Now let's pray for auto-increment like for v3 and v5?

If you google you will see that everyone is looking forward to UUIDv7, although they have had the opportunity to use v3 and v5 for 18 years.

LiosK commented 1 year ago

You know everyone is waiting for v7, so we can neutrally introduce v7. Many people will just adopt v7. You don't need to be mean to v3/v5 at all.

v7 and v3/v5 solve different problems. Implementers will pick one up based on their criteria. We don't need to lecture or steer them towards v7.

Plus, you don't need to pray for auto-increment keys. Where applicable, a 32-bit auto-increment key can be way better off than v7 because it takes only 1/4 space of v7, which leads to much better index size and locality. Personally, I don't really understand why you stick to v7 so much in the DWH area. Auto-increment keys seem much better than v7 when decentrality is not the requirement.

sergeyprokhorenko commented 1 year ago

You don't need to be mean to v3/v5 at all.

You need to read my argument about Data Vault again.

LiosK commented 1 year ago

Read it and came to the same conclusion. You don't need to be mean to v3/v5. If v3/v5 are not useful, implementers just don't use them, no matter what the RFC says.

sergeyprokhorenko commented 1 year ago

We don't need to lecture or steer them towards v7.

The main goal of any standard is to steer people. But the true goal of this standard is popularisation of v7 because perfect ULIDs have not support from a standard and therefore ULIDs are not so popular.

LiosK commented 1 year ago

the true goal of this standard is popularisation of v7

So the new RFC introduces v7 as a rigid standard. That's it. What else?

sergeyprokhorenko commented 1 year ago

Personally, I don't really understand why you stick to v7 so much in the DWH area. Auto-increment keys seem much better than v7 when decentrality is not the requirement.

I cannot express my 20 years experience in one answer. You have to believe. Auto-increment sooner or later leads to chaos.

sergeyprokhorenko commented 1 year ago

the true goal of this standard is popularisation of v7

So the new RFC introduces v7 as a rigid standard. That's it. What else?

Popularisation! It's not enough to add v7 to the text. We need to describe its advantages and application

sergeyprokhorenko commented 1 year ago

Read it and came to the same conclusion. You don't need to be mean to v3/v5. If v3/v5 are not useful, implementers just don't use them, no matter what the RFC says.

Implementers and their bosses obey standards because don't want to be fired

LiosK commented 1 year ago

An RFC is not an ad but is a technical standard. It must focus on the technical point of view. Biased favor to v7 without rigid technical grounds will harm the overall integrity of the new RFC. UUIDv7 isn't the single almighty identifier methodology for DWH or any other use cases; every identifier scheme has its own advantages, disadvantages, and use cases. An RFC must not steer everyone to v7 where v7 might not be the best option for their specific use cases. This attitude would be even more important if implementers and their bosses were to obey standards.

sergeyprokhorenko commented 1 year ago

An RFC is not an ad but is a technical standard. It must focus on the technical point of view. Biased favor to v7 without rigid technical grounds will harm the overall integrity of the new RFC. UUIDv7 isn't the single almighty identifier methodology for DWH or any other use cases; every identifier scheme has its own advantages, disadvantages, and use cases. An RFC must not steer everyone to v7 where v7 might not be the best option for their specific use cases. This attitude would be even more important if implementers and their bosses were to obey standards.

These are arguments like "any shitcode has the right to exist if the developer likes it so much. It's nothing that billions are spent on correcting defects. You can't ban the recipes of healers and advertise evidence-based medicine!"

As for the technical point of view, I repeat: hash-based UUIDs are not recommended because they change their value when the hash function argument changes unexpectedly. I have not received any justified objections to this claim.

danielmarschall commented 1 year ago

I don't understand this discussion. Name-based have a completely different use case than time-based and random-based. If name-based is useful or not completely depends on which application you are working. If you are working a lot with Data Warehouse, then it is your preferred UUID version.

If the input of a name-based UUID changes, then the output changes. Yes. That is stating the obvious.

That being said, personally I prefer UUIDv6 over UUIDv7 due to the better time resolution, plus I think including the MAC identifying the generating system is a pro not a con, especially when debugging. If you genereate UUIDv7 in a batch script you have to include a 1ms delay to make sure that the IDs are really ordered. 1ms is a long time if you do batch processing of a bunch of data (DWH?)...

sergeyprokhorenko commented 1 year ago

every identifier scheme has its own advantages, disadvantages, and use cases

This statement is completely contrary to reality. UUIDv7 was created precisely to eliminate the terrible disadvantages of previous versions. Because of these disadvantages, UUIDs are not widely used as database keys. You won't find any advantage of other versions over UUIDv7. The exception is cases when hash-based UUID allows you to quickly bungle an unreliable information system. It would be correct to deprecate previous versions, but this is impossible to do, since it is necessary to support legacy systems.

sergeyprokhorenko commented 1 year ago

@danielmarschall

If you are working a lot with Data Warehouse, then it is your preferred UUID version.

It's not my opinion only. UUIDv7 is objectively the best for Data Warehouses.

If the input of a name-based UUID changes, then the output changes. Yes. That is stating the obvious.

It's right. I was waiting for this from @LiosK

That being said, personally I prefer UUIDv6 over UUIDv7 due to the better time resolution

No. UUIDv7 with counter has better time resolution than UUIDv6. But they have both the same locality in DBMS. This means the same performance

plus I think including the MAC identifying the generating system is a pro not a con, especially when debugging.

No. Mac addresses are not unique and they pose a cyber threat

If you genereate UUIDv7 in a batch script you have to include a 1ms delay to make sure that the IDs are really ordered.

No. Absolute monotony is impossible and unnecessary. So no delays

LiosK commented 1 year ago

If the input of a name-based UUID changes, then the output changes. Yes. That is stating the obvious.

It's right. I was waiting for this from @LiosK

Perhaps, I misread this point. Yes, the hash output changes if the input changes. Hash-based versions are not recommended for use cases where the input may change. That's self-evident from the spec, and we don't need to add any paragraph. For the other use cases (including some DWH), v3/v5 is viable. It's up to implementers to choose the best solutions for their problems.

These are arguments like "any shitcode has the right to exist if the developer likes it so much. It's nothing that billions are spent on correcting defects. You can't ban the recipes of healers and advertise evidence-based medicine!"

Other UUID versions as well as auto-increment keys are not sh*t code. They have different use cases and solve different problems.

That said, what's wrong with sh*t code? It's up to implementers. It's their freedom. An RFC is not for kindergarten kids. You don't need to lecture them.

I know how much you love v7. But v7 is just one of many right tools to solve specific problems, and so are other alternatives. It shouldn't be a goal of the new RFC to build a v7 empire in the DWH world, or in other words, few people contributing to the project will agree with that direction.

danielmarschall commented 1 year ago

This statement is completely contrary to reality.

No. As @LiosK agrees with me, there are several use-cases for several UUID versions.

Basically, I think there are two types of UUIDs (and possibly just two categories of use cases):

The fact is that the main area of application of UUID is data warehouse (DWH), especially in banks.

UUIDs are used in nearly every software project in this world, not just data warehouse. I don't know if there is a statistic of the "top 10 use cases", but I think one of the most important are the GUIDs for Microsoft COM. Without the thousands of pre-installed COM interfaces on each system, Windows wouldn't even boot.

That being said, personally I prefer UUIDv6 over UUIDv7 due to the better time resolution, plus I think including the MAC identifying the generating system is a pro not a con, especially when debugging. If you genereate UUIDv7 in a batch script you have to include a 1ms delay to make sure that the IDs are really ordered. 1ms is a long time if you do batch processing of a bunch of data (DWH?)...

In re my previous comment about time resolution of UUIDv7, I retract that argument because I learned about the sub-millisecond bits. Occuping 12 more time bits results in a resolution of 245 nanoseconds which is very nice.

I still love machine identifiers in my UUID though, and EUI are unique. I am just assuming that every machine has a network card. I think I never saw a computer without network card before (except for my first DOS computer), and probably you can't even buy a computer without network card today, because they are already included on the mainboard.

sergeyprokhorenko commented 1 year ago

@danielmarschall

EUI are unique

No. It turned out that some motherboards can have the same MAC addresses: https://www.networkworld.com/article/2340364/are-mac-addresses-really-unique-.html

sergeyprokhorenko commented 1 year ago

@LiosK

Hash-based versions are not recommended for use cases where the input may change.

I'm glad that you don't deny the obvious now

For the other use cases (including some DWH), v3/v5 is viable.

No. DWH is exactly the use case where any identifier (i.e. input of hash function) can change. No one is immune from changing identifiers. This is confirmed by many years of practice. Nothing is perfect and unchangeable. Billions are being spent on finding errors and making changes.

But the variability of identifiers is not at all obvious, so the requirement not to use hash-based UUIDs for DWH should be written into the RFC.

It's up to implementers to choose the best solutions for their problems... That said, what's wrong with sh*t code? It's up to implementers. It's their freedom. An RFC is not for kindergarten kids. You don't need to lecture them.

No. This is a standard, not a supermarket, where you can choose whatever you want. For example, you cannot choose the unit of length in the documentation: meter, feet or parrot.

The customer doesn't want to buy shitty code. The customer wants to insure against the arbitrariness of the contractor, and therefore demands compliance with the standard. If the standard says "do what you want", then such a standard is useless, and a voluminous and precise specification will be used instead. The value of UUID will be devalued, as it already was 18 years ago.

Few people contributing to the project will agree with that direction.

I don't think so. And by the way, voting cannot change reality. And the authors and those who approve the RFC, not the contributors, bear responsibility for the quality of the RFC.

LiosK commented 1 year ago

DWH is exactly the use case where any identifier (i.e. input of hash function) can change.

It depends on individual DWHs. It's the DWH designer's responsibility to choose a right identifier scheme for their own DWH.

The customer wants to insure against the arbitrariness of the contractor, and therefore demands compliance with the standard.

The customer can simply say "use UUIDv7". The standard is just there. That's enough. I have no idea what problem you are solving.

kyzer-davis commented 1 year ago

I modified the title of this one to cover the topic of the issue.

To note, I did make some proposed changes via https://github.com/ietf-wg-uuidrev/rfc4122bis/commit/529860f21c3b4da22210ebd35a083c7c27e4225a but these have been removed for the time being. The idea was to call out how v3/v5, with inputs that change could cause bad performance to a database.

IMO not as an author, I am not sure this is entirely needed but the commit itself covers the point (which is more or less the same as v4 in terms of impact on a database.) So it fit nicely in bullet 1 of update motivations.

Again, this time as an author, I want to re-iterate, we should not overthink/over-iterate on this. Weigh the pros/cons of having this in the doc vs not having it in the doc.

sergeyprokhorenko commented 1 year ago

Explanations:

https://github.com/ietf-wg-uuidrev/rfc4122bis/issues/157#issuecomment-1737799318

https://github.com/ietf-wg-uuidrev/rfc4122bis/issues/157#issuecomment-1737958679

bradleypeabody commented 1 year ago

My take on this is I think it's fair point that database designers are tempted to use hash-based UUIDs because of their existence, but then later find out that this is a bad design choice because inputs they thought would not change actually do. I've run into this myself a number of times.

That said, since we're not removing v3/v5 from the spec, it's physically possible for people to use hash-based IDs in databases - the only control we have over the process is what advice we give in the RFC. So it boils down to the same thing: what do we want to say about this in the document?

How about adding this paragraph to the "DBMS and Database Considerations" section:

Designers of database schema are cautioned against using hash-based UUIDs as primary keys in tables. A common issue that has been observed in database schema design is the assumption that a particular value will never change, which then later turns out to be an incorrect assumption. Postal codes, license or other identification numbers, and numerous other such identifiers, even those which seem or maybe even are unique and unchanging at a given point time, can then later turn out to have edge cases where they need to change, thus invalidating your database structure. In such scenarios it is noted that using any non-hash-based UUID version would have resulted in the field in question being placed somewhere (not as a primary key) that would have been easier to adapt to such changes. The same arguments that result in the general advice to use surrogate keys instead of natural keys apply here as well.

(It would be nice, although not necessarily required, to cite a good source for this advice, something a bit more authoritative than wikipedia.)

bradleypeabody commented 1 year ago

Sorry, didn't mean to close. Somehow I hammered the wrong button.

sergeyprokhorenko commented 1 year ago

I would only change the word hash-based for name-based according to https://github.com/ietf-wg-uuidrev/rfc4122bis/issues/154.

kyzer-davis commented 1 year ago

I took @bradleypeabody's text, modified it a bit to fit the document and added it to the end of the section. See https://github.com/ietf-wg-uuidrev/rfc4122bis/pull/168/commits/ea92c30b98719cb0fdd77a29f6fd2a26714575c4

Will check next week on any text changes. My personal deadline for Draft-13 back to IETF is Oct 20th.