Draft 03: Recommend Database Side UUID Generation

sergeyprokhorenko commented 2 years ago

UUIDs SHOULD be generated by the database when inserting a record, rather than by applications. This ensures better UUID monotonicity in the database tables and in indexes. Additional application message IDs MAY be used to ensure integrity and feedback.

kyzer-davis commented 2 years ago

@sergeyprokhorenko,

Would this text fit in Section 5.8? If possible use the new change proposal issue template to help me iterate on edits quickly! :)

broofa commented 2 years ago

I find this text oddly specific, and fundamentally at odds with the "no centralized authority" premise for UUIDs.

Heck, one of the biggest use cases - arguably the biggest use case - for UUIDs is in mobile applications where the client app needs to create new records while offline that will later be synced to the database.

sergeyprokhorenko commented 2 years ago

@broofa I understand your emotions. If UUIDs are used, then the ease of merging records from different sources into one table is fascinating. But if the sources generate thousands of UUIDs in every tick of the timestamp, then this simplicity can significantly worsen the monotonicity of UUIDs in the table index.
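The monotonicity concern can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the key layout (a tick number in the high bits, a per-tick counter in the low 32 bits), the round-robin arrival order, and all function names are hypothetical and are not taken from the draft.

```python
import random

TICK_SHIFT = 32  # assumption: low 32 bits hold the per-tick counter in this toy layout


def single_generator_keys(ticks, per_tick):
    """One generator: a single counter breaks ties inside each timestamp tick."""
    keys = []
    for tick in range(ticks):
        for counter in range(per_tick):
            keys.append((tick << TICK_SHIFT) | counter)
    return keys


def merged_source_keys(ticks, per_tick, sources, rng):
    """Several independent sources, each monotonic on its own.

    Each source starts its per-tick counter at its own random offset, and
    the keys arrive at the table interleaved round-robin.
    """
    keys = []
    for tick in range(ticks):
        offsets = [rng.getrandbits(24) for _ in range(sources)]
        for n in range(per_tick):
            for offset in offsets:
                keys.append((tick << TICK_SHIFT) | (offset + n))
    return keys


def out_of_order_inserts(keys):
    """Count inserts whose key is smaller than the previously inserted key."""
    return sum(1 for prev, cur in zip(keys, keys[1:]) if cur < prev)


rng = random.Random(0)
single = single_generator_keys(ticks=5, per_tick=3000)
merged = merged_source_keys(ticks=5, per_tick=1000, sources=3, rng=rng)
print("single generator:", out_of_order_inserts(single))
print("three sources:   ", out_of_order_inserts(merged))
```

The single-generator count is zero by construction. The merged stream is not strictly increasing even though every individual source is, which is the effect on index insert order described above. How much that matters in practice depends on the storage engine and workload, which this sketch does not model.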

In addition, such a decentralized architecture is only suitable for storing non-normalized data (for error logs, message queues, fact tables). During data normalization, table records are generated in the database itself. Therefore keys for such records need to be generated in the database itself.

As you can see, I used the word SHOULD, not MUST. This proposal is for both multi-node high performance systems and for single-node systems. If high performance is not important (for example, if the data will be transferred to other tables for later reading), then UUIDs MAY be generated on the application side.

This proposal is intended to motivate DBMS developers to implement this new standard.

broofa commented 2 years ago

We should not be in the business of lecturing readers on how UUIDs should or should not be used. Any efforts we make in this direction will almost certainly be met with derision. For example, if a reader is concerned with the time at which a DB record is first created on a client device rather than the time it is inserted into the DB, your advice no longer applies. (And, if it does, the explanation for that should be treated as out of scope.)

Basically any use of terms such as "MAY", "SHOULD" or "SHOULD NOT" when describing how UUIDs might be used or what they might be used for should be considered a "code smell" where we are overstepping our bounds as spec authors. Such verbiage should be revisited so it focuses on why the decisions in this spec were made, with readers left to draw their own conclusions about how well such decisions will fit with their particular needs.

sergeyprokhorenko commented 2 years ago

@broofa You missed the main point: This proposal is intended to motivate DBMS developers to implement this new standard. Otherwise, the failure of ULIDs will repeat for UUIDv7E: ULIDs are implemented in all popular programming languages, but not in any DBMS.

broofa commented 2 years ago

If this proposal is to motivate people, it needs to do so by demonstrating its merits, not by dictating how people should or should not use UUIDs.

I would suggest alternate wording to what you propose, but relying on a DB to ensure monotonicity of UUIDs is effectively the same thing as the "Shared Knowledge System" in §5.3, which is already covered and (ironically?) described as "out of scope".

sergeyprokhorenko commented 2 years ago

> If this proposal is to motivate people, it needs to do so by demonstrating its merits, not by dictating how people should or should not use UUIDs.

@broofa Where did you notice dictating here? This is a reasoned recommendation based on real experience. And what rational arguments against it do you have, besides your dislike for "centralized authority" (i.e. for the database)?

Supporters of ULIDs and their analogues have tried unsuccessfully for ten years to demonstrate the merits of ULIDs to DBMS developers. Finally, the enthusiasts of this project wisely decided that the power of the standard sounded more convincing. And I want to help them with this.

> I would suggest alternate wording to what you propose, but relying on a DB to ensure monotonicity of UUIDs is effectively the same thing as the "Shared Knowledge System" in §5.3, which is already covered and (ironically?) described as "out of scope".

There is nothing in common between (1) generating UUIDs on the database side and (2) surrogates of MAC addresses of source systems in the form of Shared Knowledge System. You can't call the database a bottleneck, because the database is where all the data ends up going.

bradleypeabody commented 2 years ago

@sergeyprokhorenko I think, going back to the text you proposed, the issue is that while your suggestion is good for certain use cases, there are definitely plenty of use cases where the database cannot generate the ID, examples:

- Database system hasn't been updated to support UUID generation.
- Multiple applications or instances need to create and insert records without coordinating with each other first.
- Data is stored in multiple databases first and then merged later (e.g. regions each with their own database that are then centrally combined into some sort of warehouse - in which case which "database" does your text refer to?).
- Record IDs are generated in an environment without access to the database but the data is later stored in a database - an IoT device generating and transmitting information which is then, seconds or minutes later, stored in a database. The thing generating the ID has no access to the database at all.
- broofa's point about UUIDs being generated in web pages and then sent to a server is another good example.

That said, I suggest proposing different wording that encourages database vendors to make such functionality available so applications can take advantage of it, rather than saying that applications should use it - something that could work. For example:

"Database vendors are encouraged to provide functionality to generate and store UUIDs, so applications can easily produce unique, monotonic identifiers which have good index locality (time ordered)."

I don't see a problem with adding in something like that which I think is more directly targeted at the point: we want database vendors to implement UUIDs.
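As a rough illustration of what "unique, monotonic identifiers which have good index locality" could look like, here is a minimal single-process sketch. It assumes the UUIDv7 bit layout as later published in RFC 9562 (48-bit Unix millisecond timestamp, 4-bit version, 12 bits used here as a counter, 2-bit variant, 62 random bits), which differs from the draft 03 layout discussed in this thread. The class name and the counter-overflow handling are hypothetical choices, not part of any draft.

```python
import os
import time
import uuid


class TimeOrderedUUIDGenerator:
    """Hypothetical single-node generator of strictly increasing UUIDs."""

    def __init__(self):
        self._last_ms = -1
        self._counter = 0

    def next(self):
        ms = time.time_ns() // 1_000_000
        if ms <= self._last_ms:
            # Same tick (or the clock stepped back): reuse the last
            # timestamp and advance the counter instead.
            ms = self._last_ms
            self._counter += 1
            if self._counter > 0xFFF:
                # Counter exhausted: borrow the next millisecond.
                ms += 1
                self._counter = 0
        else:
            self._counter = 0
        self._last_ms = ms

        rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)
        value = (ms & ((1 << 48) - 1)) << 80  # 48-bit timestamp
        value |= 0x7 << 76                    # version 7
        value |= self._counter << 64          # 12-bit counter
        value |= 0b10 << 62                   # RFC variant
        value |= rand_b                       # 62 random bits
        return uuid.UUID(int=value)


gen = TimeOrderedUUIDGenerator()
first, second = gen.next(), gen.next()
print(first)
print(second)
print(first < second)
```

Because the timestamp and counter occupy the most significant bits, values from one generator instance sort in creation order. The sketch is not thread-safe and gives no ordering guarantee across separate processes or nodes.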

sergeyprokhorenko commented 2 years ago

@bradleypeabody

I propose this wording:

DBMS vendors are encouraged to provide functionality to generate and store UUIDv6, UUIDv7E and UUIDv8E-type identifiers as primary keys, surrogate keys for temporal databases, foreign keys including in polymorphic relationships, keys for key–value pairs (in JSON columns, key–value databases).

broofa commented 2 years ago

> Where did you notice dictating here?

Pretty much what Bradley said. By saying that ids should be generated by databases "rather than by applications", it prioritizes DBs over other sources.

Regarding the new proposed text, I'm fine with what @bradleypeabody proposes.

kyzer-davis commented 2 years ago

Good discussion, group.

Back to my original question so I can merge some text:

Would the proposal text fit in Section 5.8?

sergeyprokhorenko commented 2 years ago

@kyzer-davis No, I would add a new Database Side UUID Generation section just before or after the Distributed UUID Generation section.

I would suggest the following content:

DBMS vendors are encouraged to provide functionality to generate and store UUIDv6, UUIDv7E and UUIDv8E-type identifiers as primary keys, surrogate keys for temporal databases, foreign keys including in polymorphic relationships, keys for key–value pairs (in JSON columns, key–value databases).

Database side UUID generation ensures best UUID monotonicity in the database tables and in indexes.

Additional application message IDs MAY be used to ensure integrity and feedback.

peterbourgon commented 2 years ago

> Database side UUID generation ensures best UUID monotonicity in the database tables and in indexes.

I don't think you can make a generalizing statement like this. Databases aren't necessarily monolithic. Generating UUIDs in a distributed database can easily provide the same or worse monotonicity compared to generating them at the application layer.

sergeyprokhorenko commented 2 years ago

> > Database side UUID generation ensures best UUID monotonicity in the database tables and in indexes.
>
> I don't think you can make a generalizing statement like this. Databases aren't necessarily monolithic. Generating UUIDs in a distributed database can easily provide the same or worse monotonicity compared to generating them at the application layer.

I agree. Another wording:

If the database is monolithic, then database side UUID generation ensures best UUID monotonicity in the database tables and in indexes.

kyzer-davis commented 2 years ago

@sergeyprokhorenko, what if I renamed Section 5.8 to 'Database Considerations' instead of its current name? That way we can put all the database-specific things like this in one neat and tidy spot. I would much rather split opacity out to a 5.9 since the comma in the section has been bugging me.

sergeyprokhorenko commented 2 years ago

> what if I renamed Section 5.8 to 'Database Considerations' instead of its current name? That way we can put all the database-specific things like this in one neat and tidy spot. I would much rather split opacity out to a 5.9 since the comma in the section has been bugging me.

@kyzer-davis This is a good decision. I would rename it to DBMS and Database Considerations, because a DBMS is an acquired tool for creating and operating a database. Both the DBMS and the database require our attention.

kyzer-davis commented 2 years ago

@sergeyprokhorenko,

The section changes from my previous comment have been included in #85. Let me know what you think and let's discuss here.

uuid6 / uuid6-ietf-draft

Draft 03: Recommend Database Side UUID Generation #79