UUID version for proprietary formats

broofa commented 3 years ago

Note: Using the term "proprietary format" here to refer to any uuid formats that are user-specific in a way that doesn't [yet] merit formal RFC specification. (This notion comes from the world of MIME types, where the MIME standard allows for non-standard types, given a suitable "tree prefix" on the type string.)

In #30, I've argued that the current version 8 spec is too vague to carry much meaning. It defines timestamp, node, and clockseq fields in the most general way possible, allowing for arbitrary bit lengths in all fields. But in doing so, it fails to meaningfully define any of these fields. Furthermore, the spec is prefaced with a variety of provisions about when version 8 should and should not be used, all of which scream, "ONLY USE THIS FOR EXPERIMENTAL OR PROPRIETARY FORMATS".

In short, I believe version 8 is at risk of falling afoul of the classic "wanting the best of both worlds, but getting the worst" design process. It's a spec that is overly strict for proprietary use cases, and overly vague for timestamp cases. Thus, I believe the right approach is to combine the existing versions 6-8 into a single, new, timestamp uuid version (as discussed in issue #30).

Then, for use cases that require a proprietary format, have a new version that is dedicated to that purpose, and that makes as few assumptions about the nature of the contained proprietary format as possible. In other words, the spec should boil down to the following:

variant bits as defined in RFC4122
version bits set to 1000 (version 8, or whatever version # ends up being used)
All other bits available for user data. The one provision here being that users are encouraged (but not required) to structure fields in a way that maximizes database locality (by placing the most stable fields / bits in the most significant bits of the UUID).
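
As a rough illustration of the three rules above, here is a minimal Python sketch that wraps a 122-bit user payload and fixes only the version and variant bits. The function name `proprietary_uuid` and the way the payload is split around the fixed bits are assumptions made for this example, not part of any draft; version 8 is used only because it is the number floated in this thread.

```python
import uuid

VERSION = 0b1000  # version 8, or whatever version number ends up being used
VARIANT = 0b10    # variant bits as defined in RFC 4122


def proprietary_uuid(payload: int) -> uuid.UUID:
    """Wrap a 122-bit payload; only version and variant bits are fixed.

    Hypothetical helper: the payload's most significant bits land in the
    most significant bits of the UUID, which is what helps DB locality.
    """
    if not 0 <= payload < (1 << 122):
        raise ValueError("payload must fit in 122 bits")
    high48 = payload >> 74             # 48 bits before the version nibble
    mid12 = (payload >> 62) & 0xFFF    # 12 bits between version and variant
    low62 = payload & ((1 << 62) - 1)  # 62 bits after the variant
    value = (
        (high48 << 80)
        | (VERSION << 76)
        | (mid12 << 64)
        | (VARIANT << 62)
        | low62
    )
    return uuid.UUID(int=value)
```

Because the payload's high bits map to the UUID's high bits, payloads that compare as smaller integers also produce UUIDs that sort earlier.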

edo1 commented 3 years ago

All other bits available for user data

Is there a reason not to use version=0b0100 (v4) for this?

broofa commented 3 years ago

Is there a reason not to use version=0b0100 (v4) for this?

There is a guarantee of uniqueness that stems from using "cryptographic quality" random number sources. That guarantee breaks down anywhere version 4 uuids are generated through other means.
Users may make assumptions about how v4 uuids are distributed in the uuid "space" that affect system performance. Witness the driving issue of DB locality as it relates to this spec.
If v4 (random) uuids get comingled with v4 (proprietary) uuids, there won't be any 100% reliable way of distinguishing between them. This could be problematic if, for example, a (random) v4 uuid is parsed as a semantically meaningful (proprietary) v4 uuid.
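
To make the third point concrete: with a dedicated version number, a reader can tell the two kinds apart from the version nibble alone. The sketch below uses Python's standard `uuid` module; the `classify` function and its labels are hypothetical, and it assumes version 8 is the number assigned to proprietary formats.

```python
import uuid


def classify(u: uuid.UUID) -> str:
    """Label a UUID by its version nibble (hypothetical labels)."""
    if u.variant != uuid.RFC_4122:
        return "non-RFC4122"
    if u.version == 4:
        return "random"
    if u.version == 8:  # assumed version number for proprietary formats
        return "proprietary"
    return "other"
```

If proprietary data were stored under version 4 instead, both branches would collapse into one and the caller could no longer know whether the remaining bits are random or meaningful.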

Note: One possible use case for a proprietary UUID format would be a "GIS UUID" that encodes latitude-longitude location.

fabiolimace commented 3 years ago

UUIDv4 is expected to be generated from truly random or pseudo-random numbers. There are some implementations that prepend the timestamp in the UUIDv4, like in COMB-GUID, but they are not a strict UUIDv4.

I think a version for proprietary formats is a useful way to keep people from using UUIDv4 for this purpose. UUIDv8 can be the version for proprietary formats.

edo1 commented 3 years ago

If v4 (random) uuids get comingled with v4 (proprietary) uuids, there won't be any 100% reliable way of distinguishing between them. This could be problematic if, for example, a (random) v4 uuid is parsed as a semantically meaningful (proprietary) v4 uuid

How could an application distinguish between the v7 sub-variants? And IMO UUID parsing is a bad idea at all.

Users may make assumptions about how v4 uuids are distributed in the uuid "space" that affect system performance.

So "v8 is any btree-friendly (sortable) application-defined UUID"?

Note: One possible use case for a proprietary UUID format would be a "GIS UUID" that encodes latitude-longitude location.

I doubt this is a good idea; something like a PostgreSQL GiST index should be used instead of a b-tree.

edo1 commented 3 years ago

There are some implementations that prepend the timestamp in the UUIDv4, like in COMB-GUID, but they are not a strict UUIDv4.

I understand this. But what is the difference (except sortability) between v4 and v8 for the reader? An application cannot rely on the internals of UUID v8 because there is no good way to know how a particular UUID was generated.

broofa commented 3 years ago

And IMO UUID parsing is a bad idea at all.

Alas, this isn't something any of us have control over. Users are going to do whatever they deem right. If this spec doesn't define a sandbox for users that want to encode/decode proprietary information in uuids then they're likely to do exactly what you suggest, and use version 4 (or 1 or 5 or 3 or whatever). That your first instinct was to suggest people use version 4 uuids for proprietary formats is a good demonstration of the problem.

A proprietary version would at least tell people where the guardrails are.

UUIDv8 can be the version for proprietary formats.

Are you suggesting the current v8 proposal works for this? As I noted above, I think it's overly restrictive in its current form.

fabiolimace commented 3 years ago

Are you suggesting the current v8 proposal works for this? As I noted above, I think it's overly restrictive in its current form.

IMO all restrictions can be removed from v8 except version and variant bits.

edo1 commented 3 years ago

IMO all restrictions can be removed from v8 except version and variant bits.

Even sortability is not required? IMO that is the error-prone way. The use of proprietary UUIDs should be avoided whenever possible.

My suggestion, by use case:

"just a unique identifier", v4 should be used; This algorithm only needs a good RNG to generate a collision-free identifier.
"reproducible identifier", v5 should be used;
"globally sortable time-based identifier", v7 should be used; This algorithm could be used to generate roughly time-sorted identifiers generated around the world (or for monotonic centrally-generated sequences).
"proprietary sortable identifier", v8 should be used (if it is kept in the final RFC version);
"proprietary non-sortable identifier", avoid this, use v4 or v5 instead.

kyzer-davis commented 3 years ago

@broofa

As I noted above, I think it's overly restrictive in its current form.

The goal was exactly what you and the others mentioned but I can relax it even more if required.

UUIDv8 is, at its core, a standards-based UUID layout with 122 bits for whatever proprietary sortable identifier an application requires.

The text, figures and definitions in that section are really there to share some creation examples and detail best practices we have learned from working with v6/v7 on the topics of timestamp, sequence, and node ordering to avoid sorting issues, along with considerations for timestamp length and sequence length (i.e., the more exact the timestamp, the less clock sequence is required).

If I wanted to abstract the layout definitions even further the text definitions could be:

segment_a - Everything from first bit to version (48 bits)
ver - 4 bits (1000)
segment_b - Everything from version to variant (12 bits)
var - 2 bits (Assuming 10)
segment_c - Everything after variant (62 bits)

        0                   1                   2                   3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                            segment_a                          |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |           segment_a           |  ver  |      segment_b        |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |var|                       segment_c                           |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                           segment_c                           |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

With this in place I can write a very simple "UUIDv8 Basic Creation Algorithm" or modify "General algorithm for generation of UUIDv8 not defined here" from that section. It may still be good to have a real example that has some bit allocations just to provide more context. The "48-bit timestamp, 12-bit sequence counter, 62-bit node:" example number 2 I currently have seems like a good contender, but if a general definition is enough I will drop them all.
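
For context, a basic creation algorithm over the segment_a / segment_b / segment_c layout might look like the Python sketch below. The names `pack_v8` and `unpack_v8` are hypothetical, and this is only one possible reading of the layout, not text from the draft.

```python
import uuid


def pack_v8(segment_a: int, segment_b: int, segment_c: int) -> uuid.UUID:
    """Pack the three free-form segments around the fixed ver/var bits."""
    for name, value, bits in (
        ("segment_a", segment_a, 48),
        ("segment_b", segment_b, 12),
        ("segment_c", segment_c, 62),
    ):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} must fit in {bits} bits")
    value = (
        (segment_a << 80)   # 48 bits, first bit up to version
        | (0b1000 << 76)    # ver: 4 bits
        | (segment_b << 64) # 12 bits, version up to variant
        | (0b10 << 62)      # var: 2 bits
        | segment_c         # 62 bits, everything after variant
    )
    return uuid.UUID(int=value)


def unpack_v8(u: uuid.UUID) -> tuple[int, int, int]:
    """Recover (segment_a, segment_b, segment_c) from a packed UUID."""
    n = u.int
    return n >> 80, (n >> 64) & 0xFFF, n & ((1 << 62) - 1)
```

Under the "48-bit timestamp, 12-bit sequence counter, 62-bit node" allocation, the timestamp would go in segment_a, the counter in segment_b, and the node in segment_c, so UUIDs compare first by timestamp and then by counter.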
