Discussion: Shared Knowledge Schemes and Node Identifiers

bradleypeabody commented 3 years ago

Premise: Despite the UUID name, "universally unique" values are only achievable with some sort of pre-arranged pattern that decides which systems will provide which values. Adding more bits and entropy to the random parts of a UUID reduces collision probability, but cannot eliminate it. And there is no such thing as "enough collision resistance for any use case anywhere". The use of the MAC address in UUIDv1 was an attempt to use this approach. The drawbacks with it were some security risk in exposing MAC addresses, and the fact that with virtual computing so commonplace these days MAC address uniqueness can no longer be guaranteed. But the core idea behind it is: if you want to guarantee uniqueness, you have to have some sort of global registry or rules system which ensures some part of the UUID will be different from any other system. I'll refer to this approach as "shared knowledge" for lack of a better term (i.e. the different machines on the network share the knowledge of the MAC address registry and depend on it for uniqueness guarantees).
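To put rough numbers on "reduces collision probability, but cannot eliminate it", here is a minimal sketch using the standard birthday approximation, p ≈ 1 - exp(-n(n-1) / (2 * 2^b)), for n values drawn uniformly from b random bits. The function name and the sample sizes are illustrative assumptions; 122 is the number of random bits in a UUIDv4.

```python
import math

def collision_probability(n: int, random_bits: int) -> float:
    """Approximate probability of at least one collision among n values
    drawn uniformly at random from 2**random_bits possibilities
    (birthday approximation)."""
    space = 2 ** random_bits
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-n * (n - 1) / (2 * space))

# More random bits shrink the probability, but it never reaches zero.
for bits in (62, 74, 122):
    print(bits, collision_probability(10**9, bits))
```

However many bits are added, the result for n > 1 stays strictly positive, which is the point of the premise.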

Question/proposition: Should the draft/spec indicate some means of guaranteeing global uniqueness through shared knowledge? While the approach of using the MAC address is flawed, it does address a vital need and exposes a problem with the current proposal: we still don't have guaranteed universally unique identifiers.

Prior art: A number of "this MUST be unique everywhere" registry systems exist (it's different from UUID generation but the core problem is the same). There are probably a lot more such systems than I'm aware of. But just to list a few:

Domains
IP addresses (v4 and probably now v6 is much more relevant)
USB vendor IDs
JEDEC for flash memory (although it's only 8 bits devoted to the organization ID :) )
And of course, MAC addresses

I'm sure there are many more. But you see what's happening here? When people REALLY need uniqueness, they create a registry, because there isn't any other way to achieve uniqueness guarantees.

Possible approaches:

Recommend the use of MAC addresses, like UUIDv1 did. (Problem: MAC addresses don't have uniqueness guarantees these days - random data provides better collision resistance against possible MAC address duplication)
Use IPv6 addresses. (Problem: not always available, and they are really long - 128-bits. Compression of zero values is possible, but still...)
Make a registry. This is the only practical course I can see so far, but the question is whether it's worth the effort. For example, a registration scheme could be created so that a prefix could be allocated that is guaranteed to be unique and not assigned to anyone else. Such a value, placed starting at some standardized position in the UUID, could provide a full 100% uniqueness guarantee. It could have a variable-length encoding, so as the numbers grow larger (e.g. into the billions), it would take up a few more bytes in the UUID, but would still guarantee uniqueness. An encoding could be arrived at which ensures that later registry numbers with longer encodings don't conflict with earlier ones that happen to be followed by various random byte patterns - I'll lay this out in detail if there's enough agreement on the idea.
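The encoding for the registry option was never laid out in this thread, so the following is only a sketch of one way it could work: a LEB128-style varint, where the high bit of each byte says whether another byte follows. Because the last byte of every encoded prefix has its high bit clear, no valid prefix can be the start of a longer valid prefix, whatever random bytes come after it. The function names, the 16-byte total length, and the omission of version/variant bits are all assumptions made for illustration.

```python
import os
from typing import Tuple

def encode_registry_id(registry_id: int) -> bytes:
    """Encode a non-negative registry number as a LEB128-style varint."""
    if registry_id < 0:
        raise ValueError("registry_id must be non-negative")
    out = bytearray()
    while True:
        low7 = registry_id & 0x7F
        registry_id >>= 7
        if registry_id:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: this is the last byte
            return bytes(out)

def decode_registry_id(data: bytes) -> Tuple[int, int]:
    """Return (registry_id, prefix_length_in_bytes) read from the start of data."""
    value = 0
    for i, byte in enumerate(data):
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, i + 1
    raise ValueError("truncated prefix")

def make_id(registry_id: int, total_len: int = 16) -> bytes:
    """Hypothetical layout: registry prefix followed by random bytes."""
    prefix = encode_registry_id(registry_id)
    if len(prefix) >= total_len:
        raise ValueError("prefix leaves no room for random bytes")
    return prefix + os.urandom(total_len - len(prefix))
```

Registry numbers below 128 cost one byte, numbers below 16384 cost two, and so on, so early registrants keep more random bits while later ones remain distinguishable.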

Thoughts? If we really want uniqueness, we need to confront the fact that there's only one way to guarantee it.

EDIT:

Looking at the language of RFC4122 again, it seems like they specifically wanted to avoid any central registry, but made an exception for MAC addresses:

One of the main reasons for using UUIDs is that no centralized authority is required to administer them (although one format uses IEEE 802 node identifiers, others do not). As a result, generation on demand can be completely automated, and used for a variety of purposes....

Indeed that's probably one of the main arguments against any such registry: It's inconvenient. And also, if we've learned anything from MAC addresses: If anything goes wrong with the system, using random data (which is generally fast, easy and pretty much always available) is better anyway, i.e. lower collision probability.

sergeyprokhorenko commented 3 years ago

What about an account or license number (hashed) for the DBMS or operating system? But I still believe that a random number is more convenient.

bradleypeabody commented 3 years ago

@sergeyprokhorenko Unfortunately DBMSs and OSs do not contain any values that I'm aware of that have global uniqueness properties. The closest thing I can think of is an IPv6 address, but that's not always available, and is not guaranteed to be unique in cases where multiple programs are running on the same machine. I agree that using a random number instead is convenient and should always be an allowed option.

edo1 commented 3 years ago

Recommend the use of MAC addresses
Use IPv6 addresses.

This is a non-universal solution (for example, the host may have no public IPv6 address, or the application may not be allowed to read it), and it lacks security (the disclosure of IP/MAC addresses can be considered a vulnerability).

Problem: MAC addresses don't have uniqueness guarantees these days - random data provides better collision resistance against possible MAC address duplication

Blocks of universally unique MAC addresses/Ethernet addresses are assigned by the IEEE Registration Authority (RA). This clearly shows that a registry is a bad idea for a uniqueness guarantee.

IMO a hash of a user-provided string is a better idea. It is the responsibility of the user to create a unique string. The standard can contain a suggestion, e.g. something like Java package/class names could be advised: com.somenicecompany.dbms.sales.invoice Yes, this does not guarantee the impossibility of collisions. But the probability of a collision will be so small that it can be considered negligible.

BTW, UUID v3/v5 takes the same approach.
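As a small sketch of the hash-of-a-string idea, using Python's standard `uuid` and `hashlib` modules: `uuid.uuid5` hashes a namespace plus a name with SHA-1, and the helper below derives a fixed-width node value from the same string. The choice of `uuid.NAMESPACE_DNS` as the namespace, the helper name `node_from_name`, and the 48-bit width are assumptions for illustration, not something the thread or the draft prescribes.

```python
import hashlib
import uuid

# Hypothetical user-chosen string, following the package-style suggestion above.
name = "com.somenicecompany.dbms.sales.invoice"

# Name-based UUID: the same (namespace, name) pair always yields the same value.
source_uuid = uuid.uuid5(uuid.NAMESPACE_DNS, name)

def node_from_name(name: str, bits: int = 48) -> int:
    """Derive a `bits`-wide node value from a string (illustrative helper)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    # Keep the most significant `bits` bits of the 256-bit digest.
    return int.from_bytes(digest, "big") >> (256 - bits)

print(source_uuid)
print(hex(node_from_name(name)))
```

Two systems that pick different strings can still collide after hashing, so this lowers the probability rather than guaranteeing uniqueness, as noted above.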

bradleypeabody commented 3 years ago

I'll also just add here as another important factor: One should consider what the cost/penalty of getting a duplicate UUID is.

This can be vastly different in different scenarios. Some examples with my opinion on the importance:

Low Impact: A UUID generated for a single network transmission could cause, e.g. a duplicate log entry. If one is logging billions of items and every once in a blue moon there is one duplicate that makes some statistic derived from this data a tiny bit wrong - in many cases that is going to be completely acceptable. (If we're getting detailed about it, one could easily argue that the percentage of error introduced by such a scenario is much less than the margin of error already present in the original measurement - think about things like sensors for temperature or weather conditions - the instruments themselves taking the measurements are far from perfect)
Medium Impact: A duplicate database key causes an insert to fail, but it will probably be retried and succeed. This is definitely not good from a design perspective and one certainly wouldn't want this to happen too often, and it is definitely an inconvenience to the application. But, if you consider the fact that databases already have strange events that happen due to other factors (disk failure, network partitioning, etc.), this case, as long as it doesn't happen often, doesn't seem so bad. Not great, but not a death sentence either.
High Impact: A duplicate key causes an airplane to get a wrong course and puts people's lives at risk. In this scenario, I think most people would agree that there is no margin for error and any deliberate action which increases this possibility is unacceptable. Instead of lowering collision probability, one needs to guarantee uniqueness. Now it's important to note that you don't actually need "global uniqueness" in this case - you just need uniqueness within the context of the application. A "shared knowledge" approach can easily be implemented without having to solve this on a global scale (e.g. assign each air traffic control tower a number and put that in a specific place in the "random" part of the UUID - problem solved).
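A minimal sketch of the "assign each tower a number" idea, under assumptions of my own: a 16-bit site number stored directly below the variant bits, with the version nibble set to 8 (the vendor-specific version mentioned later in this thread). The function names and the bit positions are hypothetical, not taken from the draft.

```python
import secrets
import uuid

SITE_BITS = 16
SITE_SHIFT = 46  # assumed position: directly below the 2 variant bits

def site_scoped_uuid(site_id: int) -> uuid.UUID:
    """Build a 128-bit value with an application-assigned site number embedded."""
    if not 0 <= site_id < 2**SITE_BITS:
        raise ValueError("site_id out of range")
    value = secrets.randbits(128)
    value = (value & ~(0xF << 76)) | (0x8 << 76)  # version nibble = 8
    value = (value & ~(0x3 << 62)) | (0x2 << 62)  # RFC 4122 variant bits = 10
    mask = ((1 << SITE_BITS) - 1) << SITE_SHIFT
    value = (value & ~mask) | (site_id << SITE_SHIFT)
    return uuid.UUID(int=value)

def site_of(u: uuid.UUID) -> int:
    """Read the embedded site number back out."""
    return (u.int >> SITE_SHIFT) & ((1 << SITE_BITS) - 1)
```

Values from different sites can never be equal because they differ in the site field. Within a single site, uniqueness still rests on the remaining random bits unless a local counter is used instead.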

Just more food for thought when we talk about the motivations for actual real "global uniqueness".

bradleypeabody commented 3 years ago

@edo1 Agreed on IP addresses, it's not practical in many cases.

This clearly shows that a registry is a bad idea for a uniqueness guarantee.

How so? I mean I agree I wouldn't want to have to go through such a registration process very often, but if the stakes are high enough for a given application, I don't see why not. Definitely it's not something everyone would be doing, for sure.

IMO hash of user-provided string is a better idea.

Yeah this basically boils down to the application should pick a sensible approach and stick to it - hashes are certainly one way to go.

edo1 commented 3 years ago

How so?

I have faced MAC address conflicts a few times. Therefore, the existence of a MAC address range registry does not guarantee MAC address uniqueness. My guess is that a random address of the same size would have a lower collision rate in the real world.

sergeyprokhorenko commented 2 years ago

The local entity type (10 bits) (i.e. database table alias) as the last field of the UUID would be a convenient additional piece of "shared knowledge". It can be used to quickly find database tables that contain this UUID.

sergeyprokhorenko commented 2 years ago

These values can be used as "shared knowledge":

Shard or Horizontal partitioning tag
Hash of source system name
Entity type (i.e. database table alias). This optional field with local values for the specific DB is intended to establish polymorphic relationships between DB tables of complex applications. It also may be used as an anchor name prefix in Anchor modeling or for search of database tables that contain this UUID
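A sketch of the entity-type field, assuming it occupies the lowest 10 bits of the 128-bit value as proposed above. The table aliases in `TABLES` and the function names are hypothetical; in practice they would be whatever the local database defines.

```python
import uuid

ENTITY_BITS = 10
ENTITY_MASK = (1 << ENTITY_BITS) - 1

# Hypothetical local mapping from entity type to table alias.
TABLES = {1: "securities_issue", 2: "portfolio", 3: "investor"}

def tag_entity(u: uuid.UUID, entity_type: int) -> uuid.UUID:
    """Overwrite the last 10 bits of a UUID with a local entity type."""
    if not 0 <= entity_type <= ENTITY_MASK:
        raise ValueError("entity_type must fit in 10 bits")
    return uuid.UUID(int=(u.int & ~ENTITY_MASK) | entity_type)

def entity_of(u: uuid.UUID) -> int:
    """Read the entity type back from the last 10 bits."""
    return u.int & ENTITY_MASK

key = tag_entity(uuid.uuid4(), 2)
print(key, "->", TABLES[entity_of(key)])
```

The field replaces 10 bits that would otherwise be random, so the lookup convenience is paid for with slightly less entropy per identifier.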

sergeyprokhorenko commented 2 years ago

@broofa, what is the reason for your dislike?

broofa commented 2 years ago

@sergeyprokhorenko Generally speaking, the problem with central authorities is that they require a complex architecture with requirements that are likely to be out of scope of a timestamp-based format. We'd be trying to shoehorn some form of as-yet-unknown identifier alongside the timestamp information without even knowing whether or not (for example) a timestamp is necessary in such a scheme. I mean... if you have a central authority, are you going to need a timestamp even?

For example, I believe Etsy uses a system where blocks of IDs are doled out to the various subsystems. In such a scheme, the best format to use would be something like the forthcoming "experimental" version 8, where the only defined bits are version and variant, and everything else is vendor-specific.

As for encoding entity-type information, I believe that is decidedly out of scope. That is not information that makes a meaningful contribution to uniqueness and should be handled external to the UUID format.

sergeyprokhorenko commented 2 years ago

@broofa I understand your motives. Almost everyone hates the world of large corporations. I do too, just like you. But we have to live with them and try to meet their needs, so as not to be left on the sidelines ourselves. The new RFC should also take into account the quite predictable needs of large corporations, and not leave it all to a vague version 8. It is necessary to list possible UUID components useful for large corporations in the new RFC. It will be like a set of options. It's really easy.

I will describe to you the real problems of one of the largest special depositaries. It monitors compliance with investment declarations by mutual funds and pension funds. All objects in its accounting and information system have their own UUID: securities issues, portfolios, investors, investment declaration rules, etc. UUIDs are the keys in all database tables.

The accounting and information system is very slow because the records in the database tables are not ordered by increasing UUID. They have to buy faster servers, or they could just use UUIDs ordered by creation date and time.
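To illustrate what "ordered by creation date and time" can look like, here is a sketch that puts a 48-bit Unix millisecond timestamp in the most significant bits and fills the rest with random data. The bit positions follow the UUIDv7 layout of RFC 9562 as I understand it; the function name and the optional `unix_ms` parameter are assumptions added for the example.

```python
import secrets
import time
import uuid
from typing import Optional

def time_ordered_uuid(unix_ms: Optional[int] = None) -> uuid.UUID:
    """Timestamp-prefixed UUID: 48-bit milliseconds, then mostly random bits."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    if not 0 <= unix_ms < 2**48:
        raise ValueError("timestamp does not fit in 48 bits")
    value = (unix_ms << 80) | secrets.randbits(80)
    value = (value & ~(0xF << 76)) | (0x7 << 76)  # version nibble = 7
    value = (value & ~(0x3 << 62)) | (0x2 << 62)  # RFC 4122 variant bits = 10
    return uuid.UUID(int=value)

earlier = time_ordered_uuid(1_000)
later = time_ordered_uuid(2_000)
print(earlier < later)
```

Keys generated this way sort by creation time at millisecond granularity, so new rows land near the end of an index instead of at random positions. Values created within the same millisecond are ordered only by their random bits.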

It's very easy and convenient to type any UUID into the search field to get a list of all tables containing it. But I can imagine how difficult it is to provide this, since there are no pointers to tables in the UUID itself.

In addition, the huge and unwieldy posting table contains dozens of fields, and they have to be resourceful not to add more fields. The problem could be easily solved with the help of polymorphic relationships between DB tables. But for this UUID keys must contain pointers to tables (i.e. database table alias).

broofa commented 2 years ago

@sergeyprokhorenko What you're describing has nothing to do with guaranteeing uniqueness via the use of a central authority.

sergeyprokhorenko commented 2 years ago

@broofa, "shared knowledge" (in the network) is a broader concept than "central authority"

broofa commented 2 years ago

We disagree. Central authorities serve the entire "Universe" (first "U" of UUID). Your shared knowledge example seems to be implicitly scoped to the "universe" (little "u") of systems that care about these depository tables. You haven't described how they agree on what table names there are, or what identifiers are to be used to refer to those table names.

If you want to argue your example is relevant to UUIDs, you'll need to point to some central authority that is responsible for allocating db table name identifiers to ensure their uniqueness across all systems everywhere.

sergeyprokhorenko commented 2 years ago

If you want to argue your example is relevant to UUIDs, you'll need to point to some central authority that is responsible for allocating db table name identifiers to ensure their uniqueness across all systems everywhere.

No. These UUID fragments are governed by the corporation and are used by the same corporation as any of its private classifiers. However, the UUIDs with these fragments are real UUIDs: they are unique and ordered. Thanks to fragments, they also get additional protection against collisions with UUIDs generated for other purposes. Such UUIDs can also be used outside of the generating corporation (with or without reading embedded UUID fragments).

kyzer-davis commented 2 years ago

3/1/2022 Update: The latest from Draft 03 on the topic of Shared Knowledge and Node Identifiers is found in the Distributed UUID Generation section.
