[RFE] `fields` and `time_*` properties must not be used on UUIDs that are time-agnostic.

picnixz commented 5 months ago

Currently, the uuid.UUID class features a fields argument and property which is a six-tuple of integers:

32-bit time_low,
16-bit time_mid,
16-bit time_hi_version,
8-bit clock_seq_hi_variant,
8-bit clock_seq_low,
and 48-bit node.

Currently, those fields are only relevant when the UUID version is 1 since UUIDv3 and UUIDv5 are based on MD5 and SHA-1 respectively instead. However, the recent RFC 9562, superseeding RFC 4122, introduces one time-based UUID, namely UUIDv6 (basde on UNIX epoch instead of Gregorian epoch, and with timestamp bits ordered differently), as well as UUIDv7 and UUIDv8 that are implementation details.

Here is what we can do for now:

For version 7, we can have:
- a cryptographically secure 74-bit chunk split into a 12 and 62-bit chunks, or
- a monotonous UUID with 12-bit sub-milliseconds precision chunk.
For version 8, we can have:
- a time-based UUID with 10-ns precision with a 60-bit timestamp and 62 bits of random data, or
- a name-based UUID which uses secure hashing algorithms such as SHA256/SHA-3/SHAKE-256 (see here for an example of SHA256-based UUIDv8), or
- a non-cryptographically 122-bit chunk split into chunks of 48, 12 and 62-bit independent chunks. Those chunks can also be supplied by the user if they want cryptographically secure values (although I would suggest generating a UUIDv4 and change the version and variant bits manually).

With the addition of those variants, we at least have one UUID distinct from UUIDv1 featuring time-related fields. In particular, it is important to decide whether fields[0] is the first RFC field in the UUID or if this is always the first 32-bit fields. I personally think that we should say that fields represents the RFC fields, namely, fields[0] is a 32-bit integer corresponding to the 32 LSB (resp. MSB) of the 60-bit timestamp for UUIDv1 (resp. UUIDv6).

For UUIDv7, if we choose sub-ms precision, then the fields are a bit different in the sense that we now have unix_ts_ms (48) | ver (4) | subsec_a (12) | var (2) | counter (62), so we should decide how to split those fields into 6 and whether it make sense to have the corresponding properties. A similar question arises for UUIDv8.

While we could change the semantics of the fields and time_* properties, this could break applications that assume that fields are independent of the time (my understanding of fields is that it is independent of the time or the RFC and is only a way to partition the UUID's integral value into 32+16+16+8+8+48 bits, but such partitioning is only relevant for UUIDv1).

Therefore, I really don't know how to deal with those time-based properties. I'd like to avoid breaking longstanding applications but at the same time I don't want a property to incorrectly reflect its value. If we don't change anything, uuidv6.time_low would actually return the 32 highest bits...

EDIT: Should this actually be a PEP? because UUIDv7 and UUIDv8 are implementation-detail so maybe a PEP might be a good idea?

89083
120650
121119

picnixz commented 5 months ago

@Eclips4 I don't know who to ask, but I know that you're a triager so I'm wondering if you know who should be the person to ask on that topic.

Eclips4 commented 5 months ago

https://devguide.python.org/core-developers/experts/ is states that uuid doesn't have an expert :( I will ask in the private core-dev chat whether anyone is interested in this issue.

gpshead commented 5 months ago

My first take on this: It is (1) a feature request. It is also (2) a clarification of the existing API and how it should be used, on the existing already supported UUID v1-5 types vs on proposed newly added support for v6-8. We're split across this issue and the already linked #89083 for this overall topic.

I agree that the fields: tuple makes sense to be "RFC fields" for a given UUID version. But for v1-5 we need to keep existing behavior as applications may depend on that even if we look at it oddly and don't think those fields line up with some of the v1-5 definitions (times on non-time based UUIDs for example?). This preserves backwards compatibility.

We cannot change the numeric indices of things in fields for use with UUID v1-5 ever. But we could update the named properties like time_low and time_high to be deprecated if accessed on UUIDs where those fields do not make any sense (DeprecationWarning) if there is appetite to prevent their access when irrelevant.

It may not make sense to bother with a deprecation though if maintaining their existing behavior and documenting when they should or shouldn't be used is enough. (ie: do we have evidence of their existence on irrelevant UUID versions as causing people to write buggy code? If so, deprecating them would help guide our users away from bugs!)

fields wise, I don't like having a tuple and numeric index based access at all as an API. That exposes internal implementation details. Is there ever a situation where people are supposed to prefer accessing something via a field raw number in practical code? I doubt it. It is much nicer to require named property only access to parts of a UUID.

The old tuple accesses are effectively a legacy of old-style Python API design as the uuid module was added in Python 2.5 when that was just how things were done because a common better way did not yet exist. namedtuple did not even exist (that showed up in 2.6) yet, and dataclasses were still a dream.

So given that... We should ask ourselves if a fields tuple even needs to be supported at all for UUID v6-8? It might for v6 if that is really intended to be a simulacrum of v1. But should it for v7-8? The oddity is that we have a single class UUID that is trying to represent all past, present, and future UUID types in one API. So it could get awkward for some code if it needs to conditionally use one API vs another based on the .version attribute. But realistically I believe that is already the case given the different internal data different UUID versions may contain.

How often does code need to accept arbitrary UUIDs of all possible versions at once? I'd expect most applications to only ever be dealing with a single known UUID type, or perhaps two.

I am decidedly not a UUID expert. I'm just playing one on TV. :P

picnixz commented 5 months ago

It is (1) a feature request.

Oups, sorry my bad for the incorrect category!

But for v1-5 we need to keep existing behavior as applications may depend on that even if we look at it oddly and don't think those fields line up with some of the v1-5 definitions (times on non-time based UUIDs for example?). This preserves backwards compatibility.

Definitely. I know that UUIDs can be used for long-standing preservation so if we change the semantics, we could break applications.

do we have evidence of their existence on irrelevant UUID versions as causing people to write buggy code?

That I cannot say. An UUID object is opaque and I can only hope that people will use it as a black box and not try to do black magic except for tests. The only field I'd expect them to use publicly and reliably is the time which gives you when the UUID was generated. I also expect them to create manually UUIDv1 using the fields argument but for other UUIDs, it doesn't make sense to use that field.

We should ask ourselves if a fields tuple even needs to be supported at all for UUID v6-8?

If I could have the final say, I'd hope not to have them! It's useless for constructing them from fields and it's incorrect to use the corresponding properties (the semantics don't match!).

But should it for v7-8?

Well.. it depends on the implementation. v7 and v8 can be made time-agnostic or not.

fields wise, I don't like having a tuple and numeric index based access at all as an API

Actually, an UUID is independent of its version and its ABNF is:

   UUID     = 4hexOctet "-"
              2hexOctet "-"
              2hexOctet "-"
              2hexOctet "-"
              6hexOctet

I think the reason why they were separated as such is historically based on UUIDv1 where the ABNF fields match the UUIDv1 fields and properties (the first four octets are fields[0], the 2 next are time_mid, then time_hi + version, the clock_seq + variant and the last one is the 48-bit node). But for version > 3, it's no more the case and we have a mismatch between fields[0] and the corresponding time-component.

Ideally, fields should be version-agnostic and fields[0] should always be the first 32-bits (in other words we should assume that fields represents the UUID's URN fields, not the time fields). However, the docs should be updated to reflect this decision.

In summary, I think fields should not be modified at all, even for future UUIDs. They should always reflect the UUID's URN I think. I'll revert the commits that touched the properties and adapt the test. It'll be easier to review and less prone to errors. For the other properties, I still don't know what to do with them...

encukou commented 5 months ago

I'm also not a UUID expert, but I suggest:

Right now, just add new constructor functions. Don't change the class, don't deprecate existing fields or add new ones.

In another PR, document how the field names are misleading. A docs-only PR can be backported to 3.13 and 3.12.

Then let's get back to this issue. IMO, we should avoid changing semantics if at all possible. Adding new fields is OK (especially if they're only on objects where they make sense); deprecating is possible but I'd rather avoid it if it's possible to use the existing API correctly.

picnixz commented 5 months ago

I like that plan. So this specific issue will be stalled until the other one is resolved. However, for the other one, we still need to decide which implementation to go for.

kyzer-davis commented 3 months ago

Hello, RFC 9562 author here.

Generally I agree with @encukou's approach.

I vote to leave fields as is and possibly note in the docs that it is really for v1. Sure it can be used for v3-5 but it derives time and node which wouldn't be applicable (and even in the code it is only ever used for uuid1's return.)

Then later, take more modern approach like a dataclass for each version. Something like uuid.UUIDv4(...) (since generate functions lack the "v" see: uuid.uuid4()) which can take similar inputs as uuid.UUID but the field are named nicely vs being at generic fields tuple.

Edit: Then again, playing devil's advocate, we don't really need version specific fields or as an example: the ability to get .random_b from a given UUIDv4.

[...] An UUID object is opaque and I can only hope that people will use it as a black box and not try to do black magic except for tests. The only field I'd expect them to use publicly and reliably is the time which gives you when the UUID was generated. I also expect them to create manually UUIDv1 using the fields argument but for other UUIDs, it doesn't make sense to use that field.

I had a small blurb in a section that follows your line of thinking: https://www.rfc-editor.org/rfc/rfc9562.html#name-opacity

python / cpython

[RFE] `fields` and `time_*` properties must not be used on UUIDs that are time-agnostic. #120878

89083

120650

121119