Discussion: SHA256 UUID Generation

kyzer-davis commented 1 year ago

Topic from Interim:

How does one create a Name-based UUID if MD5 or SHA1 are deprecated from a library/implementation, not available or otherwise deemed unsecure and cannot be utilized?

Restrictions:

Can't modify v5 to be SHA256. It must remain untouched.

Ideas

UUIDv9 (SHA256 Based UUID)

Totally Feasible as per https://github.com/ietf-wg-uuidrev/rfc4122bis/blob/main/editor-files/UUIDv3- v5-Testing.md?plain=1#L132
Ties up a new version in a type of UUID that is:
- "Not widely utilized or even implemented in many libraries." - Kyzer/Brad
- "Studies showed name-based UUIDs made up something like 1% of use cases" - Robert
- "Energy in previous draft was on time-based UUIDs and there was little to no discussion on name-based other than the original documents descriptions around these particular UUIDs is confusing" - Kyzer

Add Text and steer towards v8 as a use case

Paragraph or sentence that says if SHA256 is desired for any reason then the UUIDv5 steps can be utilized but the version must be incremented to v8 to utilize the experimental, vendor-specific implementation space
Consensus achieved on this proposal for steering text. Slated for Draft 02

LiosK commented 1 year ago

Add Text and steer towards v8 as a use case

+1

jimfenton commented 1 year ago

The problem I have using v8 for this is some of the use cases that were discussed for SHA1 and MD5: the desire for two nodes with the same object to be hashed to arrive at the same UUID. While it's possible to do this with v8 by prior arrangement, there must be a reason to have v3 and v5 specify the hash algorithm they use. One reason I can think of is if there is a future transition away from SHA256 to something else and it's important during a transition to know what UUIDs are generated with SHA256 and which with the newer algorithm.

mcr commented 1 year ago

I was wondering if we can/should pick some new vX, allocate 4 -bits to hash type, and put SHA256 there. (I also wonder if there is a way to wedge that into v3 or v5, but there probably isn't)

kyzer-davis commented 1 year ago

@mcr, @jimfenton: I thought about this a bit over the weekend and here are my thoughts summarized nicely.

Deprecate SHA1 and replace it with SHA256 as the "new" v5

Pro: Avoids new version, removes security considerations around SHA1. Cons: Forces legacy implementations to 'not RFC compliant' if they are old, lack compute. Furthermore, who the application using an existing v5 name may not be able to update the name across the application context very easily. Other: We opted to not replace v1 when we made v6, so I would be hesitant to replace v5 like this.

Cram SHA256 into v5

Pro: Avoids new version, removes security considerations around SHA1. Cons: Muddies v5 and can lead to some confusing scenarios where two peers calculate different outputs where one assumed SHA1v5 and one assumed SHA256v5. Other: To clarify, one such hypothetical scenario could be where Alice uses raw inputs to calculate a UUIDv5 and produced sha256-based output as an input to feed into some other application logic for further output computation. (Think some identity-name-based style HMAC like HTTP/SIP use). Then the Bob uses the same inputs, calculates a v5 sha1-based UUIDv5, feeds it into the same algo to further the process. The result is two outputs that differ and who knows what problem that causes. This is all assuming there are no out of band method to say "use sha256 and not sha1" etc.

Leverage SHA256 with v8

Pro: Leaves SHA1 UUIDs and implementations to operate how they have for the past 20 years. Cons: Possibly cause issues when SHA256 v8 co-exist with other UUIDv8s? Personally I don't think that is an issue; if we are doing name-based UUIDs in v8 there is likely some application logic shared between systems to know what algo to use in the first place (sha256 or some other like sha384 or even some next-gen quantum crypto hash algos). Other: To further clarify, the overlap of some other specific v8 time-based algo, some name-based algo or even that one guy who creates UUIDs that only contain readable words are not a problem. This is because in the grand scheme of the application context this use case for v8 likely wouldn't overlap or really inhibit another v8 implementations using them in some other manner. Basically UUIDv8 are really good within your application but are not guaranteed to the greater world by their design.

Allocate v9 for SHA256

Pro: Removes security considerations around SHA1 (can discourage v5 use and point v3 and v5 at v9). No ambiguity about usage. Cons: Likely to not be implemented very widespread, ties up a version (0 and 10-15 would be the only ones left). What do we do when somebody wants to use SHA384 or some some next-gen quantum crypto hash algos? Seems like a slippery slope I would like to avoid.

mcr commented 1 year ago

Kyzer Davis @.***> wrote:

Allocate v9 for SHA256

Allocate v9 for all-future hashes, with a hash-subtype.

jimfenton commented 1 year ago

@mcr That's also the direction I was thinking, although I don't have any first-hand knowledge on how these UUID versions are used.

ramsey commented 1 year ago

@mcr @jimfenton I came here to say the same thing. Perhaps some set of bits can be allocated as a hash ID. We would need to create a registry of hash IDs, and depending on the number of bits allocated for the hash ID, we'd be limiting the number of future hashes that could be added to the registry. In practice, this might never become an issue.

LiosK commented 1 year ago

One possible approach I'd suggest:

Predefine hash algorithm UUIDs and prepend one to name space ID and name

import hashlib
import uuid

# predefined by RFC 4122
NAMESPACE_DNS = uuid.UUID("6ba7b810-9dad-11d1-80b4-00c04fd430c8")

# predefined by new RFC (UUIDv4s I just made up for example)
ALGORITHM_SHA256 = uuid.UUID("3fb32780-953c-4464-9cfd-e85dbbe9843d")
ALGORITHM_SHA512 = uuid.UUID("e6800581-f333-484b-8778-601ff2b58da8")

# concatenate hash_algorithm_uuid + namespace_uuid + name; then hash
sha256 = hashlib.new("sha256")
sha256.update(ALGORITHM_SHA256.bytes)
sha256.update(NAMESPACE_DNS.bytes)
sha256.update(b"example.com.")
print("SHA-256:", sha256.hexdigest()[0:32], "(truncated)")

# ditto
sha512 = hashlib.new("sha512")
sha512.update(ALGORITHM_SHA512.bytes)
sha512.update(NAMESPACE_DNS.bytes)
sha512.update(b"example.com.")
print("SHA-512:", sha512.hexdigest()[0:32], "(truncated)")

# output:
# SHA-256: 564315c658dc181edb907cfa7d55605b (truncated)
# SHA-512: b096e1610da091aa73cdfe4d1132a5aa (truncated)

This approach ensures that the inputs to hash algorithms are unique per algorithm no matter what the namespace and name are, though such consideration could be meaningless for collision resistance because different hash algorithms are expected to produce very different sequences of bytes.

This approach doesn't allow us to identify the hash algorithm used by a given name-based UUID, but I'm not convinced that such reverse engineering is really necessary. It is anyway not possible to reproduce a name-based UUID without knowledge of the namespace ID and the original name, and with such knowledge, it is quite easy to determine the hash algorithm by trying all the few common algorithms. For the same reason, I am skeptical about allocating dedicated hash ID bits in the precious 128-bit space, especially when we have a different idea to guarantee uniqueness.

mcr commented 1 year ago

What @LiosK works for me. I just think that we will get pushback if we don't provide a way to use newer hashes in a deterministic way. I'm unclear what version would be used for the above approach.

LiosK commented 1 year ago

Although I don't have a strong opinion here, I believe a new name-based scheme is not worth a new version and should stay in the v8 because I'd totally agree with the points quoted by Kyzer:

"Not widely utilized or even implemented in many libraries." - Kyzer/Brad

"Studies showed name-based UUIDs made up something like 1% of use cases" - Robert

"Energy in previous draft was on time-based UUIDs and there was little to no discussion on name-based other than the original documents descriptions around these particular UUIDs is confusing" - Kyzer

A standard makes sense only when it coordinates multiple implementations to interoperate with each other. It might logically look flawed to define deprecated algorithm-based v3 and v5 only, but, if there are few people wishing for an updated name-based scheme, it wouldn't be really helpful to introduce a new standard just to fix the flaw.

Plus, v3 and v5 used to be the only mechanisms that could incorporate application-specific ID (name) schemes into the UUID space, but now we have v8 and can include whatever application-specific information in a UUID. I'd anticipate fewer use cases of v3 and v5 after the introduction of v8.

I'm also concerned about the inactive discussion so far over name-based schemes. I'm not even sure if truncating hash digests is a safe, secure, and valid approach to produce a universally unique identifier.

Anyway, we can perhaps add the discussion here to the best practice section and see if new name-based practices emerge.

ramsey commented 1 year ago

I'm not even sure if truncating hash digests is a safe, secure, and valid approach to produce a universally unique identifier.

I've often wondered this about v5. Introducing even longer hashes probably stands a greater chance of seeing repeating characters at the beginning of the hash, especially with the truncation involved to fit within 128 bits, though I'm no expert on hashing algorithms.

mcr commented 1 year ago

I don't object to v3/v5: they were in rfc4122, and were current at the time. But, we need to be clear that there is a way to use newer hashes in a deterministic way. @LiosK 's suggestion, as a v8 method works for me.. Please just add two paragraphs somewhere about how to do that. Not sure we need python code, but I don't object to it.

kyzer-davis commented 1 year ago

Just caught up on the thread.

Let me take a pass tomorrow at implementing some text around what @LiosK discussed so we can add some "best practice" logic to future hash based UUIDs in v8 bit space.

Also, @ramsey, totally agree. SRTP went with an approach back in the day of truncating the SHA to 32 or 80 length for the early SRTP Crypto Suites and they updated that when they added new algos: https://www.rfc-editor.org/rfc/rfc7714#section-13.2 Somewhat unrelated to the context but just one such example that comes to mind.

Perhaps when I get back to UUID Long we can provide an alternative which do not need to be truncated: https://uuid6.github.io/new-uuid-encoding-techniques-ietf-draft/draft-00/#name-uuid-long

LiosK commented 1 year ago

By the way, it's interesting that FIPS 180-4 (SHA-1 and SHA-2 standard) explicitly permits to take leftmost bits of a message digest:

Some application may require a hash function with a message digest length different than those provided by the hash functions in this Standard. In such cases, a truncated message digest may be used, whereby a hash function with a larger message digest length is applied to the data to be hashed, and the resulting message digest is truncated by selecting an appropriate number of the leftmost bits. For guidelines on choosing the length of the truncated message digest and information about its security implications for the cryptographic application that uses it, see SP 800-107 [SP 800-107].

I'm not sure if the same discussion applies to SHA-3 as well. Plus, FIPS 180-4 is currently under review for revision, and some public comments expressed concern about truncation. So, this section of FIPS 180-4 might not survive in FIPS 180-5, but probably we can find some useful discussion about digest truncation around FIPS 180-4 resources.

Edit: Perhaps, we should also take a look at SP 800-90A to deep dive into name-based schemes. If we can derive 122-bit random (statistically independent and unbiased) data from a name in a deterministic manner, then we can construct a UUID from the random bits. SP 800-90A discusses deterministic random bit generators (DRBGs) and SHA-1/2 functions as building blocks of DRBGs.

kyzer-davis commented 1 year ago

My preference: Cite that NIST SP 800-90A document as another resource for using random in applications properly in our later sections.

On the FIPS180-4 comment: We could cite at least FIPS 180-4 as something that allows truncation at least in those versions. If they don't want to truncate guide towards v8 (which also covers us if they disallow truncating down the road.)

LiosK commented 1 year ago

Oh, I didn't mean the RFC should reference SP 800-90A. I was just like saying: if we were to develop a new name-based scheme seriously, we would have to prove that the new scheme guaranteed the universal uniqueness of the outputs, and the NIST doc would be helpful for that.

At a glance, SP 800-90A seems to rely on truncated hash digests as a source of random (i.e., statistically independent and unbiased) sequences of bits. If so (I mean, if each SHA function produces leading 122 bits in a statistically independent and unbiased manner), the approach I suggested previously will produce universally unique IDs even if it truncates digests and mixes the results from multiple SHA functions in one 128-bit ID space.

kyzer-davis commented 1 year ago

I pushed up https://github.com/ietf-wg-uuidrev/rfc4122bis/commit/83d95da3418a92a13dfa5222c30795244f24f565

This contains:

Forward reference to best practices section on how to actually create one of these values
A quick comment on name space best practice to hit #53 topic to allow implementers to allow custom input rather than just the ones we put have in the appendix.
A section on how to create a v8 UUID with hashspace+namespace+name and desired hash of choice.
An appendix with every modern SHA2/SHA3/SHAKE algo with a random v4 UUID allocated as a hashspace (md5 and sha1 omitted)
A name-based v8 test vector for SHA2-256 hashspace + DNS namespace + www.example.com
Renamed old v8 test vector to be time-based to better distinguish these two.

Testing:

Used my awful bash script.
A delta of @LiosK python code to use input of www.example.com rather than example.com. to be consistent with the MD5 and SHA1 test vectors.
- I also changed the bytes("www.example.com", "utf-8") since that is what the official python library does although I saw not observable difference.
Those tests are in the UUIDv3-v5-Testing.md file also in the commit.
Both output the same starting sha256 which is the only item I was trying to verify.

LiosK commented 1 year ago

Looks awesome!

Could sound obvious, but I think we should add some clarifying text to the Some Hash Space IDs section saying like when using SHAKE_128 or SHAKE_256, implementations must extract at least 128 bits for a digest, because these variable length algorithms may technically produce a digest shorter than 128 bits.

kyzer-davis commented 1 year ago

@LiosK,

Understood, I was not aware they are variable rate! I totally missed that last night looking at the SHA table here: https://en.wikipedia.org/wiki/Secure_Hash_Algorithms

Edit, updated as per https://github.com/ietf-wg-uuidrev/rfc4122bis/pull/60/commits/b094dfc0f75f53d8f3d63f9daccb8a3e3614959f

kyzer-davis commented 1 year ago

Question from @jimfenton on Interim:

Should we remove "or larger?" Does SHAKE-128 output of 128 differ from SHAKE-128 with output of 256"

Draft 02 Proposed Text:

An important note for secure hashing algorithms that produce variable rate outputs, such as those found in SHAKE, the output hash MUST be 128 bits or larger.

Testing

The first 128 bits that we are about are always the same and do not change even if you request an output of more bits. So it can be at or larger and be okay Tested using online tool, with openSSL+bash and cited from the original doc below. So we should be okay to keep the text "the output hash MUST be 128 bits or larger."

SHAKE-128

https://emn178.github.io/online-tools/shake_128.html

Input: Hello World
128: 1227c5f882f9c57bf2e3e48d2c87eb20
256: 1227c5f882f9c57bf2e3e48d2c87eb20f382a4b639b54d26f6d595ff3db9064d

kydavis@ubuntu-22:~$ echo -n "Hello World" | openssl dgst -SHAKE128
SHAKE-128(stdin)= 1227c5f882f9c57bf2e3e48d2c87eb20

SHAKE-256

https://emn178.github.io/online-tools/shake_256.html

Input: Hello World
128: 840d1ce81a4327840b54cb1d419907fd
256: 840d1ce81a4327840b54cb1d419907fd1f62359bad33656e058653d2e4172a43

kydavis@ubuntu-22:~$ echo -n "Hello World" | openssl dgst -SHAKE256
SHAKE-256(stdin)= 840d1ce81a4327840b54cb1d419907fd1f62359bad33656e058653d2e4172a43

Source:

https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.202.pdf A.2 Additional Consideration for Extendable-Output Functions

By design, the output length for an XOF does not affect the bits that it produces

Consequently, when two different output lengths are chosen for a common message, the two outputs are closely related: the longer output is an extension of the shorter output.

ietf-wg-uuidrev / rfc4122bis