httpwg / http-extensions

Hash format in HTTP Headers #2781

Closed pmeenan closed 5 months ago

pmeenan commented 6 months ago

There are currently two cases where the sha-256 hash of the compression dictionary is sent in HTTP headers. In the Available-Dictionary request header when a client is advertising a dictionary available for a given request and the Content-Dictionary response header when the server responds with a dictionary-compressed resource.

In the original draft and Chrome's first origin trial, the hash format was a hex string representation of the hash.

It was changed in PR #2680 to use the Structured Fields Byte Sequence representation (base64-encoded, with colons on either side).

The hex string tended to be easier for developers to work with because it is filesystem-safe, so the hash could be part of the file name, but it is also larger in HTTP headers and not the correct representation for a binary field.

This issue pulls together the various pieces of feedback to see if there is something that should be done, or if the developer workflow can be simplified when working with the base64 hashes.
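
For concreteness, a minimal Python sketch (the dictionary file name is a placeholder) of the two representations being discussed for the same dictionary:

import base64, hashlib

# Hash the dictionary file ("dictionary.bin" is a placeholder name).
digest = hashlib.sha256(open("dictionary.bin", "rb").read()).digest()

hex_form = digest.hex()                                   # original draft / Chrome's first origin trial
sf_form = ":" + base64.b64encode(digest).decode() + ":"   # Structured Fields Byte Sequence per PR #2680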

pmeenan commented 6 months ago

From @yoavweiss in a comment on the PR:

I think this change adds significant complexity, and I'm not sure regarding its benefits.

I very recently played around with prototyping a compression dictionaries deployment and was very happy with the simplicity of it:

- You generate diff files at build time with some naming convention
- Then at serving time, you configure your routing logic (similarly to how @pmeenan did in https://github.com/httpwg/http-extensions/pull/2680#issuecomment-1810759792) to those diffs
- That can be done today in almost any deployed routing layer.

Once this change lands in implementations, a similar deployment would require me to add custom logic to transform that header value to something that is file system friendly. It doesn't matter if that transformation is to base64 decode the binary data and then to hex encode it, or to 'just.. strip a few characters and remap "/"'. It is an operation that is not necessarily supported in many layers that developers currently operate in.

To take just one example - if I were to implement this as a Cloudflare transform rule, I'm not even sure it's feasible to do that. (although it might be, with a creative regex_replace).

Can you elaborate on the advantages of moving to base64, beyond theoretical purity?

pmeenan commented 6 months ago

From @nhelfman on the listserv:

I’ve been dedicating a considerable amount of time to incorporating compression dictionaries into our workflow. I must express that it’s rather challenging to utilize the encoded sf-binary format of the hash in the header.

The hex hash string is prevalent and is employed in various instances during troubleshooting. This includes the file path, SRI attributes, file name, the dictionary page in Chrome, among others. Given that the sf-binary is encoded as base64, which isn’t valid in a URL or filename, it necessitates frequent re-encoding to be functional. Furthermore, when the multi-encoded value needs to be cross-checked for accuracy against another resource, which is a standard hex string, I find myself constantly removing the colons and decoding it for swift visual inspection or analysis.

While it’s accurate that sha-256 is a byte array, the hex string representation of sha-256 has become the universally accepted method of handling this type of data. Therefore, I strongly advocate for maintaining this approach.

pmeenan commented 6 months ago

One thing that came to mind when I was considering options was for the developer tooling to use base64url encoding for the hash (and strip the padding).

It's the same as base64 encoding but with + replaced by - and / replaced by _, which makes the value both URL- and filesystem-safe.

In the dev tools case of looking at request headers, it would be easy to see that the correct dictionary is being sent, and the case of serving a file from disk could be handled with a simple string substitution instead of base64-decoding and then hex-encoding the hash.
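
A minimal sketch of that substitution in Python (the function name is illustrative):

def filesystem_safe_hash(available_dictionary_value: str) -> str:
    # ":<base64>:" from the header -> unpadded base64url, safe for file names
    b64 = available_dictionary_value.strip(":").rstrip("=")
    return b64.replace("+", "-").replace("/", "_")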

nhelfman commented 5 months ago

I can share some more input on this.

Since we have encountered scenarios where we had to use case-insensitive file names, using any form of base64 (even with the character replacement described above) couldn't work, since base64 is a case-sensitive encoding.

We attempted to decode the base64 and then convert it to a hex string at the CDN level but ran into issues. I'm not saying it cannot be done, but it is certainly not trivial.

We ended up base32-encoding the hash (base32 is case-insensitive), which IMO feels like a hack where a simple sha256 hex string could have been used for the file name.
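
For reference, a rough Python sketch of that kind of workaround (the function name and the lowercasing choice are illustrative):

import base64

def case_insensitive_name(available_dictionary_value: str) -> str:
    # Recover the raw digest from the header value, then base32-encode it
    # so the result survives case-insensitive filesystems.
    digest = base64.b64decode(available_dictionary_value.strip(":"))
    return base64.b32encode(digest).decode().rstrip("=").lower()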

pmeenan commented 5 months ago

@martinthomson @reschke @davidben how strongly do you all feel about the Available-Dictionary hash being represented as a Structured Fields Byte Sequence vs a lowercase hex-encoded token (from the PR discussion here)?

Every participant (literally, every one) in Chrome's origin trial that has provided feedback has mentioned that working with the sf-binary headers was more complex than the hex-encoded headers. They all got it working, but it has been a fair bit of friction.

As best as I can tell, the vast majority of command-line sha256 tools and language libraries either default to or at least offer a lowercase hex representation of the output. It may not be the raw digest of the hash, but the hex string is used pretty universally, particularly when verifying file contents.

CLI tools

Linux sha256sum:

$ sha256sum draft-ietf-httpbis-compression-dictionary.md
15c7dced1dc18aa0d362253d572ce9155aac9534cf7a5014ef20c719bd1f5f80  draft-ietf-httpbis-compression-dictionary.md

macOS shasum:

$ shasum -a 256 draft-ietf-httpbis-compression-dictionary.md
15c7dced1dc18aa0d362253d572ce9155aac9534cf7a5014ef20c719bd1f5f80  draft-ietf-httpbis-compression-dictionary.md

OpenSSL CLI:

 $ openssl dgst -sha256 draft-ietf-httpbis-compression-dictionary.md
SHA2-256(draft-ietf-httpbis-compression-dictionary.md)= 15c7dced1dc18aa0d362253d572ce9155aac9534cf7a5014ef20c719bd1f5f80

Windows certutil:

> certutil -hashfile draft-ietf-httpbis-compression-dictionary.md SHA256
SHA256 hash of draft-ietf-httpbis-compression-dictionary.md:
15c7dced1dc18aa0d362253d572ce9155aac9534cf7a5014ef20c719bd1f5f80
CertUtil: -hashfile command completed successfully.

Libraries

- PHP hash() or hash_file() - defaults to lowercase hex, optional binary.
- Python hashlib - offers digest() for binary or hexdigest() for lowercase hex.
- Node.js crypto hash.digest() - defaults to a buffer, offers "hex" as an available encoding.

Most c/c++ code generally operates on the binary digest and requires a manual hex conversion.

In all of these cases except for Node.js, there is an extra conversion step from binary to base64 encoding (in addition to any normalization that people decide to do).
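
To illustrate that extra step, a sketch of converting between the lowercase hex these tools print and the Structured Fields value the headers carry (using the hash from the CLI examples above):

import base64

hex_hash = "15c7dced1dc18aa0d362253d572ce9155aac9534cf7a5014ef20c719bd1f5f80"

# hex (CLI/library default) -> SF Byte Sequence header value
sf_value = ":" + base64.b64encode(bytes.fromhex(hex_hash)).decode() + ":"

# SF Byte Sequence header value -> hex (for comparing against build output or logs)
assert base64.b64decode(sf_value.strip(":")).hex() == hex_hash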

mnot commented 5 months ago

Nit - if it starts with a digit, it's not a Token; you'll need to either always use a String, or say "use a Token when possible, otherwise a String".


LPardue commented 5 months ago

Just as some additional reference points:

Nobody complained when digest fields (RFC 9530) decided to use SF Byte Sequence to convey the byte output of the hashing. The context is different, of course. Some of the supported algorithms had their own canonical outputs, leading to inconsistencies in 3230. That's a problem we fixed that doesn't seem to exist here.

SRI uses base64 - https://w3c.github.io/webappsec-subresource-integrity/#grammardef-base64-value. Hence I'm confused by the earlier comment stating hex is prevalent in SRI.

Most libs I'm familiar with only deal with inputs and outputs as byte strings. CLI tools naturally need to use something that is safe for a CLI environment. But they tend to provide a trivial way to produce binary to pipe to something else. The SRI draft has OpenSSL as an example - https://w3c.github.io/webappsec-subresource-integrity/#example-cc8e7f02.

A common source of user error in dealing with hashes, in my experience, is double-encoding of values due to variation in tools' default output formats. For instance, running sha256sum and then base64-encoding that output, which is obviously not what anyone wants. Whatever we pick as the format, we should stress the interop needs of hashing and encoding the correct bytes.
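
A small Python sketch of that failure mode (variable names are illustrative); the wrong path base64-encodes the ASCII hex text instead of the raw digest:

import base64, hashlib

digest = hashlib.sha256(b"example dictionary contents").digest()

correct = base64.b64encode(digest)               # base64 of the 32 raw bytes
wrong = base64.b64encode(digest.hex().encode())  # base64 of the 64-character hex string
assert correct != wrong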

mnot commented 5 months ago

One more thing -- if folks are using a SF library (and they really should be), it'll be exposing the raw binary to them, not the base64. So what's really happening here?

pmeenan commented 5 months ago

One more thing -- if folks are using a SF library (and they really should be), it'll be exposing the raw binary to them, not the base64. So what's really happening here?

In the case of delta-compressed static resources, once the server sees the Available-Dictionary request header, it needs to check whether a version of that resource is available that was compressed using the same dictionary. In a lot of cases, that means looking for a version of the file (in cloud storage, cache or filesystem) that has the dictionary hash in the file name. At that point, a filesystem-safe ASCII representation of the hash needs to be used, and in the case of Windows, possibly a case-insensitive one.

In most cases, right now that is being done by hex-encoding the sf-binary data (if using a structured-fields-aware header parser), or doing something like binhex(base64decode(value.strip(":"))), or, in @nhelfman's case above, base32-encoding the base64 header value.
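
The second of those approaches, written out as a short Python sketch (the file-naming convention in the comment is a placeholder):

import base64

def hex_from_header(available_dictionary_value: str) -> str:
    # Equivalent to binhex(base64decode(value.strip(":"))) above.
    return base64.b64decode(available_dictionary_value.strip(":")).hex()

# e.g. then look on disk for something like "app.js." + hex_from_header(value) + ".dcb"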

Also, as mentioned above, when manually looking at the headers and matching them against access logs, a transformation of some kind needs to be done instead of being able to visually verify the file name directly.

If the tooling that generates the delta-compressed files used the base64url-encoded version of the hash as the ASCII representation, it would solve a lot of those problems, as it's a simple character substitution from the sf-binary base64 string and easy to visually compare. It's a little more work on the compression side, since hex tends to be the native ASCII representation for a lot of the tools, but it's not as much work as converting the inbound header to the hex representation.

I don't know the digest fields use case as well, but I'm wondering if maybe the difference in feedback comes from the users that the header will be exposed to and processed by. In this case, at least until it is handled automatically by CDNs and servers, the developer application servers are the ones processing the header and matching the responses, so it's exposed fairly high in the stack.

It's not a dealbreaker for sure, just a consistent piece of feedback that I wanted to make sure we took into account when deciding. I'd be OK with documenting all of the examples using base64url strings in the file names if that's what it comes down to, though that doesn't help with the case-insensitivity need on Windows.

pmeenan commented 5 months ago

One more thing -- if folks are using a SF library (and they really should be), it'll be exposing the raw binary to them, not the base64. So what's really happening here?

Hardly any dev-facing application servers that I'm aware of are using an SF library. Most are either pulling one in for this case or manually parsing the sf-binary field (i.e. PHP, Node, Java, etc.). It looks like there may be projects for the various languages but I'm not sure how robust they are.

pmeenan commented 5 months ago

For example, one workers-based implementation:

I don't think the Workers API provides SF-parsed header values or a built-in way to parse them, so it's either pull in a full SF parser (if you can find one you trust) or do string manipulation to convert the request header into something filesystem-safe (and validated).
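
For illustration (in Python rather than a Workers runtime; the regex assumes a 32-byte SHA-256 digest, which base64-encodes to 43 characters plus one '=' of padding):

import re

SF_SHA256 = re.compile(r"^:([A-Za-z0-9+/]{43}=):$")

def safe_name_or_none(available_dictionary_value):
    # Validate the header shape before using it in a file path, then remap
    # it to an unpadded base64url-style, filesystem-safe string.
    m = SF_SHA256.match(available_dictionary_value)
    if not m:
        return None
    return m.group(1).rstrip("=").replace("+", "-").replace("/", "_")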

mnot commented 5 months ago

FYI: https://github.com/httpwg/wiki/wiki/Structured-Fields

pmeenan commented 5 months ago

Thanks. I can see how a little pain now to drive further adoption of SF parsing libraries in app code could help move header parsing forward.

I'll close this out and leave things as they are, and will try to make sure any articles or how-tos include a reference to the parser libraries.

LPardue commented 5 months ago

I don't know the digest fields use case as well, but I'm wondering if maybe the difference in feedback comes from the users that the header will be exposed to and processed by. In this case, at least until it is handled automatically by CDNs and servers, the developer application servers are the ones processing the header and matching the responses, so it's exposed fairly high in the stack.

The use cases and constituents are very different, so take what I said with a pinch of salt.

I just checked and RFC 3230 always required md5 and SHA algos to use base64, so I suspect that's why we decided to unify on that.

Digest use cases aren't using hashes to do indexing as described on this ticket. I suspect a few cases of digest usage store computed hashes as sidecar data that they pull up while parsing a request. In cases where Digest was used like Repr-Digest, the value would have been dependent on e.g. the content encoding.

It's my understanding that web servers already make some optimizations around pre-computing representations and deciding which to serve based on request headers. It doesn't seem that different to deal with the header proposed in this spec in a similar way. Are folks suggesting they really rewrite all their URLs in middleware to append a file extension based on content negotiation?

pmeenan commented 5 months ago

It's my understanding that web servers already make some optimizations around pre-computing representations and deciding which to serve based on request headers. It doesn't seem that different to deal with the header proposed in this spec in a similar way. Are folks suggesting they really rewrite all their URLs in middleware to append a file extension based on content negotiation?

I expect more will do it in the app server than in middleware (and the real middleware implementation is more involved and implements caching), but fundamentally the logic is the same: decode the header and check for a file on the filesystem that has the hash in the file name. Frequently, the assets will be pre-compressed during a build/release stage (for the static-file delta case).

As for servers picking a representation based on headers, it doesn't matter what the header looks like if they are the ones also compressing the asset.

For cases similar to gzip and brotli where the server (at least Nginx and I think Apache) will automatically look for .gz and .br versions of a file based on the Accept-Encoding header and serve pre-compressed assets, I assume the difficulty there would be the same. There needs to be a filesystem-safe version of the hash that the implementations standardize on and it's a lot easier to explain to people when the string on the filesystem matches the string in the request header without needing to perform transformations on it.