httpwg / http-extensions

HTTP Extensions in progress
https://httpwg.org/http-extensions/

Essential content coding metadata: header or body? #2770

Closed. martinthomson closed this issue 1 week ago

martinthomson commented 1 month ago

This is a discussion we've had several times when defining content codings, but it seems like there is never really a single answer.

Should the content coding be self-describing, or can it rely on metadata in fields?

The compression dictionary work uses header fields to identify which compression dictionary is in use. Originally, the client would indicate a dictionary and the server would indicate use of that dictionary by choosing the content coding. This made interpreting the body of the response quite challenging in that you needed to have a request in order to make sense of it.

More recently, the specification has changed to having the client list available dictionaries, with the server echoing the one it chooses. Both use header fields.
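
A rough sketch of that header-based exchange (the value serialization shown here is an assumption, not quoted from the draft):

GET /app.v2.js HTTP/1.1
Available-Dictionary: :<base64 SHA-256 of the dictionary>:

HTTP/1.1 200 OK
Content-Encoding: br-d
Content-Dictionary: :<base64 SHA-256 of the dictionary>: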

There is a third option, which is to embed the dictionary identification (which is a hash of the dictionary) ahead of the compressed content. This has some real advantages:

  1. Requests could use delta encoding. There is - of course - a real question about availability of the dictionary, but we have that problem with encrypted content codings and keys. That is a problem we solve for responses with a field. It might be solved using the same technique, combined with the client-initiated content coding advice. Or applications can use their own systems.
  2. A field costs a lot more bits. Embedding would not benefit from header compression, so it might be a net loss in some cases, but in many cases it would be a distinct win because 32 bytes of hash in the body of a request is far less than a field name and a field value containing a base64 encoding of those same 32 bytes (see the rough comparison after these lists).

It also comes with disadvantages:

  1. Header compression could remove a lot of the cost in bytes.
  2. It might be ever-so-slightly more complex for encoding and decoding to have to split the first 32 bytes off the body of a message.
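
As a rough size comparison (assuming the field is Content-Dictionary, carried as a base64-coded byte sequence): the field name alone is 18 bytes and base64 of a 32-byte hash is 44 characters, so the field costs roughly 65-70 bytes uncompressed per message, versus exactly 32 bytes of raw hash when embedded in the body. Header compression (disadvantage 1 above) can claw much of that back when the same field value repeats.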

pmeenan commented 1 month ago

On the encoding side, the main downside is that the CLI tooling for brotli and Zstandard doesn't currently do the embedding, so tooling would have to be added both to prepend the hash to the files in a way that isn't standard for either (yet, anyway) and to manually decompress the files.

Zstandard has identifiers for the dictionaries when using non-raw dictionaries, but both formats assume that raw dictionaries will be negotiated out of band.

Technically it would be a pretty trivial modification for clients and servers that are doing the work; I'm just a bit concerned about the developer experience changes (and whatever needs to be done to get both brotli and Zstandard to understand what amounts to new file formats).

pmeenan commented 1 month ago

I opened issues with brotli and Zstandard to see if they would consider adding it to their respective file formats. If it's an optional metadata tag that is backwards compatible with existing encoders and decoders, I could see it providing quite a bit of value, even in the non-HTTP case of dictionary compression.

pmeenan commented 1 month ago

There was some discussion in the Zstandard repo about possibly reserving one of the skippable frame magic numbers for embedding a dictionary ID, but there's some risk of collision with people who may be using those frames for watermarking or other application-specific use cases.

As best as I can tell, the brotli stream format doesn't have a similar frame capability for metadata or application-specific data.

We could create a container format that held the dictionary ID and the stream (basically a header, not unlike zip vs. deflate), but that feels like a fairly large effort and the tooling would have to catch up to make it easy for developers to work with.

At this point I'm hesitant to recommend adding anything to the payload itself that the existing brotli and zstd tooling can't process. Being able to create, fetch and test the raw files is quite useful for the developer workflow and for debugging deployments.

Would it make sense to allow for future encoding formats to include a dictionary ID in the file itself and make the header optional in those cases (and make the embedded ID authoritative)?

I'm not sure whether that belongs in this draft, since this one is limited to the two existing encodings and it could be addressed in a new draft when new encodings are created, or whether it makes sense to allow for it here without requiring that zstd-d and br-d use embedded IDs.

martinthomson commented 1 month ago

I don't think that you want to change the zstd or brotli format, only the content coding. That is, something like this:

def decode(body):
    dict_hash = body[:32]
    dictionary = lookup_dict(dict_hash)
    return decompress(body[32:], dict=dictionary)

This does partly work against the idea that you might have a bunch of files that contain compressed versions of content. You can't just say brotli -k --dictionary whatever file -o file-$(sha256sum whatever | cut -f 1 -d' ' -).br. And you can't just decompress without some processing. What you get in exchange is a message payload with the advantages I described.

pmeenan commented 1 month ago

I do think we need to define a file format for it if we go down this path, to ease adoption, and it should probably have a magic signature at the beginning. Maybe something like what gzip is to deflate, but with a simple 3-byte signature, followed by the hash, followed by the stream data.
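
A minimal sketch of that layout, assuming a hypothetical signature value (nothing has actually been picked here):

import hashlib

MAGIC = b"DCB"  # hypothetical 3-byte signature; the actual value is undecided

def wrap(dictionary: bytes, compressed_stream: bytes) -> bytes:
    # 3-byte signature + raw 32-byte SHA-256 of the dictionary + compressed stream
    return MAGIC + hashlib.sha256(dictionary).digest() + compressed_stream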

Assuming we create a cli tool to do all of the work for compressing/decompressing them, I'll ping several of the current origin trial participants to see how it will fit into their workflow.

I'm assuming something like this would be better off as its own draft that this one references, or do you think it makes sense to define it here?

I agree there are significant benefits to having the hash paired directly with the resource. I just want to be careful to make sure whatever we do fits in well with the developer workflow as well.

nhelfman commented 1 month ago

I just want to be careful to make sure whatever we do fits in well with the developer workflow as well.

Adding the header would add some work to a typical CI workflow. At least in my case, the diff file stream is created using the brotli CLI with the -Z -D options. Adding the header would require creating a new file with the header and appending the generated stream.

If this can be done in a published script it would simplify the logic (FYI - node.js currently does not yet support brotli bindings, see https://github.com/nodejs/node/issues/52250), but it's not a requirement IMO.

Do I understand correctly that with this idea implemented the Content-Dictionary header would not be necessary anymore? That would simplify some CDN configurations which currently need to add it.

Related to this: what would the Content-Encoding header value be in this case? The current spec says it should be br-d, which I understand as a brotli dictionary stream. If we add the header, how would a client be able to determine which one it is? Should we add another content encoding for this?

pmeenan commented 1 month ago

Yes, this would eliminate the need for the Content-Dictionary header.

The Content-Encoding would be br-d, and the definition of br-d would be changed to reference the DCB stream format (a brotli stream with the header prefix). There would be no content-encoding support for a bare brotli dictionary-compressed stream. At that point, maybe renaming them to dcb and dcz for the content encodings would also make things clearer.
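
As an illustration (paths and values are placeholders, and this assumes the dcb rename), the exchange would need no Content-Dictionary response header because the hash rides in the body:

GET /app.v2.js HTTP/1.1
Available-Dictionary: :<base64 SHA-256 of the dictionary>:

HTTP/1.1 200 OK
Content-Encoding: dcb

<magic signature><32-byte dictionary SHA-256><brotli-compressed stream>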

horo-t commented 1 month ago

I have a question.

If we can use the new dcb and dcz in the Content-Encoding header, why do we need to have the "3-byte signature indicating hash and compression type" in the response body?

pmeenan commented 1 month ago

It's technically not required, but it makes it safer and easier to operate on the files outside of the HTTP case.

For example, here's some discussion on the brotli issue tracker from 2016 asking for the same: https://github.com/google/brotli/issues/298

martinthomson commented 1 month ago

I'm assuming something like this would be better off as it's own draft that this references or do you think it makes sense to define it here?

I don't have a problem with doing that in this document, if that is the way people want to go. I personally don't think that a new format is needed because this is a content-encoding, not a media type. But if a media type (and tooling) helps with the deployment of the content-encoding, then maybe that is the right thing to do.

Either way, I don't think that you should make that decision (it's a significant one) without having a broader discussion than just this issue.

yoavweiss commented 1 month ago

  1. Requests could use delta encoding.

@martinthomson - could you help me understand this one? (mainly, why wouldn't we be able to apply delta encoding in the absence of an in-body hash?)

This will definitely add complexity (e.g. the need to define a new file format to wrap the dictionary-compressed content, along with the required tooling). It's not currently clear to me what the advantage of this would be. Do we predict the use of this mechanism outside of an HTTP flow? If so, concrete examples would be helpful.

pmeenan commented 1 month ago

@yoavweiss is the complexity you are concerned about limited to tooling and spec process or do you also see it as being more complex after we are at a good state with tooling?

Assuming the brotli and Zstandard libs and cli tools have been updated to add a flag for "embedded dictionary hash" format streams, does one start to look better?

For me, the hash being embedded in the stream/file removes fragility from the system. It blocks the decode of the file with the wrong dictionary and enables rebuilding of the metadata that maps dictionaries to compressed files if, for some reason, that metadata got lost (file names truncated, etc).

It also feels like it simplifies the serving path a bit, bringing us back to "serve this file" based on the Available-Dictionary request header vs "serve this file and add this http response header".

On the size side of things, I expect it will likely be a wash. Delta-compressed responses will be a few bytes smaller because the header (name and value) is larger than the file header (tens of bytes, not huge by any means). In the dynamic-resource case, where multiple responses re-use the same dictionary, the header can be compressed away with HPACK/QPACK, making the header case a bit smaller.

I don't think it's a big change in complexity/fragility one way or the other, but it does feel like there are fewer moving pieces once the tooling is taken care of and the file contents themselves specify the dictionary they were compressed with. The need for tooling changes would delay adoption a little bit, so it's not a free decision, but I want to make sure we don't sacrifice future use cases and simplicity for an easier launch.

martinthomson commented 1 month ago

I'm not sure that I see the tooling process as critical, relative to the robustness. And performance will be a wash (though I tend to view the first interaction as more important than repeated interactions).

For tooling, if this content is produced programmatically, then there should be no issue. Integrated tooling can do anything. If content is produced and stored in files, then hash dictionary > output; compress content >> output seems fine. That's a one-way operation generally.

I don't see the definition of new media types as necessary as part of that. I don't see these files being used outside of HTTP, ever. Maybe you could teach the command-line decompression tools to recognize the format and complain in a useful way, but that's about the extent of the work I'd undertake there. You could do as @pmeenan suggests as well, which would be even more useful, but given the usage context, that's of pretty narrow applicability.

yoavweiss commented 1 month ago

I'm mostly concerned about official tooling and the latency of getting them to where developers need them to be.

Compression dictionaries already require jumping through some custom hoops, due to latency between brotli releases and adoption by the different package managers. At the same time, if y'all feel strongly about this, this extra initial complexity won't be a deal breaker.

pmeenan commented 1 month ago

I think there are enough robustness benefits that it is worth some short-term pain that hopefully we will all forget about in a few years.

On the header side of things, how do you all feel with respect to a bare 32-byte hash vs a 35-byte header with a 3-byte signature followed by the hash (or a 3-byte signature followed by a 1-byte header size followed by the hash to allow for changes as well as 4-byte alignment)?

It's possible I'm mentally stuck in the old days of sniffing content but since the hash can literally be any value, including accidentally looking like something else, I like the explicit nature of a magic signature at the beginning of the stream.

It essentially becomes echo "DCB" > output; hash dictionary >> output; compress content >> output (or whatever the signature is) so it doesn't add significantly to the complexity but it does add 3 bytes, potentially unnecessarily.

yoavweiss commented 1 month ago

+1 to adding a 3-byte magic signature if that's the route we're taking.

martinthomson commented 1 month ago

I'm ambivalent on the added signature, so I'll defer to others.

I can see how that might help a command-line tool more easily distinguish between this and a genuine brotli-/zstd-compressed file and do the right thing with it. On the other hand, it's 3 more bytes and - if the formats themselves have a magic sequence - the same tools could equally skip 32 bytes and check for their magic there.

felixhandte commented 1 month ago

I think I am having the same response as you all in the opposite direction, where to me it feels preferable to make it HTTP's problem so that my layer doesn't have to deal with additional complexity. :smiley:

But if I overcome that bias and accept that it would be nice to avoid an additional HTTP header, here's what I think:

If we used a Zstd skippable frame, that would change the stream overhead to 8 bytes (4 byte magic + 4 byte length) + 32 byte hash. But it would mean that existing zstd binaries would be able to decode the frames and avoid ecosystem fragmentation. That's a lot more attractive to me than a new format along the lines of "DCZ" + hash + zstd frame which (1) existing tools won't understand and (2) isn't a format I'd feel comfortable auto-detecting by default in the zstd CLI because DCZ isn't all that unlikely a beginning of a string in the way that \x28\xB5\x2F\xFD is (non-ASCII, invalid UTF-8).
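
For illustration, a sketch of that skippable-frame wrapping; the specific magic value 0x184D2A5E is an assumption (any of the reserved skippable-frame magic numbers could be chosen):

import hashlib
import struct

def dict_hash_skippable_frame(dictionary: bytes) -> bytes:
    digest = hashlib.sha256(dictionary).digest()   # 32-byte dictionary hash
    magic = struct.pack("<I", 0x184D2A5E)          # assumed skippable-frame magic number
    length = struct.pack("<I", len(digest))        # 4-byte little-endian frame content size
    return magic + length + digest                 # 8-byte header + 32-byte hash, prepended to the zstd frames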

And I've thought about it more and I'm actually not concerned about colliding a skippable frame type with someone else's existing use case. It would be more of a problem if we were trying to spec a universal solution, but if we scope this to just this content-encoding, then we're free to reserve whatever code points we want and attach whatever semantics we want.

pmeenan commented 1 month ago

Thanks. I guess the main question I have is whether there would be benefits to Zstandard itself in having the files be self-contained, with the identification of the dictionary they were encoded with, or whether the possibility of mismatching dictionaries on compression and decompression (or being able to find the matching dictionary given just the compressed file) are issues that are HTTP-specific.

I'm fine with specifying that the encoded files carry the dictionary hash (before the compressed stream data) and having different ways for Zstandard and Brotli to do the actual embedding.

That said, the tooling gets more complicated on the encode side, which has to generate the frame and insert it in the correct place in the compressed file. At decompression time, it makes the client more format-aware, having to parse more of the stream to extract the dictionary hash and then re-send the full stream through the decoder (at least until the decoder library becomes aware of the embedded dictionary hash).

martinthomson commented 1 month ago

@felixhandte

skippable frame

Is this really what you want in this case? The decoder needs to know where to find the dictionary, so wouldn't you want this to be a breaking change to the format, such that a decoder that has a dictionary is fine and a decoder that doesn't knows to go get one? ... And - importantly - an older decoder will abort. (I confess that I don't know what the zstd frame extension model is and didn't check.)

felixhandte commented 1 month ago

I don't think we're contemplating a model where Zstd can ingest a frame and figure out on its own where to find the dictionary it needs and then load it and use it. I expect that the enclosing application will parse this header, get the hash, and then find the dictionary and provide it to Zstd.

The advantage to using the skippable frame is that you can then provide the whole input to Zstd (including existing versions) and it will work, rather than having to pull the header off.

yoavweiss commented 1 month ago

One thing I realized now - by assuming that the hash length is 32 bytes, we're assuming the hash would remain SHA-256 forever. That might be how things play out, but it might also be the case that we'd need to change hashes at some point.

If we were to do that, having a fixed-length hash as part of the format would make things more complex.

martinthomson commented 1 month ago

Having a fixed-length hash (or fixed hash) as part of a content coding is perfectly fine. If there is a need to update hashes, it is easy to define a new content coding.

pmeenan commented 3 weeks ago

@felixhandte the current PR doesn't use skippable frames and uses the same custom header for Brotli and Zstandard (with different magic numbers).

I can switch to using a skippable frame instead (which effectively just becomes an 8-byte magic number since the frame length is always the same) but I'm wondering if it makes sense and is worth adding 4 bytes.

It won't help in creating the files, so it's just for decode time. The main benefit you get is that you can use the existing zstd CLI and libraries to decode the stream without first stripping the header, but those also won't verify the dictionary hash; they will just skip over it. That might not be a problem, but part of the decode process will be to fail the request if the hashes don't match.
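
A rough sketch of that decode-time check (the helper name and the custom-header layout are hypothetical):

def strip_and_verify(payload: bytes, expected_hash: bytes, magic: bytes) -> bytes:
    if payload[:len(magic)] != magic:
        raise ValueError("not a dictionary-compressed stream")
    header_len = len(magic) + 32
    if payload[len(magic):header_len] != expected_hash:
        raise ValueError("dictionary hash mismatch; fail the request")
    return payload[header_len:]   # the remainder goes to the brotli/zstd decoder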

pmeenan commented 3 weeks ago

In talking to the Brotli team, it looks like Brotli already embeds a hash of the dictionary and validates it during decode to select the dictionary to use.

It uses a "256-bit Highwayhash checksum", so we can't use that hash to look up a dictionary indexed by SHA-256, but we can use it to guarantee the decompression doesn't use a different dictionary (and the existing libraries and CLI tools already use it).

@martinthomson when you were concerned about the client identifying which dictionary was used by the server, was it for both lookup and validation or just validation?

I'm just wondering if we can use the existing brotli streams as they already exist or if we should still add a header to support locating the dictionary by SHA-256 hash.

martinthomson commented 2 weeks ago

There are a few places in HTTP where the interpretation of a response depends on something in the request. Every one of those turns out to be awful for generic processing of responses. That's my main reasoning, so I'd say both: lookup first, then validation.

That is, my expectation is not that the client has a singular candidate dictionary, or that it commits to one in particular. So in the original design, when you had the client pick one, that didn't seem like a great idea to me.

For validation, Highwayhash (NIH much?) doesn't appear to be pre-image resistant, so I'd be concerned if we were relying on the pre-image resistance properties of SHA-2. Can we be confident that this is not now subject to potential polyglot attacks if we relied on that hash alone? That is, could two clients think that they have the same resource when they do not?

pmeenan commented 2 weeks ago

I wouldn't be comfortable switching hashes, given the wider and proven use of SHA-256, and I agree on the lookup case to allow content encoding with different and multiple dictionary negotiations.

Looks like a separate header is still the cleanest, so the main remaining question is whether we use the same style of header for both or a skippable frame for Zstandard.

I also need to update the brotli citations to point to the shared dictionary draft format instead of the original brotli format.

ioggstream commented 2 weeks ago

Sorry for the late comment. It seems to me that conveying this information in the content eases the integration with Signatures and Digest.

It is not clear to me if there are still possible cases where the response does not contain all the information required for processing.

ftservice commented 2 weeks ago

Thanks