iipc / warc-specifications

Centralised repository for WARC usage specifications.
http://iipc.github.io/warc-specifications/

Clarify whether Transfer-Encoding can or should be preserved #22

Closed anjackson closed 6 years ago

anjackson commented 9 years ago

As I fiddle about with a WARC-writer, I've come across something which I think would benefit from a bit of clarification. Section 6.3.2 of the current version of the spec says:

The payload of a ‘response’ record with a target-URI of scheme ‘http’ or ‘https’ is defined as its ‘entity-body’ (per [RFC2616]), with any transfer-encoding removed. If a truncated ‘response’ record block contains less than the full entity-body, the payload is considered truncated at the same position.

In context, it is not clear whether this is simply clarifying what the definition of the payload is (for the purpose of digest calculation), or whether it is proposing that the response should be stripped of any Transfer-Encoding.

I imagine we do want to preserve the Transfer-Encoding where possible. So for example, a recent bug was fixed in wget because it was producing WARC records like this one, where the response header says Transfer-Encoding: chunked but where the response was included directly (not chunked). This has been changed, and wget now produces records like this one where the chunked encoding can be seen in the response.

I think we would all agree that this is the right approach, and that we should preserve the bytes that came over the wire, whatever transfer coding was used. If so, I'd like to make this clearer in the spec or in accompanying documentation. This should also make it clear that the WARC-Payload-Digest refers to the digest of the payload stripped of any Transfer-Encoding, whereas the WARC-Block-Digest refers to the digest of the whole HTTP response, including any Transfer-Encoding of the entity body.

gojomo commented 9 years ago

IIRC, this "defined as" means that when you say 'payload', you mean the entity-body without-transfer-encoding.

It doesn't mean that the WARC bytes themselves have had the TE removed. That is: still (always!) store the response verbatim, but when describing/accessing/checksumming a 'payload', the TE should be removed.

I also believe this matches the practice of the HTTP specs: the 'entity-body' is what's left after TE has been decoded-away, not the bytes-including-TE-cruft. (Perhaps it'd be clearer to emphasize that for HTTP 'response' records, WARC 'payload' is exactly HTTP 'entity-body', not HTTP 'message-body', per http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.3 .)
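To make the distinction concrete, here is a minimal Python sketch (not taken from the WARC spec or from any existing tool) that starts from the verbatim bytes of a stored response and derives the 'payload' by decoding a chunked transfer-encoding. The function names `split_http_message`, `is_chunked`, `dechunk` and `payload_of` are hypothetical, and the decoder is deliberately simplified: it handles only the `chunked` coding and ignores trailers.

```python
from typing import Tuple


def split_http_message(raw: bytes) -> Tuple[bytes, bytes]:
    """Split a raw HTTP/1.x message into (header block, message-body)."""
    head, sep, message_body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no header/body separator found")
    return head, message_body


def is_chunked(head: bytes) -> bool:
    """True if the header block declares a chunked Transfer-Encoding."""
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"transfer-encoding":
            return b"chunked" in value.lower()
    return False


def dechunk(message_body: bytes) -> bytes:
    """Decode the 'chunked' transfer coding (trailers are ignored)."""
    out = bytearray()
    pos = 0
    while True:
        eol = message_body.index(b"\r\n", pos)
        # chunk-size is hexadecimal; drop any chunk extension after ';'
        size = int(message_body[pos:eol].split(b";", 1)[0].strip(), 16)
        if size == 0:
            break
        start = eol + 2
        out += message_body[start:start + size]
        pos = start + size + 2  # skip the CRLF that ends the chunk data
    return bytes(out)


def payload_of(raw_response: bytes) -> bytes:
    """Entity-body of a stored response: message-body minus transfer coding."""
    head, message_body = split_http_message(raw_response)
    return dechunk(message_body) if is_chunked(head) else message_body


# The record block keeps these bytes exactly as received.
raw = (b"HTTP/1.1 200 OK\r\n"
       b"Transfer-Encoding: chunked\r\n"
       b"\r\n"
       b"5\r\nhello\r\n"
       b"6\r\n world\r\n"
       b"0\r\n\r\n")

payload = payload_of(raw)  # b"hello world"
```

The stored bytes in `raw` are never modified; the decoding happens only when something asks for the payload.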

saraaubry commented 9 years ago

The following changes have been integrated in the revised ISO draft during the ISO working group meeting on November 16-17, 2015:

Idem in 6.4.2 and 6.5.2.

nlevitt commented 8 years ago

The new text still isn't entirely clear to me. Is it just a definition of the word "payload" (as @gojomo was suggesting), or is it specifying what should appear in the warc record? My impression is the latter.

Can I suggest we replace the text "‘entity-body’ (per [RFC2616]), with any transfer-encoding" with "‘message-body’ (per [RFC2616])"?

nlevitt commented 8 years ago

I also want to comment on the WARC-Payload-Digest issue. @anjackson wrote:

This should also make it clear that the WARC-Payload-Digest refers to the digest of the payload stripped of any Transfer-Encoding

That actually doesn't seem clear at all. If I understand @saraaubry's comment correctly, the spec intends to define the "payload" as the RFC2616 message-body, i.e. not stripped of any Transfer-Encoding.

On the other hand, section 5.9 WARC-Payload-Digest says "The payload of an application/http block is its ‘entity-body’ (per [RFC2616])".

Clearly these two sections need harmonization.


If WARC-Payload-Digest is meant to be the digest of the RFC2616 entity-body, there are subtleties in playback of revisit records that need to be considered. When playing back an "identical-payload-digest" revisit record, the http headers from the revisit record are played back, with the content from the original record. Therefore, playback software needs to correctly handle the case where the original record and the revisit record have different Transfer-Encoding (e.g. one chunked, one not). This might involve chunking an unchunked payload, or unchunking a chunked payload, or adding or removing a Transfer-Encoding header.
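A rough Python sketch of the re-chunking case follows. It assumes the playback software has already decoded the original record's body to its entity-body, and that the revisit record's HTTP headers are available as a dict with lower-cased names. Both `chunk` and `playback_body` are hypothetical helpers, not part of any existing replay tool.

```python
def chunk(entity_body: bytes, chunk_size: int = 8192) -> bytes:
    """Apply the 'chunked' transfer coding to an entity-body."""
    out = bytearray()
    for i in range(0, len(entity_body), chunk_size):
        piece = entity_body[i:i + chunk_size]
        # hex chunk-size line, chunk data, then the terminating CRLF
        out += b"%x\r\n" % len(piece) + piece + b"\r\n"
    out += b"0\r\n\r\n"  # last-chunk with no trailers
    return bytes(out)


def playback_body(revisit_headers: dict, original_entity_body: bytes) -> bytes:
    """Encode the original payload so it matches the revisit record's headers.

    Assumption: revisit_headers maps lower-cased header names to values.
    """
    transfer_encoding = revisit_headers.get("transfer-encoding", "").lower()
    if "chunked" in transfer_encoding:
        return chunk(original_entity_body)
    return original_entity_body
```

For example, `playback_body({"transfer-encoding": "chunked"}, b"hello")` yields `b"5\r\nhello\r\n0\r\n\r\n"`, while an empty header dict returns the body unchanged. The other option mentioned above, removing the Transfer-Encoding header and serving the decoded body, avoids re-chunking but changes the recorded headers.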

gojomo commented 8 years ago

While the confusion suggests that perhaps the original WARC specs could have been clearer, I believe that carefully read, they were unambiguous, especially in the case of HTTP 'response' records. The record is to be the exact same bytes as the HTTP spec's 'Response' message:

For a target-URI of the 'http' or 'https' schemes, a 'response' record block should contain the full HTTP response received over the network, including headers. That is, it contains the 'Response' message defined by section 6 of HTTP/1.1 (RFC2616), or by any previous or subsequent version of HTTP compatible with the section 6 of HTTP/1.1 (RFC2616).

There's no allowance there for decoding (or 'removing') any Transfer-Encodings. It's what was delivered over HTTP.

The definition of 'payload' is separate, and only meaningful in the context of other headers. ('Payload' is not used in the spec to describe how WARC records should be constructed, only how they should be interpreted.)

I'm sorry I hadn't noticed @saraaubry's report of the November 2015 edits previously, but my interpretation is that those changes don't clear up the confusion (and could be interpreted as reversing the original intent and contradicting other practices).

As a clarifying example, two responses with different transfer-encodings (perhaps even because of different 'chunked' boundaries), but which (when the transfer-encoding is 'removed') result in the exact same 'entity-body' (per HTTP), should have the same WARC-Payload-Digest. But if the 'payload' is said to be the bytes with the transfer-encoding retained, that won't be the case. The desirable referential qualities of 'payload' require it to refer to the post-transfer-decoded 'entity-body'.
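This example can be sketched in Python. The snippet builds two responses that carry the same entity-body with different chunk boundaries, then compares a digest over the whole block with a digest over the transfer-decoded payload. The helpers `chunk`, `dechunk` and `sha1_label` are hypothetical, the decoder ignores trailers, and the `sha1:` plus Base32 label is an assumption based on common WARC practice rather than something the thread specifies.

```python
import base64
import hashlib


def chunk(body: bytes, size: int) -> bytes:
    """Chunked-encode body using fixed-size chunks."""
    pieces = [body[i:i + size] for i in range(0, len(body), size)]
    encoded = b"".join(b"%x\r\n" % len(p) + p + b"\r\n" for p in pieces)
    return encoded + b"0\r\n\r\n"


def dechunk(message_body: bytes) -> bytes:
    """Remove the chunked coding (simplified: no trailers)."""
    out, pos = bytearray(), 0
    while True:
        eol = message_body.index(b"\r\n", pos)
        size = int(message_body[pos:eol].split(b";", 1)[0], 16)
        if size == 0:
            return bytes(out)
        out += message_body[eol + 2:eol + 2 + size]
        pos = eol + 2 + size + 2  # step over chunk data and its CRLF


def sha1_label(data: bytes) -> str:
    """Labelled digest string, e.g. 'sha1:<BASE32>' (assumed convention)."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


head = b"HTTP/1.1 200 OK\r\nTransfer-Encoding: chunked\r\n\r\n"
body = b"The quick brown fox jumps over the lazy dog"

block_a = head + chunk(body, 4)    # many small chunks
block_b = head + chunk(body, 16)   # fewer, larger chunks

# Digest of the bytes as stored: differs with the chunk boundaries.
block_digest_a = sha1_label(block_a)
block_digest_b = sha1_label(block_b)

# Digest of the transfer-decoded entity-body: identical for both.
payload_digest_a = sha1_label(dechunk(block_a.partition(b"\r\n\r\n")[2]))
payload_digest_b = sha1_label(dechunk(block_b.partition(b"\r\n\r\n")[2]))

print(block_digest_a == block_digest_b)      # False
print(payload_digest_a == payload_digest_b)  # True
```

Only the payload digest is stable across the two captures, which is the property that lets one record refer to another record's content.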

saraaubry commented 8 years ago

I think we should stick to entity body, as @gojomo suggested. According to the RFC 2616:

The entity-body (if any) sent with an HTTP request or response is in a format and encoding defined by the entity-header fields. entity-body = *OCTET An entity-body is only present in a message when a message-body is present, as described in section 4.3. The entity-body is obtained from the message-body by decoding any Transfer-Encoding that might have been applied to ensure safe and proper transfer of the message.

Sticking to this definition, we could just leave out the words "with any transfer-encoding removed".

In section 6.3.2: The payload of a ‘response’ record with a target-URI of scheme ‘http’ or ‘https’ is defined as its ‘entity-body’ (per [RFC2616]), with any transfer-encoding removed. If a truncated ‘response’ record block contains less than the full entity-body, the payload is considered truncated at the same position.

saraaubry commented 6 years ago

Included in WARC 1.1