iipc / warc-specifications

Centralised repository for WARC usage specifications.
http://iipc.github.io/warc-specifications/

WARC Extensions for HTTP/2 proposal #41

Closed ato closed 6 years ago

ato commented 6 years ago

I've written up a formal version of the suggestion @ikreymer made in #15 for handling HTTP/2 records: encode them as HTTP/1 and add a new WARC-Original-Protocol header field. We should probably define how to record extra information of interest, such as server push events, but that isn't semantically necessary, so I think it can be done later, either in a future revision of this extension or in a separate document. I think it's important to define a basic mechanism like this sooner rather than later, as HTTP/2 is being widely adopted and tools like warcreate can't easily opt out the way crawlers can.

I changed the WARC-Original-Protocol value from HTTP/2.0 in Ilya's comment to HTTP/2 as that's the official name for the protocol. [Edit: Now "h2" and "h2c" as they're the official protocol identifiers.]

I also added some guidelines explaining how to handle the reason phrase in the response message correctly and clarifying that programs should not write "HTTP/2" in the HTTP message header itself as doing so makes the message invalid.
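As a rough illustration of the status-line point above, here is a minimal Python sketch. The function name `http1_status_line` is hypothetical, and using the standard reason phrase for the status code (with an empty phrase as fallback) is an assumption made for this example; the proposal's exact guidance may differ.

```python
from http import HTTPStatus


def http1_status_line(status_code: int) -> str:
    """Build an HTTP/1.1 status line for a response captured over HTTP/2.

    HTTP/2 carries no reason phrase, so this sketch (an assumption, not
    the proposal's wording) substitutes the standard phrase for the code.
    """
    try:
        reason = HTTPStatus(status_code).phrase
    except ValueError:
        # Unknown status code: leave the phrase empty. The space after
        # the status code is still required by the HTTP/1.1 grammar.
        reason = ""
    # The version written here is HTTP/1.1, never "HTTP/2".
    return f"HTTP/1.1 {status_code} {reason}"


print(http1_status_line(200))
print(http1_status_line(404))
```

The original protocol is then recorded only in the WARC header, so existing HTTP/1 parsers can still read the message.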

Note that adopting this would not preclude later defining some future way for storing the full binary h2 protocol if anyone has a use case or strong desire for doing so.

Another alternative that's been discussed is using conversion records. However, conversion records are really only intended for converting payloads and, as currently specified, don't really work for translating protocols. Using conversion records would also not provide compatibility with existing tools.

Please consider, comment and correct. Any errors are my own, not Ilya's. ;-)

ato commented 6 years ago

So HTTP/2 itself defines two protocol identifiers "h2" (HTTP/2 over TLS) and "h2c" (HTTP/2 cleartext). On second thought it might make sense to use these instead. They're used both in TLS ALPN and the HTTP/1 Upgrade header.
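A small sketch of how a writer might pick between the two identifiers when emitting the field. The helper names are hypothetical, and the sketch assumes the capture tool already knows whether the connection used TLS.

```python
def original_protocol_id(over_tls: bool) -> str:
    """Return the HTTP/2 protocol identifier for the connection type."""
    # "h2" is HTTP/2 over TLS, "h2c" is HTTP/2 over cleartext TCP.
    return "h2" if over_tls else "h2c"


def original_protocol_field(over_tls: bool) -> str:
    """Format the WARC header line (hypothetical helper for illustration)."""
    return f"WARC-Original-Protocol: {original_protocol_id(over_tls)}"


print(original_protocol_field(True))
print(original_protocol_field(False))
```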

ato commented 6 years ago

I've updated the proposal to suggest the use of identifiers from the ALPN protocol ID registry. That provides some guidance on how to handle other protocols (e.g. SPDY) if anyone has a need to.

anjackson commented 6 years ago

Should these be namespaced in some way? e.g.

WARC-Original-Protocol:  alpn-protocol-ids:h2c

I think it's probably fine to pin this new WARC field against RFC7301 only, rather than making it extensible, but others may feel differently about it?

A second issue was whether we want to get into the fact that some tools are perhaps effectively converting HTTP/1.1 down to HTTP/1.0 already, in the sense of removing things like chunked encoding. I guess HTTP/1.0 is a subset of HTTP/1.1 so this doesn't really matter?

ato commented 6 years ago

I'm not a fan of namespacing; it seems like premature generalisation and unnecessary complexity to me. I'm ok with leaving it as just h2 and h2c; I just thought it would be nice to have a clear way forward for any of the other registered protocols. I realise the ALPN registry is for a specific purpose though.

I don't consider removing transfer encoding to be downgrading to HTTP/1.0, as the result is still valid HTTP/1.1.

ato commented 6 years ago

Oh I see, you mean that because HTTP/1.0 doesn't support transfer-encodings, someone could use this same mechanism to strip them? Hmm, are there other differences though?

ato commented 6 years ago

Also, on further consideration, we probably do need to give some guidance on server push responses, even if it's just to say they may be linked with WARC-Concurrent-To and should have the appropriate target URI. I'm not yet sure whether we should do more than that to distinguish them.
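To make the idea concrete, here is a hedged Python sketch of the WARC header fields a pushed response might carry. The function names are hypothetical, linking to the record of the request that triggered the push is an assumption, and the field list is deliberately incomplete (no WARC-Date, Content-Type or Content-Length).

```python
import uuid
from typing import List, Tuple


def new_record_id() -> str:
    """Generate a WARC-Record-ID in the common urn:uuid form."""
    return f"<urn:uuid:{uuid.uuid4()}>"


def pushed_response_headers(
    pushed_uri: str, initiating_record_id: str
) -> List[Tuple[str, str]]:
    """Sketch the WARC header fields for a server-pushed response.

    Assumption: the push is linked to the record of the request that
    triggered it via WARC-Concurrent-To.
    """
    return [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", new_record_id()),
        # The target URI is the pushed resource, not the page requested.
        ("WARC-Target-URI", pushed_uri),
        ("WARC-Concurrent-To", initiating_record_id),
    ]


request_id = new_record_id()
for name, value in pushed_response_headers("https://example.org/style.css", request_id):
    print(f"{name}: {value}")
```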

ato commented 6 years ago

After sleeping on it I realised a general protocol field would actually be very useful in its own right and would solve several problems at once. So I'm splitting that into a separate proposal, #42.

ato commented 6 years ago

After discussing with Ilya in #warc on IIPC Slack I've written up an alternative option for push promise that doesn't involve new record types: #43