iipc / warc-specifications

Centralised repository for WARC usage specifications.
http://iipc.github.io/warc-specifications/

Content-Type grammar inconsistent with examples #38

Open ato opened 5 years ago

ato commented 5 years ago

From WARC 1.1 section 5.6:

(or ‘application/http; msgtype=request’ and ‘application/http; msgtype=response’ respectively)

Note the space after the semicolon. However, the grammar immediately following this prose disallows whitespace in this position. It only allows whitespace inside a parameter value, and only when that value is a quoted-string.

media-type    = type "/" subtype *( ";" parameter )
type          = token
subtype       = token
[...]
token         = 1*<any US-ASCII character
                except CTLs or separators>
separators    = [...] | SP | HT
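
To make the mismatch concrete, here is a minimal sketch of the strict grammar as a Python regular expression. The names `TOKEN`, `QUOTED`, `PARAM` and `matches_strict` are made up for illustration. The `parameter` rule is elided above, so the sketch assumes the RFC 2616 form (`attribute "=" value`, where `value` is a token or a quoted-string) and the RFC 2616 separator set.

```python
import re

# token: US-ASCII characters except CTLs and separators.
# Assumed separator set (RFC 2616): ( ) < > @ , ; : \ " / [ ] ? = { } SP HT
TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
# quoted-string, allowing backslash-escaped characters inside the quotes
QUOTED = r'"(?:[^"\\]|\\.)*"'
# parameter = attribute "=" value (assumed, the rule is elided in the excerpt)
PARAM = rf"{TOKEN}=(?:{TOKEN}|{QUOTED})"
# media-type = type "/" subtype *( ";" parameter ), no whitespace allowed
STRICT_MEDIA_TYPE = re.compile(rf"{TOKEN}/{TOKEN}(?:;{PARAM})*")


def matches_strict(value: str) -> bool:
    """Return True if value matches the strict grammar exactly."""
    return STRICT_MEDIA_TYPE.fullmatch(value) is not None


print(matches_strict("application/http;msgtype=request"))
print(matches_strict("application/http; msgtype=request"))
```

Under these assumptions the form without the space matches, and the form used in the spec's own prose does not.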

It appears revised HTTP standards have addressed this problem as the grammar in RFC 7231 explicitly allows optional white space in this position:

media-type = type "/" subtype *( OWS ";" OWS parameter )

Where OWS is defined in RFC 7230:

     OWS            = *( SP / HTAB )
                    ; optional whitespace

Future revisions / errata of the WARC standard should make the same grammar correction.

Note that Heritrix writes the Content-Type header for HTTP requests and responses with a space after the semicolon, so a very large number of WARCs in the wild require this grammar change in order to be parsed successfully.
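
For WARC readers that need to cope with existing files, a tolerant parser along the lines of the RFC 7231 grammar might look like the following sketch. It is not taken from any existing WARC library. The function name `parse_media_type` and the choice to lowercase the type, subtype and parameter names are assumptions made for illustration, and the `parameter` rule is assumed to be `token "=" ( token / quoted-string )`.

```python
import re

TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
QUOTED = r'"(?:[^"\\]|\\.)*"'
OWS = r"[ \t]*"  # OWS = *( SP / HTAB )

# media-type = type "/" subtype *( OWS ";" OWS parameter )
MEDIA_TYPE = re.compile(
    rf"({TOKEN})/({TOKEN})((?:{OWS};{OWS}{TOKEN}=(?:{TOKEN}|{QUOTED}))*)"
)
# Used to walk the already validated parameter section one parameter at a time.
PARAM_RE = re.compile(rf"{OWS};{OWS}({TOKEN})=({TOKEN}|{QUOTED})")


def parse_media_type(value: str):
    """Parse a media type, accepting optional whitespace around ';'."""
    match = MEDIA_TYPE.fullmatch(value)
    if match is None:
        raise ValueError(f"not a valid media type: {value!r}")
    params = {}
    for name, raw in PARAM_RE.findall(match.group(3)):
        if raw.startswith('"'):
            # Strip the surrounding quotes and undo backslash escapes.
            raw = re.sub(r"\\(.)", r"\1", raw[1:-1])
        params[name.lower()] = raw
    return match.group(1).lower(), match.group(2).lower(), params


print(parse_media_type("application/http; msgtype=request"))
print(parse_media_type("application/http;msgtype=request"))
```

Both spellings of the header yield the same result with this approach, so a reader built this way would accept records written with or without the space.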

wumpus commented 5 years ago

Another example from the wild: wget generates WARCs without the space.