iipc / warc-specifications

Centralised repository for WARC usage specifications.
http://iipc.github.io/warc-specifications/

Are undefined headers allowed in WARC records? #3

Closed kris-sigur closed 10 years ago

kris-sigur commented 10 years ago

The spec mentions numerous headers (e.g. WARC-Record-ID) for different record types and profiles.

It is however silent (as far as I can tell) on whether other headers are allowed.

One option is to disallow them entirely, meaning that WARCs with headers not covered by the spec are treated as invalid. This limits the format's flexibility but makes it easier to build validation tools.

If additional headers are allowed, the main risk is that a header added to the spec in a future revision may already be in use in a way that does not conform to the revised spec.

johnerikhalse commented 10 years ago

A compromise could be to allow non-standard headers, named in line with the convention for non-standard MIME types. According to the standard for MIME types, any non-standard type should start with "X-".

If this naming convention were part of the WARC standard, we could add headers like "X-My-Header". Tools could just skip headers starting with "X-" if they don't understand them. There would also be no risk of name clashes if new headers are added to the standard in the future, since none of them should start with "X-".

anjackson commented 10 years ago

Although, I'm afraid RFC 6648 actually deprecated the use of x- prefixes, so maybe we should take their rationale into account.

Is it safe for a WARC parser to ignore any header that it does not understand? Ideally, it should be, I think.
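
To make the question concrete, here is a minimal sketch of a lenient parser that keeps the fields it recognises and sets the rest aside instead of failing. The `KNOWN_FIELDS` set and the `split_fields` helper are hypothetical names for illustration, the field list is deliberately incomplete, and treating field names as case-insensitive is an assumption about the spec rather than something established in this thread.

```python
# Illustrative subset of spec-defined field names (lower-cased); not exhaustive.
KNOWN_FIELDS = {
    "warc-type",
    "warc-record-id",
    "warc-date",
    "content-length",
    "content-type",
    "warc-target-uri",
}


def split_fields(header_lines):
    """Separate recognised fields from ones this parser does not understand."""
    known, ignored = {}, {}
    for line in header_lines:
        # Split on the first colon only, so values such as URNs stay intact.
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a named field; a real parser would report this
        name = name.strip()
        # Assumption: field names are compared case-insensitively.
        target = known if name.lower() in KNOWN_FIELDS else ignored
        target[name] = value.strip()
    return known, ignored


known, ignored = split_fields([
    "WARC-Type: response",
    "WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>",
    "My-Custom-Field: some value",
])
```

Here `My-Custom-Field` ends up in `ignored`, so the record is still processed. A stricter tool could instead reject any record for which `ignored` is non-empty.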

kris-sigur commented 10 years ago

@anjackson Agreed.

For those interested, Appendix B of RFC 6648 contains most of the rationale against the "X-" convention, beginning with:

   The primary problem with the "X-" convention is that unstandardized
   parameters have a tendency to leak into the protected space of
   standardized parameters, thus introducing the need for migration from
   the "X-" name to a standardized name.

kris-sigur commented 10 years ago

It looks like I was wrong and this is already resolved in the standard. In chapter 4 I found:

Named fields may appear in any order and field values may contain any UTF-8 character. Both defined-fields and extension-fields follow the generic named-field format. Extension fields may be used in extensions of the core format.

Unless I'm reading this wrong, this means that fields other than those specified in the standard are allowed; they simply need to follow the same constraints as defined-fields.
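
Under that reading, a validator only needs to check that an unknown field is well formed, not that its name is on a fixed list. A rough sketch follows. It assumes the generic named-field format is `field-name ":" value` with the field name being an HTTP-style token (printable US-ASCII, excluding separators); that is my interpretation of the grammar and should be checked against the spec text. The `is_token` and `is_well_formed_field` helpers are hypothetical.

```python
# Assumed separator set for an HTTP-style "token"; verify against the spec.
SEPARATORS = set('()<>@,;:\\"/[]?={} \t')


def is_token(name):
    """True if name is non-empty printable US-ASCII with no separators."""
    return bool(name) and all(
        33 <= ord(ch) <= 126 and ch not in SEPARATORS for ch in name
    )


def is_well_formed_field(line):
    """True if line looks like 'field-name: value' with a valid token name."""
    name, sep, _value = line.partition(":")
    return bool(sep) and is_token(name)
```

With this check, `is_well_formed_field("My-Custom-Field: some value")` passes even though the name is not defined in the standard, while a line such as `"Bad Name: x"` fails because of the space in the field name.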