Closed by kris-sigur 10 years ago
A compromise could be to allow creation of non-standard headers with naming in line with non-standard MIME-types. According to the standard for MIME-types, any non-standard type should start with "X-".
If this naming convention were part of the WARC standard, we could add headers like "X-My-Header". Tools could just skip headers starting with "X-" if they don't understand them. And there's no risk of name clashes if new headers are added to the standard in the future, since none of them should start with "X-".
Although, I'm afraid RFC 6648 actually deprecated the use of "X-" prefixes, so maybe we should take their rationale into account.
Is it safe for a WARC parser to ignore any header that it does not understand? Ideally, it should be, I think.
@anjackson Agreed.
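As a rough illustration of what "ignoring" could look like in practice, here is a minimal Python sketch. The function name `split_known_unknown` and the `KNOWN_FIELDS` set are made up for this example (the set is only a small subset of the fields the spec defines), and comparing names case-insensitively is an assumption here. Unknown headers are set aside rather than dropped, so a tool could still write them back out unchanged.

```python
# Illustrative subset of field names defined by the WARC spec; not the full list.
KNOWN_FIELDS = {"warc-type", "warc-record-id", "warc-date", "content-length"}


def split_known_unknown(headers):
    """Separate headers this tool understands from those it does not.

    `headers` is a list of (name, value) pairs. Nothing is rejected:
    unknown headers are returned alongside the known ones.
    """
    known, unknown = [], []
    for name, value in headers:
        # Assumption: field names are compared case-insensitively.
        if name.lower() in KNOWN_FIELDS:
            known.append((name, value))
        else:
            unknown.append((name, value))
    return known, unknown
```

A caller would act only on `known` and carry `unknown` along untouched.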
For those interested, Appendix B of RFC 6648 contains most of the rationale against the "X-" format, beginning with:
The primary problem with the "X-" convention is that unstandardized
parameters have a tendency to leak into the protected space of
standardized parameters, thus introducing the need for migration from
the "X-" name to a standardized name.
It looks like I was wrong and this is already resolved in the standard. In chapter 4 I found:
Named fields may appear in any order and field values may contain any UTF-8 character. Both defined-fields and extension-fields follow the generic named-field format. Extension fields may be used in extensions of the core format.
Unless I'm reading this wrong, this means that fields other than those specified in the standard are allowed and simply need to follow the same constraints as defined-fields.
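To make that reading concrete, here is a minimal Python sketch of a parser that treats defined-fields and extension-fields identically, since both share the generic named-field shape. The function name `parse_named_fields` is hypothetical, and the sketch assumes CRLF-terminated "Name: value" lines and ignores any line folding the grammar may allow, so it is a simplification rather than the full grammar from the standard.

```python
def parse_named_fields(block):
    """Parse 'Name: value' lines into (name, value) pairs, preserving order.

    No field name is special-cased: a header the spec defines and one it
    does not are parsed the same way.
    """
    fields = []
    # Assumption: lines are terminated by CRLF.
    for line in block.split("\r\n"):
        if not line.strip():
            continue  # skip the blank line that ends the header block
        # Split at the first colon only, so values may contain colons.
        name, sep, value = line.partition(":")
        if not sep or not name.strip():
            raise ValueError(f"not a named field: {line!r}")
        fields.append((name.strip(), value.strip()))
    return fields
```

Splitting at the first colon keeps values such as URIs intact.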
The spec mentions numerous headers (e.g. WARC-Record-ID) for different record types and profiles.
It is, however, silent (as far as I can tell) on whether other headers are allowed.
One option is to disallow them entirely, meaning WARCs with headers not covered by the spec are viewed as invalid. This limits the format's flexibility but does make it easier to build validation tools.
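For what it's worth, the strict option is simple to check mechanically. The following Python sketch is only an illustration: `ALLOWED_FIELDS`, `find_unrecognized`, and `is_valid_strict` are hypothetical names, the allowed set is a small subset of what the spec lists, and case-insensitive name matching is assumed.

```python
# Illustrative subset of spec-defined field names; a real validator would
# need the complete list for the spec revision it targets.
ALLOWED_FIELDS = {"warc-type", "warc-record-id", "warc-date", "content-length"}


def find_unrecognized(headers):
    """Return the names of headers that are not in ALLOWED_FIELDS."""
    # Assumption: field names are compared case-insensitively.
    return [name for name, _ in headers if name.lower() not in ALLOWED_FIELDS]


def is_valid_strict(headers):
    """Under the strict policy, any unrecognized header invalidates the record."""
    return not find_unrecognized(headers)
```

The cost shows up in the allowed set itself: it has to be updated for every spec revision, and older validators would reject records that use newly added fields.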
If additional headers are allowed, the main risk is that a header added to the spec in a future revision may already be in use in a way that does not conform to the revised spec.