Open yotann2 opened 4 years ago
(To clarify the compatibility issue: http://example.com/=?iso-8859-1?q?=31?= is a valid URI. If you archive it with one of the existing WARC tools, you'll get a field like WARC-Target-URI: http://example.com/=?iso-8859-1?q?=31?= in the record. If you feed that into a tool that supports RFC2047, it would decode the field into http://example.com/1, which is different from the original URI.)
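As a rough illustration of that mismatch, here is a minimal sketch that uses Python's standard email.header.decode_header as a stand-in for the hypothetical RFC2047-aware tool. The helper name decode_field_value is made up for this example, and Python's decoder is more permissive than RFC2047§5 about where encoded-words may appear, so this only shows what a lenient decoder might do, not what any existing WARC tool does.

```python
from email.header import decode_header


def decode_field_value(value: str) -> str:
    """Decode RFC 2047 encoded-words in a field value (hypothetical helper)."""
    parts = []
    for chunk, charset in decode_header(value):
        # decode_header yields bytes for chunks of a value that contains
        # encoded-words, and a plain str when it found none at all.
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii")
        parts.append(chunk)
    return "".join(parts)


# The URI from the example above, exactly as it would appear in WARC-Target-URI.
original_uri = "http://example.com/=?iso-8859-1?q?=31?="
decoded_uri = decode_field_value(original_uri)

print(decoded_uri)
print(decoded_uri == original_uri)
```

With a decoder like this, the value read back from the record no longer matches the URI that was archived, which is the compatibility problem described above.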
WARC 1.0 and 1.1 say that "the ‘encoded-word’ mechanism of [RFC2047] may also be used when writing WARC fields." However, RFC2047§5 gives strict limitations on which fields may hold encoded-words, and requires that encoded-words be separated from various other tokens with whitespace. The last version of HTTP that included encoded-words, RFC2616§2.2, also limited which fields may hold them. The WARC standards don't specify which of these requirements apply to WARC, if any. If they do apply, it would seem that encoded-words are not allowed in any of the standard WARC fields.
I've checked several WARC implementations and none of them actually produce or decode encoded-words. These tools can produce headers like
WARC-Target-URI: http://example.com/=?iso-8859-1?q?=31?=
that would cause compatibility issues with a hypothetical tool that supported encoded-words. Furthermore, even in web browsers, encoded-words are parsed inconsistently. Therefore, I propose removing RFC2047 encoded-words entirely from future versions of WARC.