iipc / warc-specifications

Centralised repository for WARC usage specifications.
http://iipc.github.io/warc-specifications/

Encoded-words underspecified and unsupported #67

Open yotann2 opened 4 years ago

yotann2 commented 4 years ago

WARC 1.0 and 1.1 say that "the ‘encoded-word’ mechanism of [RFC2047] may also be used when writing WARC fields." However, RFC2047§5 gives strict limitations on which fields may hold encoded-words, and requires that encoded-words be separated from various other tokens with whitespace. The last version of HTTP that included encoded-words, RFC2616§2.2, also limited which fields may hold them. The WARC standards don't specify which of these requirements apply to WARC, if any. If they do apply, it would seem that encoded-words are not allowed in any of the standard WARC fields.

I've checked several WARC implementations and none of them actually produce or decode encoded-words. These tools can write a header such as WARC-Target-URI: http://example.com/=?iso-8859-1?q?=31?= verbatim, which a hypothetical tool that did support encoded-words would misinterpret. Furthermore, even in web browsers, encoded-words are parsed inconsistently.

Therefore, I propose removing RFC2047 encoded-words entirely from future versions of WARC.

yotann2 commented 4 years ago

(To clarify the compatibility issue: http://example.com/=?iso-8859-1?q?=31?= is a valid URI. If you archive it with one of the existing WARC tools, you'll get a field like WARC-Target-URI: http://example.com/=?iso-8859-1?q?=31?=. If you feed that into a tool that supports RFC2047, it would decode the field into http://example.com/1, which is different from the original URI.)
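
A rough way to reproduce the mismatch is to use Python's standard email.header.decode_header as a stand-in for the hypothetical RFC2047-aware WARC reader. This is only a sketch: decode_field_value is a made-up helper name, and Python's parser is more lenient than RFC2047§5 (it accepts encoded-words that are not separated by whitespace), so it illustrates one possible reader behaviour, not what any existing WARC tool does.

```python
from email.header import decode_header


def decode_field_value(value: str) -> str:
    """Decode RFC 2047 encoded-words in a field value (hypothetical helper).

    Stands in for a WARC reader that applies encoded-word decoding
    to every field, which no surveyed tool is known to do.
    """
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Plain runs come back with charset None; treat them as ASCII.
            parts.append(chunk.decode(charset or "ascii", errors="replace"))
        else:
            # No encoded-word was found, so the value is returned as str.
            parts.append(chunk)
    return "".join(parts)


# The URI as an existing WARC writer would store it, unmodified.
original = "http://example.com/=?iso-8859-1?q?=31?="
decoded = decode_field_value(original)

print("written:", original)
print("decoded:", decoded)
print("round-trips:", decoded == original)
```

With a reader like this, the stored WARC-Target-URI no longer matches the URI that was archived, which is the compatibility problem described above. A field value without anything resembling an encoded-word passes through unchanged.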