iipc / warc-specifications

Centralised repository for WARC usage specifications.
http://iipc.github.io/warc-specifications/

Deprecate line folding #74

Open ato opened 3 years ago

ato commented 3 years ago

WARC inherited line folding from HTTP, which presumably included it for compatibility with MIME messages, which have line length limits. The newer HTTP RFCs deprecated it and disallowed its use by senders, because differences between implementations made it a frequent source of security errors. Indeed, some of the existing WARC implementations differ in how they interpret folded lines, and none that I'm aware of will actually emit them.
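To make the interoperability concern concrete, here is a minimal sketch of unfolding, assuming a continuation line is any line that starts with a space or tab. The function name `parse_header_lines`, the `fold_joiner` parameter and the `X-Long` field are hypothetical and not taken from any existing WARC library. The joiner is exactly the kind of detail on which readers can disagree.

```python
def parse_header_lines(block: str, fold_joiner: str = " ") -> list[tuple[str, str]]:
    """Parse CRLF-separated named fields, unfolding continuation lines.

    fold_joiner is the text placed between a field value and its
    continuation. It is a parameter here only to show that different
    choices yield different values.
    """
    fields: list[tuple[str, str]] = []
    for line in block.split("\r\n"):
        if not line:
            continue  # skip the empty string left by a trailing CRLF
        if line[0] in " \t":
            # Folded line: it continues the value of the previous field.
            if not fields:
                raise ValueError("continuation line before any field")
            name, value = fields[-1]
            fields[-1] = (name, value + fold_joiner + line.strip(" \t"))
        else:
            name, sep, value = line.partition(":")
            if not sep:
                raise ValueError(f"not a named field: {line!r}")
            fields.append((name, value.strip(" \t")))
    return fields


block = "WARC-Type: warcinfo\r\nX-Long: first part\r\n second part\r\n"
print(parse_header_lines(block))                   # joins the fold with one space
print(parse_header_lines(block, fold_joiner=""))   # concatenates without a space
```

The two calls return different values for the same bytes, which is the sort of divergence between readers described above.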

I propose line folding be similarly deprecated in the next WARC version and a note included that writers of WARC files should not emit it.

JustAnotherArchivist commented 3 years ago

Yes please!

Although I'd also follow the HTTP way and forbid its use by WARC writers under the new version, not just discourage it. In other words: writers of WARC files shall not (rather than should not) emit it.

In fact, we could go even further than that and remove it from the next version entirely. This wasn't possible in HTTP because the version (1.1) stayed the same between RFCs 2616 and 7230, so an HTTP parser wouldn't be able to tell which RFC was used by the server. This problem does not exist here. Of course, WARC software would still have to support line folding to read WARC records with versions 1.0 and 1.1 correctly, but I don't see a reason why it couldn't be outright removed in future versions.
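A rough sketch of how a reader could gate folding on the record's version line, assuming folding stays legal only for WARC/1.0 and WARC/1.1. The names `FOLDING_VERSIONS`, `allows_folding` and `reject_folds` are hypothetical, and `WARC/2.0` is only a placeholder for some future version, not a number anyone has proposed.

```python
# Versions whose grammar permits folded lines (per the discussion above).
FOLDING_VERSIONS = frozenset({"WARC/1.0", "WARC/1.1"})


def allows_folding(version_line: str) -> bool:
    """Return True if records with this version line may contain folded lines."""
    return version_line.strip() in FOLDING_VERSIONS


def reject_folds(version_line: str, header_lines: list[str]) -> None:
    """Raise ValueError if a folded line appears where the version forbids it."""
    if allows_folding(version_line):
        return
    for number, line in enumerate(header_lines, start=1):
        # A line starting with SP or HT would be a continuation line.
        if line[:1] in (" ", "\t"):
            raise ValueError(
                f"folded line at header line {number} "
                f"is not allowed in {version_line.strip()}"
            )


reject_folds("WARC/1.1", ["X-Long: first part", " second part"])  # accepted
```

Because the version line comes first in every record, the reader knows which rule applies before it sees any field, which is the property HTTP/1.1 lacked.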

JustAnotherArchivist commented 3 years ago

Minor addition about folding implementations in the wild: wpull emits folded lines, albeit not in record headers, only in the warcinfo record body (which shares the same syntax rules). Specifically, there's a line length limit of 1024 characters, and at least one field exceeds that on every execution, plus a second one if the actual wpull command is very long.
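For illustration only, a writer that folds at a length limit might look roughly like the sketch below. This is not wpull's actual code: the function `fold_field`, the `X-Example` field and the choice to break at existing spaces are all assumptions, and only the 1024-character default comes from the comment above.

```python
import textwrap


def fold_field(name: str, value: str, limit: int = 1024) -> str:
    """Fold "name: value" so that no line exceeds `limit` characters.

    Breaks happen only at existing spaces, and each continuation line
    starts with a single space. A word longer than the limit is left
    unbroken, so such a line can still exceed it.
    """
    lines = textwrap.wrap(
        f"{name}: {value}",
        width=limit,
        subsequent_indent=" ",
        break_long_words=False,
        break_on_hyphens=False,
    )
    return "\r\n".join(lines) + "\r\n"


# A small limit makes the folding visible in a short example.
print(fold_field("X-Example", "aaa bbb ccc ddd", limit=12))
```

A reader that joins continuation lines with a single space recovers the original value here, but only because the value contained single spaces to begin with.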