Closed sebastian-nagel closed 4 years ago
Thanks for collecting those examples. Yeah, I was thinking it'd probably make sense to have HTTP parsing lenient by default and WARC parsing strict by default with options for both.
Another error I've encountered is HTTrack-generated HTTP requests whose request target contains '#' (i.e. HTTrack sometimes sends the fragment part of the URL).
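One possible lenient treatment of such a request line, sketched in Python for illustration only: split the line into its three parts and cut the request target at the first '#'. The function name `strip_fragment` is hypothetical and not part of jwarc; a lenient parser could just as well keep the fragment and only refrain from failing.

```python
def strip_fragment(request_line: str) -> str:
    """Drop a '#fragment' from the target of 'METHOD target VERSION'."""
    parts = request_line.split(" ")
    if len(parts) != 3:
        # Not a well-formed request line; leave it for the caller to handle.
        return request_line
    method, target, version = parts
    # Everything from the first '#' onward is the fragment.
    target = target.split("#", 1)[0]
    return f"{method} {target} {version}"
```

For example, `strip_fragment("GET /page.html#top HTTP/1.1")` returns `"GET /page.html HTTP/1.1"`.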
Note (see #33): white space before the colon in header lines could be considered compliant, given the comment about optional "linear white space" in RFC 2616 (section 2.1):
implied *LWS The grammar described by this specification is word-based. Except where noted otherwise, linear white space (LWS) can be included between any two adjacent words (token or quoted-string), and between adjacent words and separators, without changing the interpretation of a field.
HttpParser strictly follows RFC 2616 / RFC 7230. It is definitely good to have a validating parser available to check and verify WARC writing software. However, web servers may not follow the RFCs, and the WARC 1.1 spec does not require the content of a response record to strictly follow the HTTP spec.
While testing several WARC files of different origin, I've so far seen the following types of errors that make the strict HttpParser fail (see #23 regarding logging of errors):
white space before the colon in header lines "name: value" (http_header_exception_1.warc.gz, Common Crawl Aug 2018):
invalid character (control character) in header value (http_header_exception_3.warc.gz, Wget/1.17.1):
no space after status code in status line if message is empty (http_message_1.warc.gz, Wget/1.19.4, cf. NUTCH-2763):
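To make the difference between strict and lenient handling of these three cases concrete, here is a minimal Python sketch. It is not jwarc's HttpParser; the names `parse_header_line` and `parse_status_line` and the `strict` flag are assumptions made for illustration, and lines are assumed to be already decoded and stripped of their trailing CRLF.

```python
import re

# Header name characters ("token") as defined in RFC 7230 section 3.2.6.
_TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
# Visible characters, space, tab and obs-text; no other control characters.
_VALUE = r"[\t\x20-\x7e\x80-\xff]*"

_STRICT_HEADER = re.compile(rf"({_TOKEN}):[ \t]*({_VALUE}?)[ \t]*")
# Lenient: white space before the colon and any characters in the value.
_LENIENT_HEADER = re.compile(r"([^:\s]+)[ \t]*:[ \t]*(.*?)[ \t]*")

_STRICT_STATUS = re.compile(rf"(HTTP/[0-9]\.[0-9]) ([0-9]{{3}}) ({_VALUE})")
# Lenient: the space and reason phrase after the status code are optional.
_LENIENT_STATUS = re.compile(
    r"(HTTP/[0-9]\.[0-9])[ \t]+([0-9]{3})(?:[ \t]+(.*))?[ \t]*"
)


def parse_header_line(line: str, strict: bool = False):
    """Return (name, value) or raise ValueError."""
    pattern = _STRICT_HEADER if strict else _LENIENT_HEADER
    m = pattern.fullmatch(line)
    if m is None:
        raise ValueError(f"invalid header line: {line!r}")
    return m.group(1), m.group(2)


def parse_status_line(line: str, strict: bool = False):
    """Return (version, status code, reason) or raise ValueError."""
    pattern = _STRICT_STATUS if strict else _LENIENT_STATUS
    m = pattern.fullmatch(line)
    if m is None:
        raise ValueError(f"invalid status line: {line!r}")
    return m.group(1), int(m.group(2)), m.group(3) or ""
```

With `strict=False`, `"Content-Type : text/html"`, a value containing a control character, and the status line `"HTTP/1.1 200"` are all accepted; with `strict=True` each of them raises `ValueError`.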
To allow jwarc to be used for WARC files with invalid HTTP headers as well, no matter whether they result from bugs in the WARC writer or in the responding web server, a lenient HttpParser would be good to have. In addition, the WARC reader could simply continue reading until the "\r\n\r\n" that indicates the end of the header.
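The fallback of reading up to the blank line could look roughly like the following Python sketch. The function name `read_header_block` and the 64 KiB `limit` are assumptions chosen for illustration, not jwarc behaviour; the limit only guards against records that never contain the terminator.

```python
import io


def read_header_block(stream, limit: int = 65536) -> bytes:
    """Read up to and including the first CRLF CRLF, without parsing."""
    buf = bytearray()
    while not buf.endswith(b"\r\n\r\n"):
        if len(buf) >= limit:
            raise ValueError("header block exceeds limit")
        byte = stream.read(1)
        if not byte:
            raise EOFError("stream ended before end of header")
        buf += byte
    return bytes(buf)


# Usage: the header is returned unparsed, the stream is left at the body.
record = io.BytesIO(b"HTTP/1.1 200\r\nX-Foo : bar\r\n\r\n<html>")
header = read_header_block(record)
body = record.read()
```

Reading byte by byte keeps the sketch short; a real reader would more likely scan a buffered chunk for the terminator.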