Closed by sebastian-nagel 5 months ago
I've added a lenient parsing mode which accepts \n line endings and the WARC/0.18 version string.
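To illustrate what such a lenient mode might involve, here is a minimal Python sketch. It is not the library's actual implementation: the function name `parse_warc_header`, the `lenient` flag, and the sets of accepted version strings are all assumptions made for this example. Strict mode requires \r\n and a WARC/1.x version line; lenient mode additionally tolerates bare \n and WARC/0.18. Continuation lines are not handled.

```python
# Hypothetical sketch; names and accepted versions are assumptions, not the library's API.
STRICT_VERSIONS = {b"WARC/1.0", b"WARC/1.1"}
LENIENT_VERSIONS = STRICT_VERSIONS | {b"WARC/0.18"}


def parse_warc_header(data: bytes, lenient: bool = False):
    """Parse one WARC header block.

    Returns (version, fields, body_offset), where body_offset is the index
    of the first byte after the blank line that ends the header block.
    """
    versions = LENIENT_VERSIONS if lenient else STRICT_VERSIONS
    pos = 0
    lines = []
    while True:
        nl = data.find(b"\n", pos)
        if nl == -1:
            raise ValueError("unterminated header block")
        line = data[pos:nl]
        if line.endswith(b"\r"):
            line = line[:-1]  # standard CRLF line ending
        elif not lenient:
            raise ValueError(f"bare LF line ending at offset {nl}")
        pos = nl + 1
        if line == b"":
            break  # blank line terminates the header block
        lines.append(line)

    if not lines or lines[0] not in versions:
        raise ValueError("unsupported or missing WARC version line")

    fields = {}
    for line in lines[1:]:
        name, sep, value = line.partition(b":")
        if not sep:
            raise ValueError(f"malformed header line: {line!r}")
        fields[name.strip().decode("ascii").lower()] = value.strip().decode("utf-8")
    return lines[0].decode("ascii"), fields, pos


# Usage: a ClueWeb09-style header with bare LF endings and the 0.18 version string.
sample = b"WARC/0.18\nWARC-Type: response\nContent-Length: 5\n\nhello"
version, fields, body_offset = parse_warc_header(sample, lenient=True)
```

Keeping the relaxed behaviour behind an explicit flag means well-formed files are still validated strictly by default.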
The ClueWeb09 sample files are quite broken beyond that, though: their Content-Length headers are also incorrect. My guess is that the files originally had \r\n line endings which were later stripped to \n, so the declared lengths no longer match the actual record bodies. I don't see a straightforward way to recover from that during parsing.
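The following Python sketch illustrates that guess with made-up data. It has not been checked against the actual ClueWeb09 files, and the payload and the helper name `strip_carriage_returns` are hypothetical. If a Content-Length was computed while the payload still used \r\n and the carriage returns were removed afterwards, the declared length overshoots by one byte per stripped line ending.

```python
def strip_carriage_returns(payload: bytes) -> bytes:
    """Simulate the suspected damage: every CRLF is rewritten to a bare LF."""
    return payload.replace(b"\r\n", b"\n")


# Hypothetical record payload as it might have looked before the damage.
original = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"

# Content-Length as it would have been written for the original payload.
declared_length = len(original)

stripped = strip_carriage_returns(original)

# The header now claims more bytes than the payload contains.
shortfall = declared_length - len(stripped)
print(shortfall)  # one byte per stripped CRLF; 3 for this payload
```

A parser cannot simply subtract the shortfall, because it does not know how many line endings inside the body were originally \r\n rather than \n.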
It would be nice if the WARC standard were accompanied by a set of examples along the lines of "these are some known broken WARC files out in the wild, you might want to think about either parsing them or at least giving good error messages when you see them".
The ClueWeb09 dataset WARC files (see sample files) use a single line feed \n as separator between WARC headers. The WarcParser expects \r\n (which would conform to the standard) and fails:
See also #25 for a similar issue regarding HttpParser.