ClueWeb09 WARC files fail to parse

iipc / jwarc

Java library for reading and writing WARC files with a typed API

Apache License 2.0


ClueWeb09 WARC files fail to parse #26

Closed sebastian-nagel closed 5 months ago

sebastian-nagel commented 4 years ago

The ClueWeb09 dataset WARC files (see sample files) use a single line feed (\n) as the separator between WARC header lines. The WarcParser expects \r\n, as the standard requires, and fails:

Exception in thread "main" org.netpreserve.jwarc.ParsingException: invalid WARC record at position 9: WARC/0.18<-- HERE -->\nWARC-Type: warcinfo\nWARC-Date: 2009-03-...

See also #25 for a similar issue regarding HttpParser.

ato commented 5 months ago

I've added a lenient parsing mode which accepts \n line endings and the WARC/0.18 version string.
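To illustrate what a strict/lenient split can look like, here is a minimal, self-contained sketch. It is not jwarc's actual implementation: the class `LenientHeaderSketch`, the method `parseHeader`, and the exact strict rule (single-digit version numbers, CRLF only) are assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not part of jwarc.
final class LenientHeaderSketch {
    /**
     * Parses a WARC header block into name/value pairs.
     * Strict mode (assumed rule): "WARC/d.d" version line, CRLF line endings only.
     * Lenient mode: also accepts bare LF and multi-digit versions such as WARC/0.18.
     */
    static Map<String, String> parseHeader(String block, boolean lenient) {
        // In strict mode an LF-only block is never split, so the version check below fails.
        String[] lines = block.split(lenient ? "\r?\n" : "\r\n");
        String versionPattern = lenient ? "WARC/[0-9]+\\.[0-9]+" : "WARC/[0-9]\\.[0-9]";
        if (!lines[0].matches(versionPattern)) {
            throw new IllegalArgumentException("invalid WARC version line");
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i];
            if (line.isEmpty()) {
                break; // a blank line ends the header block
            }
            int colon = line.indexOf(':');
            // Reject lines without a field name or with stray CR/LF characters.
            if (colon < 1 || line.indexOf('\n') >= 0 || line.indexOf('\r') >= 0) {
                throw new IllegalArgumentException("invalid WARC header line");
            }
            fields.put(line.substring(0, colon), line.substring(colon + 1).trim());
        }
        return fields;
    }
}
```

With this sketch, `parseHeader("WARC/0.18\nWARC-Type: warcinfo\n\n", true)` returns a map containing `WARC-Type`, while the same input with `lenient` set to `false` throws `IllegalArgumentException`.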

The ClueWeb09 sample files are quite broken though. They also have incorrect Content-Length headers. My guess is the files originally had \r\n line endings but then had them stripped to \n. I don't see a straightforward way to recover from that during parsing.
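The guess above can be checked with a small experiment. The sketch below assumes a record body whose Content-Length was computed while it still had \r\n line endings, then strips each \r\n to \n and counts the lost bytes. The class and method names are hypothetical, and the sample body is made up for illustration; it is not taken from the ClueWeb09 files.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not part of jwarc.
final class ContentLengthDriftSketch {
    /** Counts CRLF pairs, i.e. the bytes lost when each CRLF is rewritten to LF. */
    static int bytesLostByStrippingCr(byte[] original) {
        int lost = 0;
        for (int i = 0; i + 1 < original.length; i++) {
            if (original[i] == '\r' && original[i + 1] == '\n') {
                lost++;
            }
        }
        return lost;
    }

    /** Rewrites every CRLF to LF; ISO-8859-1 maps bytes to chars one-to-one. */
    static byte[] stripCr(byte[] original) {
        String text = new String(original, StandardCharsets.ISO_8859_1);
        return text.replace("\r\n", "\n").getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Made-up HTTP response used as a record body; it contains three CRLF pairs. */
    static byte[] sampleBody() {
        String body = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>";
        return body.getBytes(StandardCharsets.ISO_8859_1);
    }
}
```

For `sampleBody()`, the stripped body is three bytes shorter than the length the header would have declared, so a reader that trusts Content-Length overshoots into the next record. If a payload could also contain bare \n originally, the stripped bytes alone do not show which line feeds lost a \r, which is consistent with there being no straightforward recovery.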

wumpus commented 5 months ago

It would be nice to have some examples distributed adjacent to the WARC standard of "these are some known broken WARC files out in the wild, you might want to think about either parsing them or at least giving good error messages when you see them".