iipc / jwarc

Java library for reading and writing WARC files with a typed API
Apache License 2.0

wget quirk: Content-Length off by one #29

Closed. ato closed this issue 3 years ago.

ato commented 4 years ago

Some versions of wget generated WARC headers with an off-by-one Content-Length. This causes us to throw:

org.netpreserve.jwarc.ParsingException: invalid WARC trailer: a0d0a57

Examples:

Other implementations appear to ignore this error, perhaps by simply skipping any number of CR and LF characters before reading the next record?

I don't want to silently ignore this, but perhaps we could log a warning and attempt to continue.
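A rough sketch of the "warn and continue" idea, in plain Java. This is not jwarc's actual parser: `LenientTrailer` and `skipLineBreaks` are hypothetical names used only for illustration. It assumes the hex value in the exception above can be read as the four bytes 0a 0d 0a 57, i.e. the body consumed the first CR of the trailer, so the reader sees LF CR LF followed by the `W` of the next record's `WARC/1.0` line.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of jwarc's API.
public class LenientTrailer {

    /**
     * Skips any run of CR and LF bytes and returns how many were skipped.
     * A well-formed WARC record trailer is exactly CR LF CR LF (4 bytes);
     * anything else is reported on stderr instead of being thrown.
     */
    static int skipLineBreaks(PushbackInputStream in) throws IOException {
        StringBuilder seen = new StringBuilder();
        int b;
        while ((b = in.read()) == '\r' || b == '\n') {
            seen.append((char) b);
        }
        if (b != -1) {
            in.unread(b); // push back the first byte of the next record
        }
        if (!seen.toString().equals("\r\n\r\n")) {
            System.err.println("warning: malformed WARC trailer ("
                    + seen.length() + " line-break bytes)");
        }
        return seen.length();
    }

    public static void main(String[] args) throws IOException {
        // Simulated stream position after a body whose Content-Length
        // was one byte too large: the leading CR is already consumed.
        byte[] data = "\n\r\nWARC/1.0\r\n".getBytes(StandardCharsets.US_ASCII);
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(data));
        int skipped = skipLineBreaks(in);
        System.out.println(skipped + " " + (char) in.read()); // prints "3 W"
    }
}
```

Note that this only recovers from a Content-Length that is too large by a line-break byte. If the length is too small, the leftover body bytes would still precede the trailer and would need separate handling.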

sebastian-nagel commented 4 years ago

+1 to skip trailing empty lines, cf. warcio's archiveiterator.py. With per-record compressed WARC files, the Content-Length is not really required for reading; it's more of a validation feature, same as the digests.