iipc / jwarc

Java library for reading and writing WARC files with a typed API
Apache License 2.0

Add validate tool #60

Closed. sebastian-nagel closed this 3 years ago

sebastian-nagel commented 3 years ago

Add a tool to validate WARC/ARC files. It verifies digests, Content-Length headers, media type syntax, and HTTP header syntax (reporting parse errors). It also reads the payload to trigger any errors that occur when decoding the transfer or content encoding.
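
To illustrate two of the checks listed above, here is a minimal sketch of a media type syntax check and a Content-Length check. It uses only the JDK and is not the code in this PR; the class and method names are hypothetical, and the grammar is an approximation of the HTTP `media-type` rule (`token "/" token` followed by optional `;name=value` parameters).

```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of jwarc.
final class HeaderChecksSketch {
    // HTTP "token" characters (assumed to follow the RFC 7230 tchar set).
    private static final String TOKEN = "[!#$%&'*+.^_`|~0-9A-Za-z-]+";
    // Simplified quoted-string: any characters, with backslash escapes.
    private static final String QUOTED = "\"(?:[^\"\\\\]|\\\\.)*\"";
    private static final Pattern MEDIA_TYPE = Pattern.compile(
            TOKEN + "/" + TOKEN
            + "(?:[ \\t]*;[ \\t]*" + TOKEN + "=(?:" + TOKEN + "|" + QUOTED + "))*");

    /** Returns true if the value looks like type/subtype with optional parameters. */
    static boolean isValidMediaType(String value) {
        return value != null && MEDIA_TYPE.matcher(value).matches();
    }

    /** Returns true if the header is a plain decimal number equal to the body length. */
    static boolean contentLengthMatches(String headerValue, long actualLength) {
        if (headerValue == null) return false;
        String trimmed = headerValue.trim();
        if (!trimmed.matches("[0-9]+")) return false; // rejects signs, blanks, garbage
        try {
            return Long.parseLong(trimmed) == actualLength;
        } catch (NumberFormatException e) {
            return false; // digits only, but too large for a long
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidMediaType("application/http;msgtype=response"));
        System.out.println(isValidMediaType("text/ html"));
        System.out.println(contentLengthMatches("5392", 5392L));
    }
}
```

A real validator would probably report which rule failed rather than returning a boolean; the boolean form just keeps the sketch short.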

Note: #59 is required to properly verify digests.
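
For context, digest verification amounts to recomputing the digest over the block or payload bytes and comparing it with the declared `algorithm:value` string. The sketch below assumes the common `sha1:` label with a Base32-encoded value and a case-insensitive comparison; it uses only the JDK, the names are hypothetical, and it does not reflect jwarc's own digest classes.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class for illustration; not part of jwarc.
final class LabelledDigestSketch {
    private static final String BASE32_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";

    /** Base32 (RFC 4648 alphabet) without padding. */
    static String base32(byte[] data) {
        StringBuilder sb = new StringBuilder();
        int buffer = 0; // bit accumulator; only the low `bits` bits are meaningful
        int bits = 0;
        for (byte b : data) {
            buffer = (buffer << 8) | (b & 0xff);
            bits += 8;
            while (bits >= 5) {
                sb.append(BASE32_ALPHABET.charAt((buffer >> (bits - 5)) & 31));
                bits -= 5;
            }
        }
        if (bits > 0) { // final partial group, padded with zero bits
            sb.append(BASE32_ALPHABET.charAt((buffer << (5 - bits)) & 31));
        }
        return sb.toString();
    }

    /** Computes the "sha1:<base32>" label for the given bytes. */
    static String sha1Label(byte[] content) {
        try {
            return "sha1:" + base32(MessageDigest.getInstance("SHA-1").digest(content));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-1 is required on every JRE
        }
    }

    /** Returns true if the declared labelled digest matches the content. */
    static boolean digestMatches(String declared, byte[] content) {
        int colon = declared.indexOf(':');
        if (colon < 0) return false;
        // Assumption: this sketch only knows sha1 and treats other labels as a mismatch.
        if (!declared.substring(0, colon).equalsIgnoreCase("sha1")) return false;
        String expected = sha1Label(content).substring("sha1:".length());
        return expected.equalsIgnoreCase(declared.substring(colon + 1));
    }

    public static void main(String[] args) {
        System.out.println(sha1Label(new byte[0]));
    }
}
```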

$> java -cp target/jwarc-*.jar org.netpreserve.jwarc.tools.WarcTool validate

ValidateTool [-h] [-v] filename...

Options:

 -h / --help    show usage message and exit
 -v / --verbose log information about every WARC record to stdout

Exit value is 0 if all WARC files validate, 1 otherwise.
Errors and erroneous WARC records are logged to stderr.
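
The exit-value contract described above can be sketched as follows. This is an illustration only, not the tool's implementation: the `check` function is a hypothetical stand-in for the per-file validation, and continuing after the first failing file is an assumption.

```java
import java.io.PrintStream;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical driver for illustration; not part of jwarc.
final class ExitStatusSketch {
    /**
     * Runs `check` on every file. `check` returns an error message, or empty
     * if the file validates. Returns 0 if all files validate, 1 otherwise.
     */
    static int run(List<String> filenames,
                   Function<String, Optional<String>> check,
                   PrintStream err) {
        int status = 0;
        for (String filename : filenames) {
            Optional<String> problem = check.apply(filename);
            if (problem.isPresent()) {
                err.println(filename + ": " + problem.get()); // errors go to stderr
                status = 1; // keep going so every failing file is reported
            }
        }
        return status;
    }

    public static void main(String[] args) {
        // Stand-in check: pretend that only names ending in ".warc.gz" validate.
        Function<String, Optional<String>> check = name ->
                name.endsWith(".warc.gz") ? Optional.empty() : Optional.of("not validated");
        System.exit(run(Arrays.asList(args), check, System.err));
    }
}
```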

$> java -cp target/jwarc-*.jar org.netpreserve.jwarc.tools.WarcTool validate -v test-resources/org/netpreserve/jwarc/cc.warc.gz 
Validating test-resources/org/netpreserve/jwarc/cc.warc.gz
  offset 0 (length 5392) response application/http;msgtype=response
    http://commoncrawl.org/
    HTTP/1.1 200 OK
    date: 2019-12-10T10:00:01Z
    payload media type: text/html
    payload digest pass
    block digest pass

sebastian-nagel commented 3 years ago

Yes, definitely. I'll update the PR accordingly.

ato commented 3 years ago

Thanks. Released as 0.15.0 with a couple of small tweaks. While testing it, I found that DigestingMessageBody was producing the wrong digest: it was digesting the unused remaining part of the buffer instead of the bytes that had just been read.
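
The kind of bug described above is easy to reproduce with a plain `ByteBuffer`. The sketch below is not the jwarc code; the class and method names are hypothetical. It assumes a wrapper that reads from a channel into a buffer and feeds a `MessageDigest`. After `read`, the buffer's position sits at the end of the new bytes, so digesting from the position to the limit covers the unused space rather than the data.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical reproduction for illustration; not part of jwarc.
final class DigestingReadSketch {
    /** Buggy: digests position..limit, i.e. the unused remainder of the buffer. */
    static int readBuggy(ReadableByteChannel in, ByteBuffer dst, MessageDigest md) throws IOException {
        int n = in.read(dst);
        if (n > 0) md.update(dst.duplicate());
        return n;
    }

    /** Fixed: digests exactly the n bytes that this read placed into the buffer. */
    static int readFixed(ReadableByteChannel in, ByteBuffer dst, MessageDigest md) throws IOException {
        int start = dst.position();
        int n = in.read(dst);
        if (n > 0) {
            ByteBuffer justRead = dst.duplicate(); // leave dst's position untouched
            justRead.position(start);
            justRead.limit(start + n);
            md.update(justRead);
        }
        return n;
    }

    static byte[] sha1(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-1").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Reads all of `data` through an 8-byte buffer and returns the resulting digest. */
    static byte[] digestViaReads(byte[] data, boolean buggy) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            ReadableByteChannel in = Channels.newChannel(new ByteArrayInputStream(data));
            ByteBuffer buf = ByteBuffer.allocate(8);
            while (true) {
                buf.clear();
                int n = buggy ? readBuggy(in, buf, md) : readFixed(in, buf, md);
                if (n < 0) break; // end of stream
            }
            return md.digest();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        System.out.println("fixed matches: " + Arrays.equals(digestViaReads(data, false), sha1(data)));
        System.out.println("buggy matches: " + Arrays.equals(digestViaReads(data, true), sha1(data)));
    }
}
```

The fixed variant works on a duplicate of the buffer so the caller still sees the position it expects after the read.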

sebastian-nagel commented 3 years ago

Thanks!