openpreserve / jhove

File validation and characterisation.
http://jhove.openpreservation.org

JHOVE (WARC-kb) gives different results compared to JWAT #632

Open jacobtakema opened 4 years ago

jacobtakema commented 4 years ago

Issue: JHOVE WARC-KB module gives different results compared to JWAT

Because the JHOVE WARC-KB module uses the JWAT-WARC library, I would expect the two tools to give similar results.

For example, I ran the WARC file LUX-004-TEST-2017-12-18-20171220042523987-00100-16828_wbgrp-crawl007.us.archive.org_8443.warc through both tools.

Running JHOVE 1.22 with the WARC-KB module gives 84 errors with the message 'Incorrect payload digest'.

Running JWAT 0.6.6 gives: INVALID_EXPECTED: 66, REQUIRED_INVALID: 44, 'WARC-Target-URI' value: 110.

So JWAT reports only 'WARC-Target-URI' messages (110 of them), while JHOVE reports only 'Incorrect payload digest' errors (84 of them).

These results differ significantly.

Note that JWAT-Tools 0.6.6 contains JWAT-warc v1.11, while JHOVE 1.22 contains JWAT-warc 1.0.3.

So what is causing these (totally) different results?

nclarkekb commented 2 years ago

Well, those 110 errors in jwat-tools occur because the -l (relaxed URI) option is not used by default. Presumably, relaxed URI validation is the default in the JHOVE module.

nclarkekb commented 2 years ago

As for the digest: it is not computed correctly, since one of the digest values is the digest of an empty string/byte array. http://craiccomputing.blogspot.com/2009/09/sha1-digest-of-empty-string.html
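
For anyone who wants to check a digest value by hand, here is a minimal Python sketch (not part of JHOVE or JWAT). It computes the SHA-1 of an empty byte array and tests whether a given digest value matches it. It assumes the header value has the form `sha1:<digest>` with the digest in Base32 or hex, which is a common convention in WARC files but may not hold for every file. The function names are hypothetical.

```python
import base64
import hashlib
from typing import Tuple


def sha1_digests(payload: bytes) -> Tuple[str, str]:
    """Return the SHA-1 of payload as (lowercase hex, Base32) strings."""
    raw = hashlib.sha1(payload).digest()
    return raw.hex(), base64.b32encode(raw).decode("ascii")


def is_empty_payload_digest(header_value: str) -> bool:
    """Check whether a digest value such as 'sha1:<digest>' equals the
    SHA-1 of zero bytes. The 'sha1:' prefix is an assumed format."""
    algorithm, _, value = header_value.strip().partition(":")
    if algorithm.lower() != "sha1":
        return False
    empty_hex, empty_b32 = sha1_digests(b"")
    # Accept either encoding: hex is case-insensitive, Base32 is uppercase.
    return value.lower() == empty_hex or value.upper() == empty_b32


if __name__ == "__main__":
    hex_digest, b32_digest = sha1_digests(b"")
    print("hex:   ", hex_digest)
    print("base32:", b32_digest)
```

If a digest value in a record with a non-empty payload passes `is_empty_payload_digest`, that digest was most likely computed over zero bytes rather than over the actual payload.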