iipc / jwarc

Java library for reading and writing WARC files with a typed API
Apache License 2.0

CDX indexer: Keep calculated digest from WARC header #73

Closed thomasegense closed 1 year ago

thomasegense commented 1 year ago

The CDX indexer re-encodes the digest from the WARC header as Base32 (the output below is Base32, not Base64).

This WARC header:

WARC-Payload-Digest: sha256:b04af472c47a8b1b5059b3404caac0e1bfb5a3c07b329be66f65cfab5ee8d3f3

will result in this digest from the cdx-indexer:

WBFPI4WEPKFRWUCZWNAEZKWA4G73LI6APMZJXZTPMXH2WXXI2PZQ====

This is also inconsistent with what the pywb cdx-indexer does.

Proposed fix: add an option to keep the digest as is when building the CDX index.
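For illustration, here is a minimal, self-contained sketch of the two behaviours discussed above: re-encoding a hex digest as Base32 versus keeping the header value untouched. This is not jwarc's actual code. The class `DigestEncoding`, the method `cdxDigest` and its `keepAsIs` flag are hypothetical names, and the sketch assumes the header value has the form `algorithm:hexdigest`.

```java
public class DigestEncoding {
    // RFC 4648 Base32 alphabet
    private static final String BASE32_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";

    /** Decodes an even-length hex string into raw bytes. */
    public static byte[] hexToBytes(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** Encodes bytes as RFC 4648 Base32 with '=' padding. */
    public static String base32(byte[] data) {
        StringBuilder sb = new StringBuilder();
        int buffer = 0; // holds bits not yet emitted
        int bits = 0;   // number of valid low-order bits in buffer
        for (byte b : data) {
            buffer = (buffer << 8) | (b & 0xFF);
            bits += 8;
            while (bits >= 5) {
                sb.append(BASE32_ALPHABET.charAt((buffer >> (bits - 5)) & 31));
                bits -= 5;
            }
            buffer &= (1 << bits) - 1; // drop bits already emitted
        }
        if (bits > 0) {
            // left-align the remaining bits in a final 5-bit group
            sb.append(BASE32_ALPHABET.charAt((buffer << (5 - bits)) & 31));
        }
        while (sb.length() % 8 != 0) {
            sb.append('=');
        }
        return sb.toString();
    }

    /**
     * Hypothetical helper: returns the digest field for a CDX line.
     * With keepAsIs the header value is returned unchanged; otherwise
     * the hex part after the colon is re-encoded as Base32.
     */
    public static String cdxDigest(String headerValue, boolean keepAsIs) {
        if (keepAsIs) {
            return headerValue;
        }
        String hex = headerValue.substring(headerValue.indexOf(':') + 1);
        return base32(hexToBytes(hex));
    }

    public static void main(String[] args) {
        String header = "sha256:b04af472c47a8b1b5059b3404caac0e1bfb5a3c07b329be66f65cfab5ee8d3f3";
        // prints WBFPI4WEPKFRWUCZWNAEZKWA4G73LI6APMZJXZTPMXH2WXXI2PZQ====
        System.out.println(cdxDigest(header, false));
        // prints the header value unchanged
        System.out.println(cdxDigest(header, true));
    }
}
```

Running `main` with the header from this issue reproduces the Base32 string reported above, which confirms that the indexer output is the same SHA-256 digest in a different encoding rather than a different hash.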

ato commented 1 year ago

Fixed by #74.