iipc / jwarc

Java library for reading and writing WARC files with a typed API
Apache License 2.0

Implemented github-73. Keeping WARC payload digest unchanged for CDX #74

thomasegense closed 1 year ago

thomasegense commented 1 year ago

This is my attempt to add support for CDX indexing that uses the payload digest from the WARC header as-is, without base64-encoding it.

To enable it, I added two equivalent arguments to the CDX indexer, a short form and a long form:

case "-d": case "--digest-unchanged":

There is a unit test as well.
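For readers unfamiliar with this style of option handling, here is a minimal sketch of how a switch case like the one above might set a flag while collecting the remaining arguments as input files. The class IndexerArgs and its fields are hypothetical names made up for illustration; this is not the actual jwarc indexer code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the indexer's argument handling; not jwarc's real code.
class IndexerArgs {
    // True when the digest should be written to the CDX exactly as found in the WARC header.
    boolean digestUnchanged = false;
    // Everything that is not a recognised option is treated as an input file.
    final List<String> files = new ArrayList<>();

    static IndexerArgs parse(String[] args) {
        IndexerArgs parsed = new IndexerArgs();
        for (String arg : args) {
            switch (arg) {
                // Both spellings fall through to the same branch.
                case "-d": case "--digest-unchanged":
                    parsed.digestUnchanged = true;
                    break;
                default:
                    parsed.files.add(arg);
            }
        }
        return parsed;
    }
}
```

Calling IndexerArgs.parse(new String[]{"-d", "example.warc"}) would set digestUnchanged and leave example.warc as the only input file.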

ato commented 1 year ago

Thanks! Released as 0.24.0.

This sparked some ideas for refactoring so I added a few follow-up commits:

  1. I replaced WarcTargetRecord.payloadDigestUnchanged() with a WarcDigest.raw() method. This way we can get at the unchanged value everywhere else that returns a WarcDigest too (e.g. block digests, WarcPayload.digest()). I also added a .raw() method on MediaType.

  2. I added a builder to CdxFormat so that we don't mutate the singleton CDX9-11 objects.

  3. I centralized the CDX field formatting code in CdxFormat instead of the way I previously had it awkwardly split between CdxFields and CdxFormat. This means we don't have to pass the formatting options into CdxFields.format().
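The three refactorings above can be illustrated with a rough sketch. It is not the jwarc implementation: DigestSketch and FormatSketch are hypothetical stand-ins for WarcDigest and CdxFormat, and the uppercase "normalisation" is only a placeholder for whatever re-encoding the real indexer performs. The sketch shows a parsed value that keeps its original header text available via raw(), an immutable format object configured through a builder instead of by mutating a shared instance, and the formatting decision living in the format object itself.

```java
import java.util.Locale;

// Hypothetical stand-in for a parsed digest header of the form "algorithm:value".
final class DigestSketch {
    private final String raw;
    private final String algorithm;
    private final String value;

    DigestSketch(String headerValue) {
        int colon = headerValue.indexOf(':');
        if (colon < 0) throw new IllegalArgumentException("expected algorithm:value");
        this.raw = headerValue; // kept exactly as it appeared in the header
        this.algorithm = headerValue.substring(0, colon).toLowerCase(Locale.ROOT);
        this.value = headerValue.substring(colon + 1);
    }

    String raw() { return raw; }
    String algorithm() { return algorithm; }
    String value() { return value; }
}

// Hypothetical stand-in for an immutable output format configured via a builder.
final class FormatSketch {
    // Shared default instance; it is never mutated, so it is safe to reuse everywhere.
    static final FormatSketch DEFAULT = new Builder().build();

    private final boolean digestUnchanged;

    private FormatSketch(Builder builder) {
        this.digestUnchanged = builder.digestUnchanged;
    }

    // The formatting decision lives here, so callers do not pass options around.
    String formatDigest(DigestSketch digest) {
        if (digestUnchanged) {
            return digest.raw();
        }
        // Placeholder normalisation (assumption): the real re-encoding is not shown here.
        return digest.value().toUpperCase(Locale.ROOT);
    }

    static final class Builder {
        private boolean digestUnchanged = false;

        Builder digestUnchanged(boolean digestUnchanged) {
            this.digestUnchanged = digestUnchanged;
            return this;
        }

        FormatSketch build() {
            return new FormatSketch(this);
        }
    }
}
```

With this shape, new FormatSketch.Builder().digestUnchanged(true).build() yields a separate format object, and FormatSketch.DEFAULT keeps its original behaviour for every other caller.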