thomasegense closed 1 year ago
Thanks! Released as 0.24.0.
This sparked some ideas for refactoring so I added a few follow-up commits:
I replaced WarcTargetRecord.payloadDigestUnchanged() with a WarcDigest.raw() method. This way we can get at the unchanged value everywhere else that returns a WarcDigest too (e.g. block digests, WarcPayload.digest()). I also added a .raw() method to MediaType.
I added a builder to CdxFormat so that we don't mutate the singleton CDX9-11 objects.
I centralized the CDX field formatting code in CdxFormat instead of the way I previously had it awkwardly split between CdxFields and CdxFormat. This means we don't have to pass the formatting options into CdxFields.format().
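To illustrate the builder point above, here is a minimal sketch of the general pattern: shared preset objects stay immutable, and a variant is derived through a builder instead of mutating the preset. `SketchFormat`, its fields, and `toBuilder()` are hypothetical names made up for this example, and the legend string is only sample data; this is not the actual CdxFormat API.

```java
import java.util.Objects;

// Hypothetical stand-in for a format class; NOT jwarc's real CdxFormat.
final class SketchFormat {
    // Shared preset, analogous to a predefined format constant.
    // The legend value is just example data for the sketch.
    static final SketchFormat CDX11 =
            new Builder().legend("N b a m s k r M S V g").build();

    private final String legend;
    private final boolean digestUnchanged;

    private SketchFormat(Builder b) {
        this.legend = b.legend;
        this.digestUnchanged = b.digestUnchanged;
    }

    String legend() { return legend; }

    boolean digestUnchanged() { return digestUnchanged; }

    // Copies this format's settings into a fresh builder, so callers can
    // derive a variant without touching the shared instance.
    Builder toBuilder() {
        return new Builder().legend(legend).digestUnchanged(digestUnchanged);
    }

    static final class Builder {
        private String legend = "";
        private boolean digestUnchanged = false;

        Builder legend(String legend) {
            this.legend = Objects.requireNonNull(legend);
            return this;
        }

        Builder digestUnchanged(boolean digestUnchanged) {
            this.digestUnchanged = digestUnchanged;
            return this;
        }

        SketchFormat build() {
            return new SketchFormat(this);
        }
    }
}
```

With this shape, `SketchFormat.CDX11.toBuilder().digestUnchanged(true).build()` yields a new object with the option set, while the shared `CDX11` instance keeps its defaults.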
This is my attempt to add support for CDX indexing that uses the payload digest from the WARC header as-is, without base64 encoding it.
To enable it, I added two additional arguments to the CDX indexer:
case "-d": case "--digest-unchanged":
There is a unit test as well.
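For readers who want to see how a pair of case labels like the one above typically maps onto an option, here is a small self-contained sketch. `SketchOptions` and its fields are hypothetical names for illustration only, and treating every other argument as a file name is an assumption of the sketch; this is not the actual indexer code from the PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical option holder; NOT the real CDX indexer tool class.
final class SketchOptions {
    boolean digestUnchanged = false;
    final List<String> files = new ArrayList<>();

    static SketchOptions parse(String... args) {
        SketchOptions opts = new SketchOptions();
        for (String arg : args) {
            switch (arg) {
                // The short and long spellings share one branch.
                case "-d":
                case "--digest-unchanged":
                    opts.digestUnchanged = true;
                    break;
                // Assumption for this sketch: anything else is an input file.
                default:
                    opts.files.add(arg);
                    break;
            }
        }
        return opts;
    }
}
```

Here `SketchOptions.parse("-d", "example.warc")` and `SketchOptions.parse("--digest-unchanged", "example.warc")` both set the flag and collect `example.warc` as the only input file.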