harvard-lil / js-wacz

JavaScript module and CLI tool for working with web archive data using the WACZ format specification.
MIT License

Add option to use existing CDXJ indices rather than indexing from WARCs #88

Closed tw4l closed 8 months ago

tw4l commented 8 months ago

Related to https://github.com/webrecorder/browsertrix-crawler/issues/484

In Browsertrix Crawler, we already generate a CDXJ index for each WARC, so reusing those existing indices would be faster than indexing the WARCs again. We propose adding a --cdxj CLI argument that takes a directory of existing CDXJ files, similar to how --pages already works.
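
As a rough illustration of the idea (not the code in the PR, and not js-wacz's actual API), here is a minimal Node.js sketch that collects index lines from a directory of existing CDXJ files instead of reading any WARCs. The helper names `parseCdxjLine` and `loadCdxjDirectory` are hypothetical, and the sketch assumes each line follows the usual CDXJ layout of a sortable URL key, a timestamp, and a JSON block separated by single spaces.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper: split one CDXJ line into its three parts,
// "<sortable URL key> <timestamp> <JSON block>".
function parseCdxjLine(line) {
  const firstSpace = line.indexOf(" ");
  const secondSpace = line.indexOf(" ", firstSpace + 1);
  if (firstSpace === -1 || secondSpace === -1) {
    throw new Error(`Malformed CDXJ line: ${line}`);
  }
  return {
    key: line.slice(0, firstSpace),
    timestamp: line.slice(firstSpace + 1, secondSpace),
    data: JSON.parse(line.slice(secondSpace + 1)),
  };
}

// Hypothetical helper: gather every entry from the .cdxj/.cdx files in `dir`
// and return the raw lines in plain string order, without touching any WARC.
function loadCdxjDirectory(dir) {
  const lines = [];
  for (const name of fs.readdirSync(dir)) {
    if (!/\.(cdxj|cdx)$/.test(name)) continue; // skip unrelated files
    const text = fs.readFileSync(path.join(dir, name), "utf8");
    for (const line of text.split("\n")) {
      const trimmed = line.trim();
      if (trimmed === "") continue; // skip blank lines
      parseCdxjLine(trimmed); // throws if the line is not valid CDXJ
      lines.push(trimmed);
    }
  }
  return lines.sort();
}
```

Called as `loadCdxjDirectory("path/to/indexes")`, this returns one merged, sorted list of index lines. The cost scales with the size of the index files rather than the size of the WARCs, which is where the expected speedup comes from.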

I have a PR in progress that just needs a bit more testing; I'll submit it shortly. Thanks!