Browsertrix Crawler already generates a CDXJ index for each WARC, so reusing these existing indices would be faster than indexing the WARCs again. We propose adding a --cdxj CLI argument that accepts a directory of existing CDXJ files, similar to how --pages already works.
I have a PR in progress; it just needs a bit more testing, and I will submit it shortly. Thanks!
Related to https://github.com/webrecorder/browsertrix-crawler/issues/484
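To illustrate the idea behind the proposed --cdxj option, here is a minimal sketch of combining a directory of per-WARC CDXJ files into a single sorted index without touching the WARCs. This is not the code from the PR: the function names `iter_cdxj_lines` and `merge_cdxj_dir`, the `*.cdxj` glob, and the assumption that each input file is already sorted and uncompressed are all hypothetical choices made for illustration.

```python
import heapq
from pathlib import Path
from typing import Iterator, List


def iter_cdxj_lines(path: Path) -> Iterator[str]:
    """Yield the non-empty lines of one CDXJ file, without trailing newlines."""
    with path.open("r", encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line:
                yield line


def merge_cdxj_dir(cdxj_dir: str) -> List[str]:
    """Merge every *.cdxj file in cdxj_dir into one sorted list of lines.

    Assumption: each input file is already sorted, so heapq.merge can
    interleave them without re-sorting the full set of entries.
    """
    files = sorted(Path(cdxj_dir).glob("*.cdxj"))
    return list(heapq.merge(*(iter_cdxj_lines(f) for f in files)))
```

Calling `merge_cdxj_dir("path/to/indexes")` on a directory of sorted CDXJ files would return the combined entries in sorted order; since the input is a set of small text indices rather than full WARCs, this avoids re-reading the archived records.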