harvard-lil / js-wacz

JavaScript module and CLI tool for working with web archive data using the WACZ format specification.
MIT License

`indexWARC` worker: go back to single pass method #17

Closed matteocargnelutti closed 1 year ago

matteocargnelutti commented 1 year ago

In 0.0.6, we moved back to running page detection and CDX generation as two separate passes, after discovering an off-by-one error affecting the results. This workaround hurts performance (every WARC is iterated over twice instead of once), but it should only be temporary (the bug may be a simple programming mistake on my part).
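To illustrate the trade-off, here is a minimal sketch of the two approaches. It is not the actual `indexWARC` worker code: the record objects and the `isPage`, `toCDXEntry`, `indexTwoPass` and `indexSinglePass` helpers are hypothetical stand-ins, and a plain array replaces real WARC parsing.

```javascript
// Hypothetical stand-ins for parsed WARC records (assumption: a real worker
// would stream these out of WARC files instead of using an in-memory array).
const records = [
  { type: 'warcinfo', url: null, mime: null, offset: 0, length: 300 },
  { type: 'response', url: 'https://example.com/', mime: 'text/html', offset: 300, length: 1200 },
  { type: 'response', url: 'https://example.com/a.png', mime: 'image/png', offset: 1500, length: 800 }
]

// Hypothetical page heuristic: an HTML response counts as a page.
function isPage (record) {
  return record.type === 'response' && record.mime === 'text/html'
}

// Hypothetical, simplified CDX entry: only the fields needed for the sketch.
function toCDXEntry (record) {
  return { url: record.url, offset: record.offset, length: record.length }
}

// Two-pass method: pages and CDX entries are collected in separate iterations,
// so every record is visited twice.
function indexTwoPass (records) {
  const pages = []
  const cdx = []

  for (const record of records) {
    if (isPage(record)) pages.push(record.url)
  }

  for (const record of records) {
    if (record.type === 'response') cdx.push(toCDXEntry(record))
  }

  return { pages, cdx }
}

// Single-pass method: both outputs are built while visiting each record once.
function indexSinglePass (records) {
  const pages = []
  const cdx = []

  for (const record of records) {
    if (record.type !== 'response') continue
    cdx.push(toCDXEntry(record))
    if (isPage(record)) pages.push(record.url)
  }

  return { pages, cdx }
}
```

Both functions should return the same pages and CDX entries for the same input. The single-pass version only pays off if the record and its offset stay in sync during iteration, which is what the off-by-one error broke.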

Underlying issue: https://github.com/webrecorder/warcio.js/issues/52

matteocargnelutti commented 1 year ago

Solved at warcio.js level by:


Implemented in js-wacz by: