ukwa / ukwa-manage

Shepherding our web archives from crawl to access.
Apache License 2.0

Allow direct indexing of WACZ files #113

Open anjackson opened 10 months ago

anjackson commented 10 months ago

It might make more sense to integrate this into py-wacz, which has cdxj-indexer as a dependency.

For example, follow how py-wacz validation goes through the indexes (https://specs.webrecorder.net/wacz/1.1.1/#indexes), and grab the ZIP offsets from the file to work out the whole-file offsets (https://stackoverflow.com/questions/44799018/how-to-get-offset-values-of-all-files-or-given-filename-in-a-zipfile-using-pyt).
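
A rough sketch of the offset arithmetic, using only the Python standard library. This is not py-wacz code: the function names `member_data_offsets` and `whole_file_offset` are made up for illustration, and it assumes the WARCs are stored in the ZIP without compression, so a record offset taken from the index can simply be added to the offset where the member's data starts. The local file header is read directly because its name and extra-field lengths can differ from the central directory's copy.

```python
import io
import struct
import zipfile

LOCAL_HEADER_FIXED_SIZE = 30          # fixed-size part of a ZIP local file header
LOCAL_HEADER_SIGNATURE = b"PK\x03\x04"


def member_data_offsets(fileobj):
    """Map each stored (uncompressed) ZIP member to the offset of its data.

    Compressed members are skipped, since offsets into their content
    do not correspond to byte positions in the ZIP file.
    """
    with zipfile.ZipFile(fileobj) as zf:
        infos = [i for i in zf.infolist() if i.compress_type == zipfile.ZIP_STORED]

    offsets = {}
    for info in infos:
        fileobj.seek(info.header_offset)
        header = fileobj.read(LOCAL_HEADER_FIXED_SIZE)
        if header[:4] != LOCAL_HEADER_SIGNATURE:
            raise ValueError(f"No local file header found for {info.filename}")
        # Bytes 26-29 hold the file name length and extra field length.
        name_len, extra_len = struct.unpack("<HH", header[26:30])
        offsets[info.filename] = (
            info.header_offset + LOCAL_HEADER_FIXED_SIZE + name_len + extra_len
        )
    return offsets


def whole_file_offset(offsets, member_name, record_offset):
    """Turn an offset within a member (e.g. a WARC) into an offset in the ZIP."""
    return offsets[member_name] + record_offset


if __name__ == "__main__":
    # Build a tiny ZIP in memory that stands in for a WACZ file.
    payload = b"WARC/1.1\r\nplaceholder record\r\n"
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("archive/data.warc", payload)

    start = member_data_offsets(buf)["archive/data.warc"]
    # The bytes at the computed offset should be the member's own content.
    print(buf.getvalue()[start:start + len(payload)] == payload)
```

With offsets like these, each index entry could be rewritten to point at the WACZ file itself rather than at the WARC inside it, which is what a CDX server would need in order to fetch records straight from the ZIP.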

Unit tests could follow the pattern of https://github.com/webrecorder/py-wacz/blob/47b3eefbaa8f70d839a048cc3d36d7014de06c2c/tests/test_validate_wacz.py

Validating the approach should include indexing POST requests in OutbackCDX; see https://github.com/nla/outbackcdx/issues/106#issuecomment-1567980236

anjackson commented 9 months ago

Basic initial implementation now at https://github.com/webrecorder/py-wacz/pull/38