Support indexing WACZ files

machawk1 commented 3 years ago

Via @ikreymer, Web Archive Collection Zipped (WACZ) Format, https://github.com/webrecorder/wacz-format (MIT, potentially reusable)

Example of MDN WACZ at https://twitter.com/webrecorder_io/status/1293730279824089088

https://dh-preserve.sfo2.cdn.digitaloceanspaces.com/webarchives/mdn.wacz (1.6GB)

Finalizing Issue #604 (resolving #631) would be conducive here depending on the WACZ's contents. Also, hosting some larger WARCs remotely like this, because they are beyond the size restrictions on GitHub, could serve as the means for testing scalability.

machawk1 commented 2 years ago

A WACZ file can be interpreted as a ZIP file with a defined structure. The targets for ipwb (the WARCs) are in /archive. Thus, the WACZ file should be read and interpreted as a ZIP, the WARC files in /archive extracted, and those files sent to the ipwb indexer.
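A minimal sketch of that extraction step, using only the standard library. It assumes the WACZ opens with zipfile.ZipFile and that the WARCs sit under an archive/ prefix with .warc or .warc.gz suffixes. The name extract_warcs is hypothetical and is not part of ipwb.

```python
import tempfile
import zipfile
from pathlib import Path

# Assumption: WARCs in a WACZ use one of these suffixes.
WARC_SUFFIXES = ('.warc', '.warc.gz')


def extract_warcs(wacz_path, dest_dir=None):
    """Extract the WARC files under archive/ from a WACZ and return their paths.

    Hypothetical helper for illustration; not an ipwb function.
    """
    # Fall back to a fresh temporary directory when no destination is given.
    dest = Path(dest_dir or tempfile.mkdtemp(prefix='wacz_'))
    extracted = []
    with zipfile.ZipFile(wacz_path) as wacz:
        for name in wacz.namelist():
            # Only members inside archive/ that look like WARCs are extracted.
            if name.startswith('archive/') and name.endswith(WARC_SUFFIXES):
                extracted.append(Path(wacz.extract(name, path=dest)))
    return extracted
```

Each returned path could then be handed to the existing WARC indexing code unchanged, which keeps WACZ handling as a thin pre-processing step.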

In the future, we may want to consider the additional context that WACZ provides.

Sample WACZ https://play.archipelago.nyc/do/10/iiif/3546d9bd-a25c-4ba1-b96f-29411c0d752a/full/full/0/etd.wacz

machawk1 commented 2 years ago

Preliminary support added in 779978a. WACZ detection should be improved, but importing py-wacz incurs other dependencies due to its pywb coupling.
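One dependency-free direction for detection is to check the container's structure rather than import py-wacz. The sketch below is a heuristic, not a validator. It assumes the file opens with zipfile.ZipFile and that a WACZ holds a datapackage.json at its root plus at least one file under archive/, as the wacz-format repository describes. The name looks_like_wacz is hypothetical.

```python
import zipfile


def looks_like_wacz(path):
    """Heuristic check: a readable ZIP with datapackage.json and archive/ files.

    Hypothetical helper for illustration; it does not validate the WACZ.
    """
    try:
        with zipfile.ZipFile(path) as candidate:
            names = candidate.namelist()
    except (zipfile.BadZipFile, OSError):
        # Not a ZIP that Python can read, or the file is missing or unreadable.
        return False
    has_manifest = 'datapackage.json' in names
    # Ignore bare directory entries; require at least one file in archive/.
    has_archive = any(
        n.startswith('archive/') and not n.endswith('/') for n in names
    )
    return has_manifest and has_archive
```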

machawk1 commented 2 years ago

Also, is_zipfile() fails on WACZ files due to the magic number (signature) not matching that of a ZIP file.

machawk1 commented 2 years ago

In 9436999, I created a WACZ using:

wacz create -o ./samples/wacz/my-collection.wacz ./samples/warcs/5mementos.warc ./samples/warcs/froggie.warc.gz ./samples/warcs/salam-home.warc

...which produces a 79 KB file. Attempting to replay this in https://replayweb.page/ shows no URLs in the interface.

ikreymer commented 2 years ago

wacz create -o ./samples/wacz/my-collection.wacz ./samples/warcs/5mementos.warc ./samples/warcs/froggie.warc.gz ./samples/warcs/salam-home.warc

The command should now include a -f before the WARC files (and should probably have better arg validation). Try:

 wacz create -o ./samples/wacz/my-collection.wacz -f ./samples/warcs/5mementos.warc ./samples/warcs/froggie.warc.gz ./samples/warcs/salam-home.warc

machawk1 commented 2 years ago

@ikreymer Thanks for your proactive feedback here. I ran:

wacz create -o ./samples/wacz/my-collection.wacz -f ./samples/warcs/5mementos.warc ./samples/warcs/froggie.warc.gz ./samples/warcs/salam-home.warc

and a 79 KB file my-collection.wacz.zip (.zip added only for GitHub upload) was generated. This WACZ does not cause any URLs to be recognized in replayweb.page.

wacz 0.4.6 installed via pypi, macOS 12.3.1, Python 3.10.4

- -

EDIT: When decompressing the WACZ, the WARCs are present. Perhaps pywb is having an issue replaying them -- they were not created w/ the webrecorder stack.

EDIT2: Uploading the WARCs directly to replayweb.page produces the same result -- no URL is shown in the interface. A next step will be to try these WARCs in pywb directly to see if any errors are reported.

EDIT3: warcio seems to work ok with these WARCs, for example:

from warcio.archiveiterator import ArchiveIterator

# Print the target URI of every response record in the WARC
with open('./samples/warcs/5mementos.warc', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            print(record.rec_headers.get_header('WARC-Target-URI'))

produces:

http://memento.us/
http://memento.us/
http://memento.us/
http://memento.us/
http://memento.us/
http://someotherURI.us/
http://anothersite.us/

machawk1 commented 2 years ago

Base test added in 25e91ad but GH Action is reporting service issues.

machawk1 commented 1 year ago

Per a discussion w/ Mark G. @ IA, WACZ is supported at web-beta.archive.org/save for those with a "beta" account (which I have).

oduwsdl / ipwb

Support indexing WACZ files #710