Open machawk1 opened 3 years ago
WACZ files can be interpreted as ZIP files with a defined structure. The targets for ipwb (WARCs) are in /archive. Thus, the WACZ file should be read, interpreted as a ZIP, the WARC files in /archive extracted, and said files sent to the ipwb indexer.
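A minimal sketch of the extraction step described above, using only the standard library. The function name extract_warcs and the suffix list are hypothetical and not part of ipwb; handing the returned paths to the ipwb indexer is left out because that interface is not shown here.

```python
import shutil
import zipfile
from pathlib import Path

# Assumption: WARCs inside a WACZ end in .warc or .warc.gz
WARC_SUFFIXES = (".warc", ".warc.gz")


def extract_warcs(wacz_path, dest_dir):
    """Copy every WARC under archive/ in the WACZ into dest_dir.

    Returns the list of extracted file paths, in archive order.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(wacz_path) as wacz:
        for info in wacz.infolist():
            name = info.filename
            # Only regular files below archive/ are of interest
            if info.is_dir() or not name.startswith("archive/"):
                continue
            if not name.lower().endswith(WARC_SUFFIXES):
                continue
            # Use the base name only, so an entry cannot write outside dest_dir
            target = dest / Path(name).name
            with wacz.open(info) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out)
            extracted.append(target)
    return extracted
```

Each returned path could then be passed to the indexer the same way a standalone WARC is today. Flattening to the base name means two entries with the same file name in different subdirectories of archive/ would overwrite each other; that is acceptable for a sketch but worth handling in a real implementation.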
In the future, we may want to consider the additional context that WACZ provides.
Sample WACZ https://play.archipelago.nyc/do/10/iiif/3546d9bd-a25c-4ba1-b96f-29411c0d752a/full/full/0/etd.wacz
Preliminary support added in 779978a. WACZ detection should be improved, but importing py-wacz incurs other dependencies due to pywb coupling.
Also, is_zipfile() fails on WACZ files due to the magic number (signature) not matching that of a ZIP file.
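One possible way to detect a WACZ without py-wacz and without relying on is_zipfile() is to try opening the file with zipfile.ZipFile and inspect the entry names. This is a sketch only: the helper name looks_like_wacz is hypothetical, and the check for a top-level datapackage.json is an assumption about the WACZ layout that should be verified against the format specification.

```python
import zipfile


def looks_like_wacz(path):
    """Heuristically decide whether path is a WACZ file.

    Returns False if the file cannot be opened as a ZIP at all.
    """
    try:
        with zipfile.ZipFile(path) as zf:
            names = zf.namelist()
    except (zipfile.BadZipFile, OSError):
        return False
    # Assumption: a WACZ carries a datapackage.json manifest at its root
    has_manifest = "datapackage.json" in names
    # At least one regular file must live under archive/
    has_archive = any(
        n.startswith("archive/") and not n.endswith("/") for n in names
    )
    return has_manifest and has_archive
```

Because this opens the archive rather than sniffing leading bytes, the same code path that detects the WACZ could also go on to read its contents.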
In 9436999, I created a wacz using:
wacz create -o ./samples/wacz/my-collection.wacz ./samples/warcs/5mementos.warc ./samples/warcs/froggie.warc.gz ./samples/warcs/salam-home.warc
...which produces a 79 KB file. Attempting to replay this in https://replayweb.page/ shows no URLs in the interface.
wacz create -o ./samples/wacz/my-collection.wacz ./samples/warcs/5mementos.warc ./samples/warcs/froggie.warc.gz ./samples/warcs/salam-home.warc
The command should now include a -f before the WARC files (and should probably have better arg validation).
try:
wacz create -o ./samples/wacz/my-collection.wacz -f ./samples/warcs/5mementos.warc ./samples/warcs/froggie.warc.gz ./samples/warcs/salam-home.warc
@ikreymer Thanks for your proactive feedback here. I ran:
wacz create -o ./samples/wacz/my-collection.wacz -f ./samples/warcs/5mementos.warc ./samples/warcs/froggie.warc.gz ./samples/warcs/salam-home.warc
and a 79 KB file my-collection.wacz.zip (.zip added only for GitHub upload) was generated. This WACZ does not cause any URLs to be recognized in replayweb.page.
wacz 0.4.6 installed via pypi, macOS 12.3.1, Python 3.10.4
EDIT: When decompressing the WACZ, the WARCs are present. Perhaps pywb is having an issue replaying them -- they were not created w/ the webrecorder stack.
EDIT2: Uploading the WARCs directly to replayweb.page produces the same result -- no URL is shown in the interface. A next step will be to try these WARCs in pywb directly to see if any errors are reported.
EDIT3: warcio seems to work ok with these WARCs, for example:
from warcio.archiveiterator import ArchiveIterator

with open('./samples/warcs/5mementos.warc', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            print(record.rec_headers.get_header('WARC-Target-URI'))
produces:
http://memento.us/
http://memento.us/
http://memento.us/
http://memento.us/
http://memento.us/
http://someotherURI.us/
http://anothersite.us/
Base test added in 25e91ad but GH Action is reporting service issues.
Per a discussion w/ Mark G. @ IA, WACZ is supported at web-beta.archive.org/save for those with a "beta" account (which I have).
Via @ikreymer, Web Archive Collection Zipped (WACZ) Format, https://github.com/webrecorder/wacz-format (MIT, potentially reusable)
Example of MDN WACZ at https://twitter.com/webrecorder_io/status/1293730279824089088
https://dh-preserve.sfo2.cdn.digitaloceanspaces.com/webarchives/mdn.wacz (1.6GB)
Finalizing Issue #604 (resolving #631) would be conducive here depending on the WACZ's contents. Also, hosting some larger WARCs remotely like this, because they are beyond the size restrictions on GitHub, could serve as the means for testing scalability.