Browsertrix: support multi-wacz crawls

starlinglab / integrity-v2

Monorepo for the next iteration of Starling Lab's integrity pipeline.

MIT License

After chatting on Discord, it looks like we can get a single-file "multi-wacz" by using the download API instead of the URL. This should solve the underlying semantic problem of how to deal with multiple WACZ files for one crawl. Code changes are still needed to actually use this API.

The problem is that for a crawl with a single WACZ, this means we are downloading an unnecessary wrapper file: the multi-wacz, which is a ZIP containing a WACZ. ReplayWeb.page supports this format, but is this what we want to standardize on internally? Idk...
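
One possible way to avoid storing the wrapper in the single-WACZ case is to unwrap it after download. The sketch below is only an illustration, not existing pipeline code: the function names are hypothetical, and it assumes the inner files sit inside the wrapper ZIP with a `.wacz` extension. If the wrapper holds exactly one WACZ, it returns that file's bytes; otherwise it returns the multi-wacz unchanged.

```python
import io
import zipfile


def inner_wacz_names(multi_wacz: zipfile.ZipFile) -> list[str]:
    """Return the names of the .wacz entries inside a multi-wacz ZIP.

    Assumption: inner WACZ files are identified by their extension.
    """
    return [n for n in multi_wacz.namelist() if n.lower().endswith(".wacz")]


def unwrap_if_single(data: bytes) -> bytes:
    """Return the inner WACZ if the wrapper holds exactly one, else the wrapper.

    `data` is the raw bytes of the downloaded multi-wacz file.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as outer:
        names = inner_wacz_names(outer)
        if len(names) == 1:
            # Single-WACZ crawl: drop the wrapper and keep the plain WACZ.
            return outer.read(names[0])
    # Zero or several WACZ files: keep the multi-wacz as-is.
    return data
```

The tradeoff is that we would then have two formats internally (plain WACZ and multi-wacz) instead of one, so this only makes sense if we decide not to standardize on the wrapper.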

Browsertrix: support multi-wacz crawls #59