starlinglab / integrity-v2

Monorepo for the next iteration of Starling Lab's integrity pipeline.
MIT License

Browsertrix: support multi-wacz crawls #59

Open makew0rld opened 1 month ago

makew0rld commented 1 month ago

Multiple WACZs are created for a single crawl: a new one every 10 GB, and also one per crawler instance when there are multiple instances. This scenario needs to be tested to see what the webhook request looks like and how to handle it. The current code will definitely not handle it correctly.

What it should look like is multiple entries in the resources array, each with its own download link.
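
A minimal sketch of what handling that could look like, assuming the webhook body is a JSON object with a `resources` list and that each entry carries its download link under a `path` key. Both the key name and the `wacz_download_links` helper are assumptions for illustration until the real request has been inspected.

```python
from typing import Any


def wacz_download_links(payload: dict[str, Any], link_key: str = "path") -> list[str]:
    """Collect one download link per entry in the webhook's resources array.

    `link_key` is an assumed field name; adjust it once the actual
    webhook request for a multi-WACZ crawl has been checked.
    """
    links: list[str] = []
    # Treat a missing or null resources array as "no files".
    for entry in payload.get("resources") or []:
        link = entry.get(link_key)
        if link:
            links.append(link)
    return links
```

The caller would then download and process every returned link instead of assuming there is exactly one.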

makew0rld commented 4 weeks ago

After chatting on Discord, it looks like by using the download API instead of the URL, we can get a single-file "multi-wacz". This should solve the underlying semantic problem of how to deal with multiple WACZ files for one crawl. Code changes are still needed to actually use this API.


The problem is that for a crawl with a single WACZ, this means we are downloading an unnecessary wrapper file: the multi-wacz, a ZIP containing a WACZ. Replayweb.page supports this format, but is this what we want to standardize on internally? Idk...
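
One option, if we don't want the wrapper in the single-WACZ case, is to unwrap it after download. A rough sketch, assuming the nested WACZ files show up in the outer ZIP as entries whose names end in `.wacz` (not verified against a real download; `nested_wacz_names` and `unwrap_if_single` are hypothetical helpers):

```python
import io
import zipfile


def nested_wacz_names(data: bytes) -> list[str]:
    """List entries in the ZIP whose names end in .wacz (assumed layout)."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return [name for name in zf.namelist() if name.lower().endswith(".wacz")]


def unwrap_if_single(data: bytes) -> bytes:
    """Return the inner WACZ if the wrapper holds exactly one, else the input."""
    names = nested_wacz_names(data)
    if len(names) != 1:
        # Zero nested files (already a plain WACZ) or several (a real
        # multi-wacz): leave the download untouched.
        return data
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.read(names[0])
```

This keeps the multi-wacz only for crawls that actually have several files, at the cost of having two formats to deal with internally.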