Open makew0rld opened 4 months ago
After chatting on Discord, it looks like by using the download API instead of the URL, we can get a single-file "multi-wacz". This should solve the underlying semantic problem of how to deal with multiple WACZ files for one crawl. Code changes are still needed to actually use this API.
The problem is that for a crawl with a single WACZ, this means we are downloading an unnecessary wrapper file: the multi-wacz, which is a ZIP containing a WACZ. Replayweb.page supports this format, but is this what we want to standardize on internally? Idk...
A crawl produces a new WACZ every 10 GB, and also when there are multiple crawler instances. This scenario needs to be tested to see what the webhook request looks like and how to handle it. Currently the code will definitely not handle it correctly.
What it should look like is multiple entries in the `resources` array, each with their own download link.
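To make the multi-WACZ case concrete, here is a rough sketch of how a handler could collect every download link instead of assuming there is only one. The payload shape is an assumption until the scenario is actually tested: the `name` and `path` keys and the `extract_wacz_links` helper are hypothetical names for illustration, not confirmed fields of the real webhook request.

```python
from typing import Any


def extract_wacz_links(payload: dict[str, Any]) -> list[tuple[str, str]]:
    """Return a (name, download link) pair for every entry in `resources`.

    Assumes each entry is a dict with a hypothetical `path` key holding the
    download link and an optional hypothetical `name` key.
    """
    resources = payload.get("resources") or []
    links = []
    for index, resource in enumerate(resources):
        url = resource.get("path")
        if not url:
            # Fail loudly rather than silently dropping part of a crawl.
            raise ValueError(f"resource {index} has no download link")
        # Fall back to a positional name if the entry does not carry one.
        links.append((resource.get("name", f"wacz-{index}"), url))
    return links
```

A single-WACZ crawl would just yield a one-element list, so the same code path could cover both cases without needing the multi-wacz wrapper.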