Open rebeccacremona opened 2 years ago
Related to #3038
Throwing a thought in this ticket: if we continue to send our stuff to Internet Archive after this migration, we will probably want to continue sending them WARCs instead of WACZs. I think they will prefer to derive their own CDX lines and store them in their own format like they do currently.
Discussed today - https://hlslil.slack.com/archives/C07URASMC/p1659730375523229
Summary:
- `.wacz`s, which are meant to be range-streamed and are cached differently than `.warc.gz`s by `<replay-web-page>`, would help alleviate the limitations of client-side storage we're dealing with.
- `.wacz` files by converting `.warc.gz` on the fly as they are requested, and storing the resulting artifact.
- `.wacz`s by default.

Would running a small-scale experiment locally on a batch of X archives first, with automated checks to identify edge cases, be a good first step?
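To make the "convert on request, store the artifact, then run automated checks" idea a bit more concrete, here is a rough sketch. It is not a proposal for the actual implementation. `get_or_create_wacz`, `check_wacz`, and the `convert` callable are hypothetical names; the real converter (whatever tool we end up using) would be passed in. The checks assume the layout I understand the WACZ spec to describe: a ZIP file with a `datapackage.json` at the root that lists `resources` with `path` and `bytes`, and WARC data under `archive/`.

```python
import json
import zipfile
from pathlib import Path
from typing import Callable


def get_or_create_wacz(
    warc_path: Path,
    cache_dir: Path,
    convert: Callable[[Path, Path], None],
) -> Path:
    """Return the stored .wacz for a .warc.gz, converting on first request.

    `convert(src, dst)` is a hypothetical stand-in for the real converter.
    """
    name = warc_path.name
    stem = name[: -len(".warc.gz")] if name.endswith(".warc.gz") else warc_path.stem
    wacz_path = cache_dir / f"{stem}.wacz"
    if not wacz_path.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        # Write to a temporary name first so a failed conversion
        # never leaves a partial .wacz behind under the final name.
        tmp_path = wacz_path.with_name(wacz_path.name + ".tmp")
        convert(warc_path, tmp_path)
        tmp_path.replace(wacz_path)
    return wacz_path


def check_wacz(wacz_path: Path) -> list[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    if not zipfile.is_zipfile(wacz_path):
        return ["not a ZIP file"]
    problems = []
    with zipfile.ZipFile(wacz_path) as zf:
        names = set(zf.namelist())
        if "datapackage.json" not in names:
            return ["missing datapackage.json"]
        try:
            package = json.loads(zf.read("datapackage.json"))
        except ValueError:
            return ["datapackage.json is not valid JSON"]
        if not isinstance(package, dict):
            return ["datapackage.json is not a JSON object"]
        # Every resource listed in the manifest should exist in the ZIP
        # and, when a size is declared, match it.
        for resource in package.get("resources", []):
            path = resource.get("path")
            if path not in names:
                problems.append(f"listed resource missing: {path}")
            elif "bytes" in resource and zf.getinfo(path).file_size != resource["bytes"]:
                problems.append(f"size mismatch: {path}")
        if not any(n.startswith("archive/") for n in names):
            problems.append("no WARC data under archive/")
    return problems
```

For the local experiment, the batch run could be as simple as looping over the X archives, calling `get_or_create_wacz` on each, and collecting every non-empty `check_wacz` result as an edge case to look at by hand. These checks are structural only; they don't tell us whether the archive actually replays correctly.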