yotann closed this 3 years ago
Thanks again for this, and apologies for never getting back to it!
I will implement this soon anyway in one of my projects and test it extensively on existing .warc.zst
files. Should there be any inconsistencies, I will hopefully catch them then, to be resolved in a follow-up PR. :-)
As predicted, here's the first Stack Overflow question about the zstd "Dictionary mismatch" error:
https://stackoverflow.com/questions/68349984/how-to-decompress-a-warc-zst-file
Add a draft specification for compressing WARC files with Zstandard. Would close #53; see discussion there.
This specification is intended to be compatible with the .warc.zst files already being generated by Archive Team.
Aside from the questions raised at #53, one more occurred to me: should there be a magic number just for .warc.zst files? With the current format, the only way to distinguish a .warc.zst file from a different kind of .zst file is to decompress the first few bytes. We could add a new skippable frame to the beginning of the file that marks it as a WARC file. Then again, I'm not sure implementers would want to bother with that.
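To make the skippable-frame idea concrete, here is a rough Python sketch of what writing and checking such a marker could look like. The Zstandard frame magic (0xFD2FB528) and the skippable-frame magic range (0x184D2A50 to 0x184D2A5F) come from the Zstandard format itself. The marker magic 0x184D2A5E, the "WARC" payload, and all function names are hypothetical placeholders chosen purely for illustration; nothing in this draft defines them.

```python
import struct

# Magic numbers defined by the Zstandard frame format (stored little-endian).
ZSTD_FRAME_MAGIC = 0xFD2FB528
SKIPPABLE_MAGIC_MIN = 0x184D2A50
SKIPPABLE_MAGIC_MAX = 0x184D2A5F

# HYPOTHETICAL: a marker magic and payload picked for this sketch only.
HYPOTHETICAL_MARKER_MAGIC = 0x184D2A5E
HYPOTHETICAL_MARKER_PAYLOAD = b"WARC"


def make_marker_frame() -> bytes:
    """Build a skippable frame: 4-byte magic, 4-byte payload size, payload."""
    header = struct.pack(
        "<II", HYPOTHETICAL_MARKER_MAGIC, len(HYPOTHETICAL_MARKER_PAYLOAD)
    )
    return header + HYPOTHETICAL_MARKER_PAYLOAD


def first_frame_kind(head: bytes) -> str:
    """Classify the first frame from the leading bytes of a file."""
    if len(head) < 4:
        return "unknown"
    (magic,) = struct.unpack_from("<I", head, 0)
    if magic == ZSTD_FRAME_MAGIC:
        return "zstd"
    if SKIPPABLE_MAGIC_MIN <= magic <= SKIPPABLE_MAGIC_MAX:
        return "skippable"
    return "unknown"


def looks_like_marked_warc_zst(head: bytes) -> bool:
    """Check for the hypothetical marker frame without decompressing anything."""
    if len(head) < 8:
        return False
    magic, size = struct.unpack_from("<II", head, 0)
    if magic != HYPOTHETICAL_MARKER_MAGIC:
        return False
    return head[8:8 + size] == HYPOTHETICAL_MARKER_PAYLOAD
```

With a marker like this, a reader would only need the first 12 bytes of a file to tell it apart from an ordinary .zst file, and a decoder that follows the Zstandard format would skip the frame rather than fail on it. The cost is the one already mentioned: every writer would have to emit the extra frame.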