iipc / warc-specifications

Centralised repository for WARC usage specifications.
http://iipc.github.io/warc-specifications/

Add Zstandard compression draft #69

Status: Closed (yotann closed this 3 years ago)

yotann commented 3 years ago

Add a draft specification for compressing WARC files with Zstandard. Would close #53; see discussion there.

This specification is intended to be compatible with the .warc.zst files already being generated by Archive Team.

Aside from the questions raised at #53, one more occurred to me: should there be a magic number just for .warc.zst files? With the current format, the only way to distinguish a .warc.zst file from a different kind of .zst file is to decompress the first few bytes. We could add a new skippable frame to the beginning of the file that marks it as a WARC file. Then again, I'm not sure implementers would want to bother with that.
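To illustrate the point about magic numbers, here is a minimal sketch of what a reader can learn from the first four bytes alone. The constants are the Zstandard frame magic and the skippable-frame magic range from the Zstandard format; the function name `classify_zst_start` is hypothetical. Neither result says anything about whether the content is WARC, which is the gap a dedicated marker frame would fill.

```python
import struct

# Magic numbers from the Zstandard frame format (stored little-endian on disk).
ZSTD_FRAME_MAGIC = 0xFD2FB528
SKIPPABLE_MAGIC_MIN = 0x184D2A50
SKIPPABLE_MAGIC_MAX = 0x184D2A5F


def classify_zst_start(header: bytes) -> str:
    """Classify the first four bytes of a file (hypothetical helper).

    Returns "zstd-frame", "skippable-frame", or "unknown". It cannot tell
    a .warc.zst file apart from any other .zst file.
    """
    if len(header) < 4:
        return "unknown"
    (magic,) = struct.unpack("<I", header[:4])
    if magic == ZSTD_FRAME_MAGIC:
        return "zstd-frame"
    if SKIPPABLE_MAGIC_MIN <= magic <= SKIPPABLE_MAGIC_MAX:
        return "skippable-frame"
    return "unknown"
```

In practice this would be called as `classify_zst_start(open(path, "rb").read(4))`.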

yotann commented 3 years ago

Rendered version.

JustAnotherArchivist commented 3 years ago

Thanks again for this, and apologies for never getting back to it! I will implement this soon anyway in one of my projects and test it extensively on existing .warc.zst files. Should there be any inconsistencies, I will hopefully catch them then, to be resolved in a follow-up PR. :-)

wumpus commented 3 years ago

As predicted, here's the first Stack Overflow question about a zstd "Dictionary mismatch" error:

https://stackoverflow.com/questions/68349984/how-to-decompress-a-warc-zst-file
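A "Dictionary mismatch" error generally means the decoder was not given the dictionary the frames were compressed with. Assuming the file carries its dictionary in a leading skippable frame (an assumption here, not something this thread states), a reader would first need to split that frame off. The sketch below does only that step, using the standard skippable-frame layout of a 4-byte magic followed by a 4-byte little-endian size; the function name is hypothetical, and the payload may need further processing before a decoder can use it as a dictionary.

```python
import struct


def split_leading_skippable_frame(data: bytes):
    """Split a leading skippable frame off `data` (hypothetical helper).

    Returns (magic, payload, rest) if `data` starts with a skippable
    frame, or None if it does not.
    """
    if len(data) < 8:
        return None
    # Header: 4-byte magic, then 4-byte payload size, both little-endian.
    magic, size = struct.unpack("<II", data[:8])
    if not (0x184D2A50 <= magic <= 0x184D2A5F):
        return None
    if len(data) < 8 + size:
        raise ValueError("truncated skippable frame")
    return magic, data[8:8 + size], data[8 + size:]
```

The returned `rest` begins at the next frame, so it can be handed to a Zstandard decoder together with whatever dictionary was recovered from `payload`.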