kaitai-io / kaitai_struct_formats

Kaitai Struct: library of binary file formats (.ksy)
http://formats.kaitai.io

Zlib #396

Open KOLANICH opened 3 years ago

KOLANICH commented 3 years ago
meta:
  id: deflate_stream
  title: ZLib-compressed blocks
  application: zlib
  xref:
    justsolve: Zlib
    mime: application/zlib
    rfc: 1951
    wikidata: Q2712
  endian: le
  bit-endian: le

doc: |
  Blocks compressed with zlib - a compression format designed by Mark Adler

doc-ref:
  - https://github.com/madler/zlib
  - https://github.com/golang/go/blob/f90e89e/src/compress/flate/inflate.go#L301
WiP: "https://gist.github.com/generalmimon/0f202457ebe8f1556293d611a949c358"
generalmimon commented 3 years ago

FWIW, I started working on this format. It is commonly called zlib compression, but this is quite an unfortunate designation (it leads to false impressions). zlib is mainly a library for Deflate compression and decompression (see https://zlib.net/).

See this good summary on Stack Overflow by Mark Adler (https://stackoverflow.com/a/20765054):

The zlib library supports Deflate compression and decompression, and three kinds of wrapping around the deflate streams. Those are: no wrapping at all ("raw" deflate), zlib wrapping, which is used in the PNG format data blocks, and gzip wrapping, to provide gzip routines for the programmer. The main difference between zlib and gzip wrapping is that the zlib wrapping is more compact, six bytes vs. a minimum of 18 bytes for gzip, and the integrity check, Adler-32, runs faster than the CRC-32 that gzip uses. Raw deflate is used by programs that read and write the .zip format, which is another format that wraps around deflate compressed data.

It's important to understand that the method of data compression, which is called zlib, "raw" deflate and gzip, is one and the same - the only difference is in the wrapping (envelope).

(Note: gzip and zlib headers might have some optional fields after the mandatory part, but that's out of scope of this basic intro.)
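
To make the "same compression, different envelope" point concrete, here is a minimal sketch using Python's standard `zlib` module (my choice for illustration, not something used in the spec). It compresses the same input three times and only varies `wbits`, which selects the wrapper. The helper name `deflate` and the sample data are made up; the fixed offsets used for stripping assume the minimal headers that the library writes by default (no preset dictionary, no optional gzip fields).

```python
import zlib

data = b"Kaitai Struct " * 20  # arbitrary sample input


def deflate(payload: bytes, wbits: int) -> bytes:
    """Compress payload; wbits selects the wrapper around the deflate stream."""
    co = zlib.compressobj(9, zlib.DEFLATED, wbits)
    return co.compress(payload) + co.flush()


raw = deflate(data, -15)           # negative wbits: raw deflate, no wrapper
zlib_wrapped = deflate(data, 15)   # 8..15: 2-byte zlib header + 4-byte Adler-32
gzip_wrapped = deflate(data, 31)   # 16 + 15: 10-byte gzip header + 8-byte trailer

zlib_overhead = len(zlib_wrapped) - len(raw)
gzip_overhead = len(gzip_wrapped) - len(raw)
print("zlib overhead:", zlib_overhead, "gzip overhead:", gzip_overhead)

# Stripping the envelope leaves a stream that a raw inflater accepts.
from_zlib = zlib.decompress(zlib_wrapped[2:-4], wbits=-15)
from_gzip = zlib.decompress(gzip_wrapped[10:-8], wbits=-15)
print(from_zlib == data, from_gzip == data)
```

The printed overheads should correspond to the "six bytes vs. a minimum of 18 bytes" from the quote above.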

Kaitai Struct supports only the process: zlib decompression out-of-the-box, which requires the zlib header to be present (at the beginning of the compressed data). That's quite unfortunate, because it prevents you from decompressing raw deflate and gzip. It would be better to have a process: deflate that decompresses the raw deflate data, and to parse the zlib and gzip header and "footer" fields with a KSY spec.
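
As an illustration of that split (parse the envelope as ordinary fields, hand only the body to a raw inflater), here is a rough Python sketch. The field layout follows RFC 1950; the function name `parse_zlib_envelope` and the dictionary keys are hypothetical, and the preset-dictionary case is deliberately left out. This is not how Kaitai Struct implements process: zlib, just a sketch of the proposed division of labour.

```python
import struct
import zlib


def parse_zlib_envelope(stream: bytes):
    """Split a zlib stream into header fields, raw deflate body and Adler-32."""
    cmf, flg = stream[0], stream[1]
    # RFC 1950: CMF and FLG, read as a big-endian 16-bit value, are a multiple of 31.
    if ((cmf << 8) | flg) % 31 != 0:
        raise ValueError("invalid zlib header check bits")
    header = {
        "compression_method": cmf & 0x0F,  # 8 means deflate
        "window_bits": (cmf >> 4) + 8,     # CINFO is log2(window size) - 8
        "has_preset_dict": bool(flg & 0x20),
        "compression_level": flg >> 6,     # informational only
    }
    if header["has_preset_dict"]:
        raise NotImplementedError("preset dictionaries are out of scope here")
    body = stream[2:-4]                    # raw deflate data
    (adler32,) = struct.unpack(">I", stream[-4:])  # trailer is big-endian
    return header, body, adler32


original = b"parsed by a spec, inflated by a process"
header, body, adler32 = parse_zlib_envelope(zlib.compress(original))

# Only the body goes to the decompressor; negative wbits means "raw deflate".
inflated = zlib.decompress(body, wbits=-header["window_bits"])
print(header, inflated == original, adler32 == zlib.adler32(inflated))
```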

And to the references that you linked - RFC 1950 really just describes the zlib header and footer (from which you can't read anything useful about the compressed data), the actual compression method is documented in RFC 1951 ("DEFLATE Compressed Data Format Specification version 1.3"). Also, the Wikidata item Q207240 refers to the zlib library, not to the compression format - DEFLATE - data decompression algorithm (Q2712) would be more appropriate.

I can think of a terminological thing to mention, in case someone isn't aware - deflating means "letting air or gas out of a balloon" and DEFLATE means the compression (reducing the size), and inflating is the opposite - it means "filling a balloon with air or gas" and it refers to the decompression. It's quite a funny analogy I think 😃


My WIP .ksy spec for the deflate stream is here: https://gist.github.com/generalmimon/0f202457ebe8f1556293d611a949c358

I consider the RFCs and docs pretty much incomprehensible and not practical (i.e. you are often left to devise the specific algorithms yourself) and I don't fancy C code at all (zlib library), but the Go language has a pretty good and legible DEFLATE implementation. The parsing of the deflated stream starts here: golang/go > src/compress/flate/inflate.go:301
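
For orientation, the very first thing any inflater reads is the 3-bit block header from RFC 1951: BFINAL (1 bit), then BTYPE (2 bits), packed starting from the least significant bit of the byte. The sketch below is my own illustration in Python, not a port of the Go code or of the WIP spec. It only handles stored blocks (BTYPE = 0), which is why it can assume every block header starts on a byte boundary; the helper names `stored_block` and `inflate_stored_only` are made up.

```python
import struct
import zlib


def stored_block(payload: bytes, final: bool) -> bytes:
    """Hand-build one stored (uncompressed) deflate block."""
    header = 1 if final else 0  # BFINAL in bit 0, BTYPE = 00 in bits 1..2
    length = len(payload)       # must fit in 16 bits
    return bytes([header]) + struct.pack("<HH", length, length ^ 0xFFFF) + payload


def inflate_stored_only(stream: bytes) -> bytes:
    """Decode a raw deflate stream that consists of stored blocks only."""
    out = bytearray()
    pos = 0
    while True:
        bfinal = stream[pos] & 1            # first bit read is the lowest bit
        btype = (stream[pos] >> 1) & 0b11   # next two bits
        if btype != 0:
            raise NotImplementedError(f"block type {btype} needs Huffman decoding")
        pos += 1                            # stored data starts at the next byte
        length, nlength = struct.unpack_from("<HH", stream, pos)
        if length != (~nlength & 0xFFFF):   # NLEN is the one's complement of LEN
            raise ValueError("LEN/NLEN mismatch")
        pos += 4
        out += stream[pos:pos + length]
        pos += length
        if bfinal:
            return bytes(out)


stream = stored_block(b"hello ", final=False) + stored_block(b"world", final=True)
decoded = inflate_stored_only(stream)
reference = zlib.decompress(stream, wbits=-15)  # cross-check with the real library
print(decoded, decoded == reference)
```

Block types 1 and 2 (fixed and dynamic Huffman codes) are where the real work is, and they are not covered here.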

So most of the deflate_stream.ksy spec is actually based on stepping through the Go flate implementation in the VSCode debugger (coming from the Go For Visual Studio Code extension). The debugger is really helpful, because you can see the algorithm step-by-step, reimplement a part of it in the KSY and check whether the intermediate values shown in the debugger match those from the KSY. FWIW, this is the application code that I was stepping through: https://play.golang.org/p/E5dLnJRZ4Im

The beginning is pretty simple, but it becomes ugly fast. You often need to implement various counters and collector variables. For example, the Go implementation uses various mutable byte arrays, so this needs to be converted to the immutable paradigm for KSY usage, etc. I'm not sure if the KSY spec can even be finished.
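
To give an idea of the kind of stateful step meant here, this is a Python sketch of the code assignment described in RFC 1951, section 3.2.2: given only the code length of each symbol, count the codes per length, derive the first code of each length, then hand out consecutive codes. It is written from the RFC, not from the Go implementation or the WIP spec, and the name `canonical_codes` is made up. Note the two pieces of mutable state (`bl_count` and `next_code`) that an immutable KSY version would have to express differently.

```python
from collections import Counter


def canonical_codes(lengths):
    """Map symbol index to its Huffman code (as a bit string) from code lengths."""
    # Number of codes for each bit length; length 0 means "symbol unused".
    bl_count = Counter(length for length in lengths if length > 0)
    max_bits = max(lengths, default=0)

    # Smallest code value for each bit length.
    next_code = {}
    code = 0
    for bits in range(1, max_bits + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code

    # Symbols of the same length get consecutive codes, in symbol order.
    codes = {}
    for symbol, length in enumerate(lengths):
        if length:
            codes[symbol] = format(next_code[length], f"0{length}b")
            next_code[length] += 1
    return codes


# Code lengths for an 8-symbol alphabet A..H, as in the example in RFC 1951.
codes = canonical_codes([3, 3, 3, 3, 3, 2, 4, 4])
for symbol, bits in codes.items():
    print("ABCDEFGH"[symbol], bits)
```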

I don't think I'm going to actively work on the spec in the near future, so if anyone feels like picking it up, please let me know.

KOLANICH commented 3 years ago

> Kaitai Struct supports only the process: zlib decompression out-of-the-box, which requires the zlib header to be present (at the beginning of the compressed data). That's quite unfortunate, because it prevents you from decompressing raw deflate and gzip. It would be better to have a process: deflate that decompresses the raw deflate data, and to parse the zlib and gzip header and "footer" fields with a KSY spec.

kaitai_compress has a PR fixing that for Python.

> And to the references that you linked - RFC 1950 really just describes the zlib header and footer (from which you can't read anything useful about the compressed data), the actual compression method is documented in RFC 1951 ("DEFLATE Compressed Data Format Specification version 1.3"). Also, the Wikidata item Q207240 refers to the zlib library, not to the compression format - DEFLATE - data decompression algorithm (Q2712) would be more appropriate.

Fixed, thanks.

generalmimon commented 3 years ago

> kaitai_compress has a PR fixing that for python.

Thanks, I wasn't aware of it. I will look into it.