quixdb / squash

Compression abstraction library and utilities
https://quixdb.github.io/squash/
MIT License

Filters #165

Open nemequ opened 8 years ago

nemequ commented 8 years ago

@inikep recently released XWRT. I'm thinking it may make sense to create a way to chain preprocessors onto compression/decompression operations, and add the ability to create plugins for preprocessors as well as compressors/decompressors.

"Preprocessors" probably isn't the best word; for decompression they're really "postprocessors". Perhaps "filters" would be a better word. This could also allow us to add layers for things like error correction, cryptography, streaming support for buffer-to-buffer compressors (a bit like how snappy-framed works), etc.

It could even be possible to create something like a TCP filter which just tunnels everything over a network, or an SSL filter which tunnels content over SSL (and would likely build on a TCP filter), though I feel like that might be more of a supported abuse of the idea than a use case we should explicitly target…

nemequ commented 8 years ago

I've been thinking about this in a bit more detail.

Requirements

  1. Symmetrical. decode(encode(M)) == M must be true. If you want a one-way hash, lossy compression, etc., use something else. OTOH, an encoder which adds a hash value and a decoder which verifies it and outputs the original data would definitely be appropriate. (See the round-trip sketch just after this list.)
  2. Generic. decode(encode(M)) == M must be true for all inputs. If you want to write a filter which works on executables, DNA data, etc., it can't emit an error if the data doesn't conform to what you expect. This could force you to use a small container to indicate whether to actually apply the filter or not, but I'm guessing that misuse of a filter would usually just result in a poor ratio and wasted CPU cycles.
  3. Synchronous. After thinking about the SSL and TCP ideas, I just don't see how it's possible with the zlib-style stream API. Things like networking can, and probably should, be done at a higher level.
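
As a quick illustration of requirements 1 and 2, every codec (filter or not) should survive a round trip through Squash's buffer-to-buffer API. A minimal sketch, with "lz4" as a placeholder codec name (a filter would be exercised the same way; signatures are from memory, so check them against squash/squash.h):

#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <squash/squash.h>

static void
check_round_trip (const uint8_t* input, size_t input_length) {
  size_t compressed_length = squash_get_max_compressed_size ("lz4", input_length);
  uint8_t* compressed = malloc (compressed_length);
  size_t decompressed_length = input_length;
  uint8_t* decompressed = malloc (decompressed_length);

  assert (squash_compress ("lz4", &compressed_length, compressed,
                           input_length, input, NULL) == SQUASH_OK);
  assert (squash_decompress ("lz4", &decompressed_length, decompressed,
                             compressed_length, compressed, NULL) == SQUASH_OK);

  /* decode(encode(M)) == M, for any M */
  assert (decompressed_length == input_length);
  assert (memcmp (decompressed, input, input_length) == 0);

  free (compressed);
  free (decompressed);
}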

Types

I don't think it would be right to think of compression codecs as "codecs" and everything else as "filters". Instead, everything would be a codec, and compression codecs would just be one type. I think it would be good to add an enumeration to make the type easily introspectable, and indicate the codec type in the plugin's 'squash.ini'. Off the top of my head:

  * Compression
  * Checksum
  * Cryptography
  * Error correction
  * Container
  * Other

"Other* would basically be a catch-all; if codecs actually use it we should look at what they're doing and consider adding a new type to describe that use case.

"Container" is interesting because it could potentially overlap with everything else. I'm thinking about it as somewhere you might want to put a generic implementation of something like snappy-framed; if it's just a container for adding a checksum then "checksum" would be more appropriate. The type wouldn't really have any effect on Squash itself, it would just be used to provide more information to consumers.

Pipelines

In order to make this useful, I'm thinking the best way forward would be to create a subclass of SquashStream, tentatively SquashPipeline. Pipelines would basically just be a way to chain together multiple codecs. The API could be as simple as

SquashPipeline* squash_pipeline_new (SquashStreamType stream_type);
void squash_pipeline_add (SquashPipeline* pipeline, SquashStream* stream);
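
Usage would be correspondingly simple. A hypothetical sketch (the child streams would be created with the existing stream API, which is elided here):

/* Hypothetical usage of the proposed API. */
static SquashStream*
make_lz4_aes_pipeline (SquashStream* lz4_stream, SquashStream* crypto_stream) {
  SquashPipeline* pipeline = squash_pipeline_new (SQUASH_STREAM_COMPRESS);
  squash_pipeline_add (pipeline, lz4_stream);
  squash_pipeline_add (pipeline, crypto_stream);
  /* Since SquashPipeline would be a SquashStream subclass, the result
   * can be driven like any other stream from here on. */
  return (SquashStream*) pipeline;
}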

I'd also like to add a function to parse string descriptions of a pipeline. The syntax could be similar to what gst-launch (from GStreamer) expects, though I'd use vertical pipes instead of exclamation marks. So you could have a string like "lz4 level=9 | aes-gcm key=0xdeadbeef | ldpc" for compressing data with LZ4, encrypting it with AES-GCM, and adding LDPC for error correction. Or, for XWRT + Brotli, you could have "xwrt | brotli level=11". The API could still be quite simple:

SquashPipeline* squash_pipeline_new_parse (SquashStreamType stream_type, const char* pipeline);

If you make it a decoder stream then obviously everything would just be added in the reverse order.
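
For example (hypothetical, reusing the description string from above):

/* The same description string works for both directions. */
const char* desc = "lz4 level=9 | aes-gcm key=0xdeadbeef | ldpc";
SquashPipeline* encoder = squash_pipeline_new_parse (SQUASH_STREAM_COMPRESS, desc);
/* The decoder chain runs ldpc, then aes-gcm, then lz4. */
SquashPipeline* decoder = squash_pipeline_new_parse (SQUASH_STREAM_DECOMPRESS, desc);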

Composability

One interesting thing is that pipelines would be composable. You could easily have a plugin which implements a codec as a pipeline. For example, you could have a "gzip" codec which is implemented as a pipeline of "deflate | crc32 | gzip-container" (the gzip format is basically a header, raw DEFLATE data, and a CRC-32 trailer, so something along those lines should work; zlib is the format which uses Adler-32).
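
So a plugin's stream constructor could, hypothetically, be as thin as this (assuming crc32 and gzip-container existed as filter codecs):

/* Hypothetical composed codec: a "gzip" stream is just a pipeline. */
static SquashStream*
gzip_create_stream (SquashStreamType stream_type) {
  return (SquashStream*)
    squash_pipeline_new_parse (stream_type, "deflate | crc32 | gzip-container");
}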

CLI

One issue with this is that I'm not sure how it would fit into the current CLI. The simplest solution would be to let the codec argument accept pipeline description strings, but there would be some weirdness because then we would have two ways to provide options (-c "foo bar=baz" vs. -c foo -o bar=baz).

I don't want to just drop the current way of doing things because I like the idea of squash generally working like other compression tools (gzip, bzip2, xz, etc.). A couple options come to mind:

Implementation

This is all actually pretty straightforward to implement.
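
The pipeline type itself could be little more than this rough sketch (field names illustrative, not real API):

/* Embedding SquashStream as the first member is what would make a
 * pipeline usable anywhere a stream is. */
struct SquashPipeline_ {
  SquashStream base_object;

  SquashStream** children;   /* chained codecs, in encode order */
  size_t n_children;
  size_t children_capacity;
};

Processing would just pump each child's output into the next child's input, walking the array in reverse order for decompression.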

inikep commented 8 years ago

In FreeArc you can use many combinations of preprocessors. Maybe it will help you design your implementation: http://freearc.org/FreeArc036-eng.htm (search for "preprocessor").

nemequ commented 8 years ago

I definitely prefer GStreamer's syntax to FreeArc's, and it is much more widely known. As for code, based on what I've seen of Tornado I have no desire to even look at FreeArc's code (even if it weren't GPL).

The only interesting thing about FreeArc's pipelines that I've seen is that they support running different files through different preprocessors, but that isn't something that fits in a single-file compressor like Squash.

Bulat-Ziganshin commented 7 years ago

FreeArc's license isn't the GPL; it's a sort of "look but don't touch" :)