quixdb / squash

Compression abstraction library and utilities
https://quixdb.github.io/squash/
MIT License
406 stars 53 forks source link

Add a stream API for zstd #216

Closed mknjc closed 8 years ago

mknjc commented 8 years ago

Without a stream API the squash command tries to allocate a buffer as big as the input file which isn't a very good idea with big files.

nemequ commented 8 years ago

Thanks! You even signed the CLA, which is awesome.

Unfortunately it looks like this causes some tests to fail; see Travis and AppVeyor. Note that if you target a branch other than "master", like "staging", the tests should run automatically when you file or update the PR.

Without a stream API the squash command tries to allocate a buffer as big as the input file which isn't a very good idea with big files.

FWIW, we do usually mmap the file instead of just mallocing a huge buffer, so unless you're on 32-bit it's probably not that bad. That said, streams are definitely preferred as even on 64-bit platforms we can't always use mmap.

mknjc commented 8 years ago

I'm not sure if cff90b2 is correct but it seams logical to me.

nemequ commented 8 years ago

Sorry it took so long for me to get to this.

I'm not sure if cff90b2 is correct but it seams logical to me.

No, something else is going on here.

Looking at how the return value for decompression is determined in squash_zstd_process_stream, it doesn't really match the documentation in zstd. zstd says that a return value of 0 means "a frame is completely decoded and fully flushed", but you're returning SQUASH_END_OF_STREAM there. Frame != stream; unless I'm mistaken (i.e., assuming zstd is similar to other codecs), streams can contain multiple frames, so it's quite possible that there is more data. AFAICT there is no way to know whether we have actually reached the end of stream or the end of a frame.

Unfortunately, zstd's API is a bit weird. If we try to flush the decompression by passing 0 input bytes to ZSTD_decompressStream after it previously returned 0 (indicating the end of a frame), it returns an error code instead of 0. So, we can either keep track of the previous value (in the SquashZstdStream struct) or handle the error code.

I squashed most of the commits and added a patch so the tests all pass (as well as fixing a warning from ICC). Everything should be working now.

Thanks for the patches. Streaming support for zstd is very nice to have :)

mknjc commented 8 years ago

Thanks for adding, but:

it doesn't really match the documentation in zstd. zstd says that a return value of 0 means "a frame is completely decoded and fully flushed", but you're returning SQUASH_END_OF_STREAM there. Frame != stream;

The documentation says: ZSTD_endStream() instructs to finish a frame. So I would say the words frame and stream are interchangeable. Additionally after flushing a frame in ZSTD_decompressStream set its state to zdss_readwhich set its state to to zdss_init if the frame has ended which does a return ERROR(init_missing); This means after reaching a end of frame the ZSTD_DStreamis unusable after a end of frame another sign zstd's end of frame is really a end of compressed stream

nemequ commented 8 years ago

@Cyan4973, could you confirm that (we're talking about ZSTD_decompressStream)?

Cyan4973 commented 8 years ago

I would rather lean towards @nemequ definition : a stream may contain multiple frames.

The confusion is quite understandable, as in general, a stream consists of a single frame, so they are both equivalent. If you compress a long file with zstd, you will get a single frame as a result.

But in more advanced scenarios, say if you compress a large file with pzstd, you would get multiples frames, all stored into a same file, and interleaved with skippable frames containing metadata (for multi-threaded decompression). Such file can still be decoded by zstd : it simply decodes each frame one by one, skip over skippable ones, and regenerate original data faithfully. It's a good example showing a stream consisting of multiple appended frames.

Another potential source of confusion is the difference between compression and decompression.

Compression is in charge. Compression decides when a frame ends. That's the point of ZSTD_endStream(). Compression could also decide to create multiple frames and append them in a single file or stream.

Decompression read and execute. It doesn't decide when a frame ends, nor if there is any other data to follow. It just gets the job done. Decompression still knows when a frame ends, even in the middle of a stream. That's why it returns a code 0. It allows the decoder to be informed that a frame was just completed, which, for some scenarios, is a relevant signal.

After finishing a frame, there might be some more input data left to decode.

The current API model requires to re-init the decoder context between each frame, using ZSTD_initDStream(). If data is pushed without init, the decoder context is considered uninitialized, which triggers an error. The benefit is that it makes an "end-of-frame" a compulsory signal. The downside is that it requires logic to insert init call.

Other logics could have been implemented. For example, ZSTD_decompressStream() could simply accept more data, transparently issuing an init in the background. The downside is that it's easy to miss the "end of frame" signal.

Or it could also refuse to read more data, and just output 0, saying "current frame is completed", implying "please start a new one with init". The downside is that it's easy to stay stuck in an infinite loop, re-presenting the same data again and again without it being consumed.

All 3 behaviors are acceptable. It's mostly a matter a selecting one.

Cyan4973 commented 8 years ago

If we try to flush the decompression by passing 0 input bytes to ZSTD_decompressStream after it previously returned 0 (indicating the end of a frame)

When ZSTD_decompressStream() returns 0, it means both that input has reached the end of a frame, and that the decoded frame has been entirely flushed. So there is nothing left to flush.

nemequ commented 8 years ago

Interesting. It seems like re-initializing the decoder would be the best way to go as that will let us interoperate with pzstd (as well as anything else that does something similar).

The benefit is that it makes an "end-of-frame" a compulsory signal. The downside is that it requires logic to insert init call.

Why is it important that people see the end of frame signal?

If it is a requirement it would be great if the documentation could be changed to reflect that; I doubt I'll be the only person who will be a bit confused with the current documentation.

Other logics could have been implemented. For example, ZSTD_decompressStream() could simply accept more data, transparently issuing an init in the background. The downside is that it's easy to miss the "end of frame" signal.

I'm not familiar enough with zstd's internals to offer a strong opinion, but it seems to me it would be better to optimize the API for the simple case (which people are more likely to use) as long as it's still possible to write a parallelized decoder (which, I assume, needs that end of frame signal?).

Does flushing during compression (ZSTD_flushStream) cause a frame to end as well?

I see examples/screaming_decompression.c doesn't handle streams with multiple frames; it might be nice to tweak it to do so.

Cyan4973 commented 8 years ago

Does flushing during compression (ZSTD_flushStream) cause a frame to end as well?

No, it causes a block to end, so that data can be flushed immediately. A frame consists of multiple appended blocks. It takes ZSTD_endStream() to end a frame.

Btw : and now I realize that the naming, ZSTD_endStream(), implies that it's ending a stream .... meh ....

I see examples/screaming_decompression.c doesn't handle streams with multiple frames; it might be nice to tweak it to do so.

Good point

Why is it important that people see the end of frame signal?

It depends on the application. In our case, it's an important signal in situations where we have to route frames from different formats. Typically, within just Zstandard, we have several legacy formats to support (v0.4.x, v0.5.x, v0.6.x, v0.7.x). This is now handled transparently within the streaming API, but it used to be manual. Now, the zlib wrapper API goes even farther, and is able to decode zstd and zlib frames mixed within a single file (or stream). pzstd also needs to actually read and decode some skippable frames in order to exploit the metadata it contains.

This is all rather advanced stuff, and I guess most casual users won't need that. Still, the capability to detect and stop at end of frame boundaries is an important feature for advanced scenarios to be possible.

Cyan4973 commented 8 years ago

OK, here is a potential follow up.

If I do remember correctly, an important problem when selecting an API behavior is a lack of feedback, especially when it's about advanced stuff that few people will ever need. Consequently, regarding this specific end-of-frame signal, I selected the most conservative solution, knowing that if it needed to be changed later, it would still be possible. That is, if I relax conditions to start new frames, that's still compatible with a strict initialization at new frame policy. The other direction would have broken existing user codes.

So here is a tentative behavior which seems simpler for "normal" usages, and remains compatible with existing codes : ZSTD_decompressStream() would still stop at end of frame, providing a 0 return code, but after that stage, it is possible to provide even more input data, without requiring to ZSTD_initDStream() first. It effectively makes ZSTD_initDStream() completely optional.

There is still a problem though, when associated with dictionaries. Requiring to init at the beginning of each new frame makes it very clear which dictionary must be used to decode the new frame. So, what now ? I believe the most sensible solution is to re-use previous parameters, hence re-use dictionary previously selected. In effect, it means the API would trigger a ZSTD_resetDStream() transparently in the background.

nemequ commented 8 years ago

@Cyan4973, that all sounds quite reasonable to me.

Sorry, I should have added streaming support to Squash a long time ago, before zstd's API was stable.

Cyan4973 commented 8 years ago

This change of Streaming API behavior will probably be introduced in Zstandard v1.2

nemequ commented 8 years ago

Great, thanks :). I'll probably throw together a patch for Squash to call resetDStream manually soon; no reason not to be compatible with 1.0 if we can.

Cyan4973 commented 7 years ago

For information : in latest dev branch update, the init policy regarding streaming decompression API has been softened. Now, providing more data to ZSTD_decompressStream() after a completed frame will trigger an implicit ZSTD_resetDStream(). It effectively means users can continue feeding more data as long as wanted without any need to init/reset at the end of each frame.

This change will be present in next release.