indygreg / python-zstandard

Python bindings to the Zstandard (zstd) compression library
BSD 3-Clause "New" or "Revised" License

Binding to ZSTD_flushStream() for realtime scenarios #85

Open brubbel opened 5 years ago

brubbel commented 5 years ago

Is there currently a binding to ZSTD_flushStream()?

It seems that zstd.FLUSH_BLOCK does allow the decompressor to decode valid data, but not up to the latest data written into the compressor.

indygreg commented 5 years ago

Various compressor types have flush() methods. Search README.rst for flush and you should find relevant documentation.

My understanding is that zstd.FLUSH_BLOCK (which corresponds to ZSTD_e_flush) will ensure any data written to the compressor so far will be decodable on a decompressor. I even remember speaking with the zstd maintainers to confirm this behavior. If you are not seeing this behavior, it is either a bug in python-zstandard or buffering on the output/input streams outside of python-zstandard could be at fault.

python-zstandard does not call ZSTD_flushStream() directly. As zstd.h says, this function is equivalent to ZSTD_compressStream2(zcs, output, &emptyInput, ZSTD_e_flush). And this should be what some flush() methods are calling.
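
For anyone who wants to check this behavior locally, here is a minimal sketch. It assumes the third-party zstandard package is installed and uses the compressobj() API, whose block-flush constant is COMPRESSOBJ_FLUSH_BLOCK (to my knowledge, zstd.FLUSH_BLOCK is the equivalent for the stream writer). The function name recover_after_each_flush is made up for illustration. If block flushing behaves as described above, each chunk should come back in full right after its flush.

```python
def recover_after_each_flush(chunks):
    """Block-flush after every chunk and report what a decompressor recovers.

    Returns one bytes object per input chunk: the data the decompressor
    produced from the bytes emitted for that chunk alone.
    """
    # Imported here so the sketch can be loaded even without the package.
    import zstandard as zstd  # third-party: pip install zstandard

    cobj = zstd.ZstdCompressor().compressobj()
    dobj = zstd.ZstdDecompressor().decompressobj()

    recovered = []
    for chunk in chunks:
        # compress() may buffer internally; the block flush forces out
        # everything written so far without ending the frame.
        wire = cobj.compress(chunk) + cobj.flush(zstd.COMPRESSOBJ_FLUSH_BLOCK)
        recovered.append(dobj.decompress(wire))
    return recovered
```

Comparing the returned list against the input chunks shows whether any data is being held back by the compressor itself, as opposed to buffering in a socket or file layer around it.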

If you want to audit the source code, I recommend reading cffi.py, as the Python code is a bit easier to comprehend than the C code. The CFFI and C bindings should be functionally equivalent.

brubbel commented 5 years ago

> My understanding is that zstd.FLUSH_BLOCK (which corresponds to ZSTD_e_flush) will ensure any data written to the compressor so far will be decodable on a decompressor.

That is correct. One can start to decompress, but from my tests I have to conclude that not all data is flushed by the compressor. If FLUSH_FRAME is called, it does sync all data but also resets the current dictionary.

This is in contrast to zlib.Z_SYNC_FLUSH, which flushes all data but allows compression to continue with the same dictionary.
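
To make the zlib comparison concrete, here is a small sketch using only the standard library. The helper name sync_flush_demo and the test payload are illustrative choices, not anything from python-zstandard. It sends the same hard-to-compress payload twice on one stream, with Z_SYNC_FLUSH after each write. If the history window survives the flush, the second copy should decode in full and also take far fewer bytes on the wire than the first.

```python
import hashlib
import zlib


def sync_flush_demo(chunks):
    """Compress chunks on one zlib stream, calling Z_SYNC_FLUSH after each.

    Returns a list of (recovered_bytes, compressed_size) pairs, one per chunk.
    """
    comp = zlib.compressobj()
    decomp = zlib.decompressobj()
    results = []
    for chunk in chunks:
        # Z_SYNC_FLUSH emits all pending output but keeps the stream open
        # and the compression history intact.
        wire = comp.compress(chunk) + comp.flush(zlib.Z_SYNC_FLUSH)
        results.append((decomp.decompress(wire), len(wire)))
    return results


# 1 KiB of hash output: deterministic, but hard to compress on its own.
payload = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(32))
results = sync_flush_demo([payload, payload])

for recovered, size in results:
    print(recovered == payload, size)
```

The second size being much smaller than the first is the "same dictionary" behavior described above: the compressor can still refer back to data sent before the flush.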

indygreg commented 5 years ago

Would it be possible for you to articulate your request in terms of zstd C API calls and/or python-zstandard functions? I'm not sure I fully understand what it is you are trying to do. We seem to be talking about the streaming APIs, but dictionaries are also involved. There are enough combinations that I'm not sure exactly what the request is for.