Files with multiple frames have frames skipped?

indygreg / python-zstandard

Python bindings to the Zstandard (zstd) compression library

BSD 3-Clause "New" or "Revised" License

Files with multiple frames have frames skipped? #59

Open razeh opened 6 years ago

razeh commented 6 years ago

I've run into a use case involving files with multiple frames. It looks like only the first frame is read. The zstd command-line tool has no problem decompressing the file, but when I use read_to_iter or a ZstdDecompressor.decompress call, only the first frame is returned.

I've uploaded an example here where the two_frame.zst file has one frame per line.

indygreg commented 6 years ago

Support for decoding multiple frames is a legitimate feature request. The question becomes what the default behavior should be and what controls should be present to influence it.

Since the official tools transparently handle multiple frames, I'm leaning towards that being the default for python-zstandard - at least for streaming operations. We will probably want arguments to control that behavior however.

There are also use cases where consumers want to know where a logical frame ended and another began. So we need to consider how APIs will behave in the presence of multiple input frames.

As a workaround, you can construct a new ZstdDecompressor instance to handle subsequent frames. This is a bit less efficient than reusing an existing instance. But it should work. 0.10's relaxed requirements around the use of context managers should make the code more tolerable.

c-wicklein commented 6 years ago

This issue has some commonality with #29 in that although I'm only trying to process one frame at a time, I still don't know where the frame boundaries occur. I submit all the data I have, take the first frame, take back any unused data, and continue appending subsequent data to that unused data before submitting the buffer again. I do this with the same ZstdDecompressor instance over time.
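The loop described above (submit the buffer, take the first frame, take back the unused bytes, append newly arrived data) can be sketched roughly as follows. This is only an illustration of the pattern, not python-zstandard code: it uses the standard library's zlib in gzip mode as a stand-in, because its decompression objects expose `eof` and `unused_data`. The helper name `iter_members` is hypothetical.

```python
import gzip
import zlib

# gzip framing in zlib is selected with wbits = 16 + MAX_WBITS.
GZIP_WBITS = 16 + zlib.MAX_WBITS


def iter_members(chunks):
    """Yield the decompressed payload of each gzip member, one at a time.

    `chunks` is any iterable of bytes objects arriving over time.
    (Hypothetical helper; gzip is a stand-in for zstd frames.)
    """
    dobj = zlib.decompressobj(GZIP_WBITS)
    parts = []      # output collected for the member being decoded
    pending = b""   # input not yet handed to a decompressor
    for chunk in chunks:
        pending += chunk
        while pending:
            parts.append(dobj.decompress(pending))
            if not dobj.eof:
                # Member is incomplete; all input was consumed, wait for more.
                pending = b""
                break
            yield b"".join(parts)
            # Take back whatever followed the member and start a new one.
            pending = dobj.unused_data
            dobj = zlib.decompressobj(GZIP_WBITS)
            parts = []


if __name__ == "__main__":
    data = gzip.compress(b"foo") + gzip.compress(b"bar")
    pieces = [data[i:i + 5] for i in range(0, len(data), 5)]
    for payload in iter_members(pieces):
        print(payload)
```

The sketch creates a fresh decompression object per member, whereas the comment above reuses one ZstdDecompressor instance; the buffer handling is the part being illustrated.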

indygreg commented 5 years ago

The master branch now supports reading across frames when using the ZstdDecompressor.stream_reader() interface. This behavior is controlled by a read_across_frames argument. When true, read() can return data spanning multiple frames. When false, read() stops at the end of the current frame.

There's still a ways to go. For example, we don't yet support spanning multiple frames in the other decompression APIs. And we don't yet expose exact input counts, so you don't know exactly where in the input stream the frame boundary occurred. But it is a start.

markopy commented 3 years ago

I have a use case which is related to this ticket and #29. My goal is to decompress a file with multiple frames and create a search index pointing to the relevant compressed frame.

The read_across_frames=False option allows me to read only one frame, but it doesn't return the exact length of the compressed frame. Also, because the input stream is read in read_size chunks, the reader may already have consumed bytes past the frame boundary, so it's impossible to decompress any subsequent frames by calling ZstdDecompressor.stream_reader() again.

Right now I'm pre-scanning the input file myself to find frame boundaries and then decompressing the frames one by one, but this has a lot of overhead for small frames. It would be great to have an API that can read a stream and deliver each decompressed frame along with its location and size in the input stream.

embg commented 2 years ago

Hi, I am still seeing this issue on 0.17.0:

>>> import zstandard
>>> zstandard.decompress(zstandard.compress(b"foo") + zstandard.compress(b"bar"))
b'foo'

This is inconsistent with the C API, which will decompress all frames. It is also inconsistent with gzip:

>>> import gzip
>>> gzip.decompress(gzip.compress(b"foo") + gzip.compress(b"bar"))
b'foobar'

Finally, this behavior is inconsistent with the Zstandard format specification (RFC 8878):

3.1. Frames

Zstandard compressed data is made up of one or more frames. Each frame is independent and can be decompressed independently of other frames. The decompressed content of multiple concatenated frames is the concatenation of each frame's decompressed content.

Can you please change the default behavior to match the C API, which decompresses all frames?

(Thanks @thatch for alerting me to this issue.)