rust-lang / flate2-rs

DEFLATE, gzip, and zlib bindings for Rust
https://docs.rs/flate2
Apache License 2.0

Partial file decompress support #239

Closed. ryanvade closed this issue 4 years ago

ryanvade commented 4 years ago

I'm working on an application that pulls parts of compressed files out of S3. I don't want to pull the entire file out of S3 because of the file sizes. I should be able to pass part of the file to GzDecoder and decompress just that chunk.

let client = S3Client::new(Region::default());
let req = GetObjectRequest {
    bucket: self.bucket.clone(),
    key: self.key.clone(),
    version_id: Some(self.version_id.clone()),
    part_number: Some(part),
    ..Default::default()
};

let response = client.get_object(req).await;
if response.is_err() {
    let err = response.err().unwrap();
    error!("{}", err);
    panic!("Unable to fetch object from S3");
}
let response = response.unwrap();
// More Code Here...

let body = response.body.unwrap();
let mut buff = BytesMut::with_capacity(512);
match body.into_async_read().read_buf(&mut buff).await {
// More Code Here...
let frozen_bytes = buff.to_vec();
let mut deflater = GzDecoder::new(&frozen_bytes[..]);
let mut s = String::new();
let read_response_length = deflater.read_to_string(&mut s);

However, I get a corrupt deflate stream error. Is it not possible to pass only part of a gzip compressed file to the GzDecoder?

alexcrichton commented 4 years ago

AFAIK zlib-based streams (including gzip) don't support seeking or resumption in the middle of the stream. I don't think that this is possible to implement in this library.

ryanvade commented 4 years ago

Really? I've been able to do this with the Zlib package for Python. I'm trying to decompress chunks from the start of the object in order, not random seeking.

https://docs.python.org/3/library/zlib.html#zlib.decompressobj


import zlib

def stream_zlib_decompress(stream):
    # 32 + MAX_WBITS (47) auto-detects a zlib or gzip header
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
    for chunk in stream:
        rv = dec.decompress(chunk)
        if rv:
            yield rv

stream = s3_object.get(PartNumber=1)
stream = stream.get("Body")
part = b""
for data in stream_zlib_decompress(stream):
    part = part + data
    if len(part) >= 1024:
        break

alexcrichton commented 4 years ago

That's supported through the Decompress object, but the window bits are not currently exported as a parameter. If you need that then it should be easy enough to add a constructor for it!

ryanvade commented 4 years ago

It doesn't seem that I can set the window bits to +47, according to the assertion at https://github.com/alexcrichton/flate2-rs/blob/5ef87027cf9a9a6c876886279f74215c7965a902/src/mem.rs#L349

alexcrichton commented 4 years ago

Ah true! AFAIK no one's really tinkered with that historically. If the underlying C implementation supports other values of window_bits then we probably just need to update the assertion.

ryanvade commented 4 years ago

According to the Python Zlib docs here are the supported values:

As a side note, I have been trying this out with the following:

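use flate2::{Decompress, FlushDecompress};

const B: &[u8] = b"\x1f\x8b\x08\0\0\0\0\0\0\0\xec\xbd\xd9r\xe4X\x926v\r=....";
let mut decompress = Decompress::new_with_window_bits(true, 15); // prefer 47
// decompress_vec writes into the Vec's spare capacity; an empty Vec passed to
// decompress() derefs to a zero-length slice and receives no output.
let mut buf = Vec::with_capacity(4096);
let resp = decompress.decompress_vec(B, &mut buf, FlushDecompress::None);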

// Check for errors and such here

Does this test make sense?

ryanvade commented 4 years ago

Also, on this line https://github.com/alexcrichton/flate2-rs/blob/master/src/ffi/rust.rs#L53, new_boxed is being used instead of new_boxed_with_window_bits. Basically the window_bits are unused, not to mention that in new_boxed_with_window_bits the window_bits type is i32 but in Inflate::make it's u8.

Edit: I noticed this is for the Rust backend, not the C backend.

alexcrichton commented 4 years ago

For the Rust backend that's expected, because that's translated from miniz, which doesn't support different values of window bits. Only the zlib C backends support different values of window bits, which is why the public constructor is also gated behind that feature.

ryanvade commented 4 years ago

Indeed. In that case, according to https://github.com/madler/zlib/blob/cacf7f1d4e3d44d871b605da3b647f07d718623f/zlib.h#L832-L882, the C backend should support window bits between -15 and 47. Perhaps that will fix my issues.
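
As a rough illustration of what the different window-bits ranges select, here is a minimal sketch using Python's standard zlib module (the module from the earlier Python snippet), not flate2 itself. The names payload, compress_with and try_decode are made up for this example, and it assumes the interpreter is linked against a zlib that follows the zlib.h description linked above.

```python
import zlib

payload = b"partial decompression example " * 64


def compress_with(wbits):
    """Compress payload using the container format selected by wbits."""
    comp = zlib.compressobj(wbits=wbits)
    return comp.compress(payload) + comp.flush()


zlib_stream = compress_with(15)    # zlib header and Adler-32 trailer
raw_stream = compress_with(-15)    # raw DEFLATE, no header or trailer
gzip_stream = compress_with(31)    # 16 + 15: gzip header and CRC-32 trailer


def try_decode(data, wbits):
    """Return True if data decodes back to payload under this wbits value."""
    try:
        return zlib.decompress(data, wbits) == payload
    except zlib.error:
        return False


results = {
    "zlib with 15": try_decode(zlib_stream, 15),
    "gzip with 15": try_decode(gzip_stream, 15),
    "raw with -15": try_decode(raw_stream, -15),
    "gzip with 31": try_decode(gzip_stream, 31),
    "zlib with 31": try_decode(zlib_stream, 31),
    "gzip with 47": try_decode(gzip_stream, 47),
    "zlib with 47": try_decode(zlib_stream, 47),
    "raw with 47": try_decode(raw_stream, 47),
}

for name, ok in results.items():
    print(f"{name}: {'ok' if ok else 'rejected'}")
```

In this sketch only the value 47 (32 + 15) accepts both the gzip and the zlib stream, which is why it is the convenient setting when the container format is not known in advance.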

ryanvade commented 4 years ago

Allowing window bits of 47 removes the deflate decompression error I was receiving, but I end up with an empty output buffer.

oyvindln commented 4 years ago

The only thing decompressing with a smaller window size means in practice is that the decompressor will error out if the data is compressed with a larger window size and has matches that are outside the window the decompressor used. It will affect compression since it limits how far back the compressor will look for matches. zlib is old, and in a very memory-starved environment it made sense to have the option to have a smaller window using less memory if a 32k buffer was too much, but other implementations like miniz didn't bother with implementing that. Adding extra window_bits values won't help you decode partial streams, it's just another way of telling zlib what headers to look for.

What you want to do for partial decompression is to add some parameter to skip CRC or zlib validation when ending decompression. I think read_to_string will try to read until failure, so maybe you can do it with the current library by making read() calls manually instead (similar to what your Python implementation does).
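
The difference between reading incrementally and reading until the end can be sketched with Python's standard zlib module, mirroring the earlier Python snippet rather than flate2. The data, the variable names, and the choice of cutting the stream in half are all hypothetical stand-ins for an S3 part.

```python
import gzip
import zlib

# Build a gzip stream locally to stand in for the remote object.
original = b"".join(b"line %06d of the object\n" % i for i in range(5000))
compressed = gzip.compress(original)

# Pretend only the first half of the compressed bytes was fetched.
first_part = compressed[: len(compressed) // 2]

# A streaming decompressor returns whatever it can decode from the bytes
# seen so far. It never reaches the gzip trailer, so the CRC-32 and length
# checks are simply not performed.
dec = zlib.decompressobj(16 + zlib.MAX_WBITS)
partial = dec.decompress(first_part)

# A one-shot call expects a complete stream and rejects the same bytes.
try:
    zlib.decompress(first_part, 16 + zlib.MAX_WBITS)
    one_shot_failed = False
except zlib.error:
    one_shot_failed = True

print(len(partial), "bytes recovered; stream finished:", dec.eof)
print("one-shot decompress failed:", one_shot_failed)
```

Here partial is a prefix of the original data, while the one-shot call raises an error because the stream ends early. This is the same situation as calling read_to_string on a truncated input.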

ryanvade commented 4 years ago

I'll try that solution. Setting the window bits to a value above 15 is really only useful for 15 + 16, which forces gzip only, and for 15 + 32, which is nice for automatic header detection.

ryanvade commented 4 years ago

I was able to solve this with read calls on the Decompress struct.