rust-lang / flate2-rs

DEFLATE, gzip, and zlib bindings for Rust
https://docs.rs/flate2
Apache License 2.0
891 stars 158 forks source link

Replace interior Reader without resetting decompress data #276

Open jnbooth opened 3 years ago

jnbooth commented 3 years ago

If I'm parsing a continuous encoded stream that is (unfortunately) divided up across several successive inputs, my only option appears to be using the Decompress object directly, which is a huge hassle without access to the zio:: functions.

It seems like it would be easy to create a "replace" method identical to the various "reset" methods, but with the reset_decoder_data(self); call removed. That way, I could switch out one Read object for the next while preserving the stream state.

Gelbpunkt commented 2 years ago

I would love this exact same feature but for ZlibEncoder - right now I can't reuse a ZlibEncoder object, but using Compress directly (https://gist.github.com/Gelbpunkt/44a6fb4cfe15f15e4c3757416b7021b1) doesn't work when I reuse the object. It does work fine when using reset on the Compress object before decompressing, but that seems to not preserve encoding parameters. Zlib gives invalid stored block lengths when using the same decoder (in python) to decode multiple payloads that were created by encoding and then resetting the Compress object.

Byron commented 12 months ago

I am probably not following on some intricacies of the problem, but wonder if it's possible to create a custom Write/Read that is Clone and uses shared ownership internally. That way it would be possible to keep a handle to it after passing it to types in this crate, and use interior mutability to swap out any state as you see fit.

I'd find that, in theory, more suitable as first step towards solving this particular problem as it's much less involved. Of course, it also delegates the hard parts of managing the internal decompressor state correctly which certainly wouldn't be trivial, but should be possible.

nc-x commented 3 months ago

Tried swapping out the internal reader for GzDecoder using interior mutability but it does not work.

I don't think this can be done without additional support from the library itself.

Byron commented 3 months ago

Couldn't a multi-member (continuous) stream be accomplished by a MultiGzDecoder.

I think what would really help here is an example program with real (but minimal) input data.

nc-x commented 3 months ago

AFAICT, MultiGzDecoder is for gzip having multiple files inside it which is not the usecase here. The usecase here, for example, is to download huge gzip files using range requests. Reqwest's Response object implements Read, so it can be used with GzDecoder or MultiGzDecoder, but once a given response is decoded you make another range request to get the next piece of data and decode it in a streaming fashion, probably because the data is huge and you want to filter out some stuff or parse it directly etc.

But because every range request yields a different response, an ability to swap the internal reader is required while keeping the internal state sane.

I do have some code for the same but I don't have any public gzip url where i can try this on. Let me see if I can do something about the example code.

the8472 commented 3 months ago

However, when the first reader is completely read, you get an error: corrupt deflate stream (because the stream reader only had partial data).

Perhaps handling this as a WouldBlock error like this test does could work?

https://github.com/rust-lang/flate2-rs/blob/1a0daec607455f30651674675abb01715586f4d1/src/gz/read.rs#L332-L350

Then on error the next chunk can be filled in.

jongiddy commented 3 months ago

Either:

  1. Create your own struct that implements Read. When it has no data it performs the next Range request to fill its buffer. It only returns 0 when there are no more ranges to request. Submit this struct as the reader to read::GzDecoder::new. (I believe this is the solution referred to in Byron's comment above). It is probably more efficient to implement BufRead and use bufread::GzDecoder.
  2. Use a write::GzDecoder. Create a loop to read Range requests and write the returned data to the write::GzDecoder. This is the only solution if the Range requests use async code.
nc-x commented 3 months ago

I have created a repo with an example program: https://github.com/nc-x/flate-276

Create your own struct that implements Read. When it has no data it performs the next Range request to fill its buffer. It only returns 0 when there are no more ranges to request. Submit this struct as the reader to read::GzDecoder::new. (I believe this is the solution referred to in https://github.com/rust-lang/flate2-rs/issues/276#issuecomment-1717083083 above). It is probably more efficient to implement BufRead and use bufread::GzDecoder.

The issue is that when you try reading from GzDecoder, it will simply go past the end of the partial data which it was provided and fail, which is expected because technically the input is incomplete and hence invalid.

So, basically the first reading itself fails and you do not even get to the point where you can swap underlying readers or return 0 etc. AFAICT end users of GzDecoder cannot do anything about this and there needs to be a way tell GzDecoder that it will be given partial data, so it should't fail.

Have not tried your 2nd solution and am busy for a couple of days now but presumably that would have the same issue as well?

jongiddy commented 3 months ago

I made a pull request to your repo to demonstrate the correct solution using a Read implementation.

The Read implementation of Download turns the chunked range requests into a single Read stream returning the compressed data. The GzDecoder is outside Download and only has its reader set once in the new method. It never replaces the reader.

nc-x commented 2 months ago

thanks