rust-lang / flate2-rs

DEFLATE, gzip, and zlib bindings for Rust
https://docs.rs/flate2
Apache License 2.0
862 stars 159 forks source link

Implemented total_in() and total_out() for GzDecoder, GzEncoder and MultiGzDecoder #382

Open fedy-cz opened 8 months ago

fedy-cz commented 8 months ago

(simple forwarding)

fedy-cz commented 8 months ago

Well. I adapted one of the test from zlib:

#[test]
    fn total_in() {
        let mut real = Vec::new();
        let mut w = write::GzEncoder::new(Vec::new(), Compression::default());
        let v = crate::random_bytes().take(1024).collect::<Vec<_>>();
        for _ in 0..200 {
            let to_write = &v[..thread_rng().gen_range(0..v.len())];
            real.extend(to_write.iter().copied());
            w.write_all(to_write).unwrap();
        }
        let mut result = w.finish().unwrap();

        let result_len = result.len();

        for _ in 0..200 {
            result.extend(v.iter().copied());
        }

        let mut r = read::GzDecoder::new(&result[..]);
        let mut ret = Vec::new();
        r.read_to_end(&mut ret).unwrap();
        assert_eq!(ret, real);
        assert_eq!(r.total_in(), result_len as u64);
    }

... and sure enough it fails.

The question is if it's initial assumptions are correct. Should the total_in() report the current position in the underlying Reader or the compressed data-stream itself (without CRC)? The doc comment is unclear:

    /// Returns the number of bytes that have been read into this compressor.
    ///
    /// Note that not all bytes read from the underlying object may be accounted
    /// for, there may still be some active buffering.

We should decide, document it and then we can write the relevant tests and adapt the code if necessary.

fedy-cz commented 8 months ago

If we decided to indicate the BufRead position (to make the test posted above pass as is) then GzHeaderParser would need to store the header size somewhere (which is currently not the case).

Byron commented 8 months ago

Thanks for giving it a shot!

I would have expected that total_in and total_out is solely something to be forwarded from the underlying implementation, which should keep track of it all by itself, and would not expect it to be emulated by asking the input stream for its position instead.

Remembering my initial question in the related issue #381 as to why these methods are missing, then maybe this is the answer: Straightforward forwarding, for yet to be discovered reason, isn't possible or working as expected.

fedy-cz commented 8 months ago

As I described in the note above (https://github.com/rust-lang/flate2-rs/pull/382#issuecomment-1763450646), it's not all hopeless :smile:

First let me state a few facts:

  1. It is (so far) not an existing API for the Gz* instances and not part of any generic interface (trait) so there is no risk of breaking existing code.
  2. That being said, it is probably not wise to introduce widely different behavior to a similarly named methods within one library.
  3. While (in my opinion) it is not currently fully stated in the documentation, the current implementation of those functions for both zlib::* and deflate::* represent the (relative) position increase in the input/output stream
  4. Ideally you would be able to get this info without depending on this library at all. For example the commonly used BufReader implements the Seek trait with it's stream_position method, but due to some unfortunate decisions in it's design it actually depends on the seek() metod by default and henceforth requires a mutable borrow :disappointed: (some initial attempts to try to rectify that here: https://github.com/rust-lang/rust/issues/59359#issuecomment-1763475845) . While nothing prevents you from implementing an adapter impl Read trait and counting the data passing through yourself, it's another layer of indirection and not very convenient.
  5. While this simple forwarding implementation doesn't meet the expectation in relation to the "contracts" offered by Zlib and Deflate variants, it could still be immensely useful and one could argue it actually offers a better estimation of the compression ratio (by not including the header & footer).

What would I suggest:

  1. Expose these (forwarded) values under some new (descriptive and documented) function names.
    Alternatively: Expose a borrow of the contained DeflateDecoder / DeflateEncoder to the users and let them call these functions directly, but that is probably not very clean (exposing a private member) and also less discover-able.
  2. Implement Gzip header/footer size tracking/reporting. From initial investigation, at least the header is dynamic size (can contain filename ...) Note: It would introduce additional memory requirements to the Decoder / Encoder (probably 2 x u64). Not sure how much these are critical.
  3. Implement the relevant tests and the total_in / total_out methods as expected (using the header / footer size data)

This could be done incrementally and at least point no. 1 shouldn't be too controversial.

What do you think?

Byron commented 8 months ago

Despite being the most active right now, and truth to be told, I feel very uncomfortable to make any API decisions in this crate. Maybe it's possible for you to figure out who wrote what or maybe even most of it, and reach out to them via GitHub or email when lacking a user name or linked PR?

Otherwise I'd actually prefer to just expose the actual compressor (as answer to 1.), but I also think there was probably a good reason not to expose it.

From my own experience (at gitoxide), I never managed to use the high-level abstractions but went right to the underlying Decompressor instances to steer them directly. The question here would be if a higher-level API could also be used if the documentation would be improved or if certain utility functions would be added.

With that said, to be more comfortable with making such decision, I would definitely have to understand what the new API would be good for. Tests can help, and so could full-blown example programs. Thanks for your understanding.

Byron commented 2 months ago

Is this PR still active?