rust-lang / flate2-rs

DEFLATE, gzip, and zlib bindings for Rust
https://docs.rs/flate2
Apache License 2.0
895 stars 161 forks source link

flate2 ZLibDecoder reads too much data from files. #338

Open kallisti5 opened 1 year ago

kallisti5 commented 1 year ago

The flate2 ZLibDecoder seems to read too much data and advance file pointers too far.

Parsing a raw file filled with individual hunks of zlib compressed data:

I can validate these chunks:

[kallisti5@eris hpkg-rs]$ ls -lah test1.uncompressed 
-rw-r--r-- 1 kallisti5 users 64K Feb 28 09:19 test1.uncompressed
[kallisti5@eris hpkg-rs]$ ls -lah test2.uncompressed 
-rw-r--r-- 1 kallisti5 users 64K Feb 28 09:20 test2.uncompressed

However.. when I try to decode these chunks with flate2 / ZlibDecoder..

       /// Inflate the heap section of a hpkg for later processing
        /// XXX: This will likely need reworked... just trying to figure out what's going on
        fn inflate_heap(&mut self) -> Result<usize, Box<dyn error::Error>> {
                let header = self.header.as_ref().unwrap();
                let filename = self.filename.as_ref().unwrap();

                println!("header: Heap chunk size: {}", header.heap_chunk_size);
                println!("header: Heap compressed size: {}", header.heap_size_compressed);
                println!("header: Heap uncompressed size: {}", header.heap_size_uncompressed);
                println!("header: Heap compression: {}", header.heap_compression);

                let mut f = File::open(filename)?;

                let mut pos = header.header_size as u64;
                f.seek(SeekFrom::Start(pos))?;

                while pos < header.heap_size_compressed {
                        println!("Seek from file {}, heap {}", pos, pos - header.header_size as u64);
                        let mut reader: Box<dyn Read> = match header.heap_compression {
                                0 => Box::new(&f),
                                1 => Box::new(ZlibDecoder::new(&f)),
                                2 => Box::new(zstd::stream::read::Decoder::new(&f)?),
                                _ => return Err(From::from(format!("Unknown hpkg heap compression: {}", header.heap_compression)))
                        };
                        let mut buffer = vec![0; header.heap_chunk_size as usize];
                        reader.read_exact(&mut buffer)?;
                        self.heap_data.push(buffer);
                        pos = (&f).stream_position()?;
                }
                Ok(0)
        }
---- package::tests::test_package_load_valid stdout ----
header: Heap chunk size: 65536     (uncompressed yo)
header: Heap compressed size: 501432
header: Heap uncompressed size: 1988947
header: Heap compression: 1
Seek from file 80, heap 0
Seek from file 32848, heap 32768
ERROR: corrupt deflate stream
thread 'package::tests::test_package_load_valid' panicked at 'assertion failed: false', src/package.rs:297:17
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

read_exact produces 64k uncompressed as expected, but the file pointer is moved 32k in the file vs the expected 22057 bytes. (22057 marks the "end" of the compressed data stream and the start of the next 0x78, 0xDA)

kallisti5 commented 1 year ago

I feel like the solution is once ZLibDecoder reads the final 32k chunk, it should "back the source reader to the end of the zlib stream"?

This would enable users to know where the last compressed stream ended within a file.

kallisti5 commented 1 year ago

https://github.com/rust-lang/flate2-rs/blob/main/src/deflate/read.rs#L161 feels like the source of the guaranteed 32k read from the files.

jongiddy commented 1 year ago

I think this is the same problem as #367 except that that issue is for gzip. Essentially, this is expected behavior for the read interfaces that actually wrap the Read type in a new std::io::BufReader for each decoder.

To fix, wrap the File in a BufReader once and you can then pass it to multiple bufread::ZlibDecoder instances.

This will, however, also make stream_position incorrect, even if you can access it, so you will need a different way to know when to terminate the loop.