rust-lang / flate2-rs

DEFLATE, gzip, and zlib bindings for Rust
https://docs.rs/flate2
Apache License 2.0
929 stars 163 forks source link

Gzip round-trip fails #357

Closed drtconway closed 1 year ago

drtconway commented 1 year ago

I'm fairly new with Rust, though I'm an experienced programmer, so it may be that I've failed to understand some rust-ism correctly.

I believe the following code should compress correctly compress the vector of integers into a string and then decompress them. Various cut down examples of this code succeed, so I'm a bit perplexed as to why it fails on this data.

I'm suspecting my code is wrong, not the library, but if so, I can't see why.

Also from my Cargo.toml:

flate2 = "1.0.26"
use std::io::{Read, Write};

pub fn read_u64<T>(src: &mut T) -> std::io::Result<u64>
where
    T: Read,
{
    let mut buf: [u8; 8] = [0, 0, 0, 0, 0, 0, 0, 0];
    src.read_exact(&mut buf)?;
    let x: u64 = u64::from_be_bytes(buf);
    Ok(x)
}

pub fn write_u64<T>(out: &mut T, x: u64) -> std::io::Result<()>
where
    T: Write,
{
    let buf: [u8; 8] = x.to_be_bytes();
    out.write(&buf)?;
    Ok(())
}

pub fn read_u64_vec<T>(src: &mut T) -> std::io::Result<Vec<u64>>
where
    T: Read,
{
    let n = read_u64(src)? as usize;
    println!("read n={}", n);
    let mut v: Vec<u64> = Vec::new();
    v.reserve(n);
    for _i in 0..n {
        let x = read_u64(src)?;
        v.push(x);
    }
    Ok(v)
}
pub fn write_u64_vec<T>(out: &mut T, xs: &[u64]) -> std::io::Result<()>
where
    T: Write,
{
    let n: u64 = xs.len() as u64;
    println!("writing n={}", n);
    write_u64(out, n)?;
    for x in xs {
        write_u64(out, *x)?;
    }
    Ok(())
}

#[cfg(test)]
mod tests {
    use std::{io::{BufReader, Cursor}, fs::File};

    use flate2::{read::MultiGzDecoder, write::GzEncoder, Compression};

    use super::*;

    fn load_data() -> Vec<u64> {
        let mut f = File::open("integers.txt").expect("open failed");
        let mut buf = String::new();
        f.read_to_string(&mut buf).expect("read failed");
        let xs: Vec<u64> = buf.lines().map(|s| s.parse::<u64>().expect("not an integer")).collect();
        xs
    }

    #[test]
    fn test_it() {
        let xs = load_data();

        let mut t = GzEncoder::new(Vec::new(), Compression::default());
        write_u64_vec(&mut t, &xs).expect("write vec failed");
        t.flush().expect("flush failed");
        let s: Vec<u8> = t.finish().unwrap();

        let v = Cursor::new(s);
        let w = MultiGzDecoder::new(v);
        let mut r = BufReader::new(w);
        let ys = read_u64_vec(&mut r).expect("read vec failed");
        assert_eq!(xs.len(), ys.len());
    }
}

integers.txt

Byron commented 1 year ago

Without knowing what the problem is, I suspect the MultiGzDecoder to be unnecessary and would rather use a GzDecoder instead. If that still doesn't work, it should be possible to replace all usages of Gz* with a Vec to which to write and from which to read. That should make it possible to validate the data written more easily and see if that works as expected. If not, that should help to figure out where the issue is.

jongiddy commented 1 year ago

In the function write_64 use out.write_all(&buf)?; in place of out.write(&buf)?; to fix the error.

write returns how many bytes were successfully written. If this is less than the number of bytes sent, you need to call write again with the unwritten part of the buffer. write_all handles this looping for you.

drtconway commented 1 year ago

Thanks. I was pretty confident it was an error on my part, and glad to be told so! :)

(BTW, the MultiGzDecoder is a result of this being cut down from larger code that has to deal with reading various bioinformatics files which are compressed with bgzip and which have multiple streams.)