Filename header missing and cannot read past the first couple lines of a gzip file

rtyler commented 3 years ago

I was working with a dump of wikipedia data: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-abstract.xml.gz (~760M) and found that flate2 could not read the file properly.

When I gunzip and then re-gzip the file into a new archive (e.g. gunzip -c enwiki-latest-abstract.xml.gz | gzip > repacked.xml.gz), the Rust code reads the repacked file properly.

I'm really not sure what magic bytes flate2 might not be handling properly; as far as I can tell it's a well-formed gzip file :shrug:

My code looks something like this:

        use std::fs::File;
        use std::io::{BufRead, BufReader};

        use flate2::read::GzDecoder;

        let file = File::open(gzip_xml)?;
        let gz = GzDecoder::new(BufReader::new(file));
        println!("header: {:?}", gz.header());
        let mut reader = BufReader::new(gz);

        // Print the first few lines of the decompressed stream.
        let mut count = 0;
        loop {
            if count > 10 {
                break;
            }
            count += 1;
            let mut line = String::new();
            let len = reader.read_line(&mut line)?;
            println!("{} - {}", len, line);
        }

What I found interesting is that in the original downloaded file, the "header" that I printed had an empty filename field, whereas the re-gzipped archive has it populated. :confused:
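
For context on the empty filename field: in the gzip format (RFC 1952) the original filename is an optional header field, present only when the FNAME bit of the flags byte is set, so a missing filename does not by itself make a file malformed. The following is a minimal std-only sketch of reading that field from raw header bytes. The function name `gzip_filename` and the sample byte arrays are made up for illustration, and this is not how flate2 implements its header parsing.

```rust
/// Flag bits from the FLG byte of a gzip member header (RFC 1952).
const FEXTRA: u8 = 0x04;
const FNAME: u8 = 0x08;

/// Hypothetical helper: returns the original filename stored in a gzip
/// header, or `Ok(None)` when the FNAME flag is not set.
fn gzip_filename(bytes: &[u8]) -> Result<Option<String>, &'static str> {
    // Fixed part: magic (2), CM (1), FLG (1), MTIME (4), XFL (1), OS (1).
    if bytes.len() < 10 || bytes[0] != 0x1f || bytes[1] != 0x8b {
        return Err("not a gzip header");
    }
    let flags = bytes[3];
    let mut pos = 10;

    // The optional "extra" field comes before the filename and must be skipped.
    if flags & FEXTRA != 0 {
        let len_bytes = bytes.get(pos..pos + 2).ok_or("truncated extra field")?;
        let xlen = u16::from_le_bytes([len_bytes[0], len_bytes[1]]) as usize;
        pos += 2 + xlen;
    }

    if flags & FNAME == 0 {
        return Ok(None);
    }

    // The filename is a zero-terminated string (Latin-1 per the RFC;
    // decoded lossily as UTF-8 here to keep the sketch short).
    let rest = bytes.get(pos..).ok_or("truncated header")?;
    let end = rest.iter().position(|&b| b == 0).ok_or("unterminated filename")?;
    Ok(Some(String::from_utf8_lossy(&rest[..end]).into_owned()))
}

fn main() {
    // Header without FNAME: magic, CM=8 (deflate), FLG=0, MTIME=0, XFL=0, OS=3 (Unix).
    let no_name: [u8; 10] = [0x1f, 0x8b, 8, 0, 0, 0, 0, 0, 0, 3];

    // Same header with FNAME set, followed by a zero-terminated name.
    let mut with_name: Vec<u8> = vec![0x1f, 0x8b, 8, FNAME, 0, 0, 0, 0, 0, 3];
    with_name.extend_from_slice(b"test.log\0");

    println!("{:?}", gzip_filename(&no_name));
    println!("{:?}", gzip_filename(&with_name));
}
```

A header with the flag cleared corresponds to the `filename: None` case in the debug output of `gz.header()`.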

jszwedko commented 3 years ago

I ran into this same issue using gzip'd logs created by AWS's load balancing logging to S3.

In my case, I see the same behavior:

- using gunzip and then gzip to repack results in the file being processed correctly
- the file is missing the filename header

header: Some(GzHeader { extra: None, filename: None, comment: None, operating_system: 3, mtime: 0 })

In my case, it is able to correctly read the first two lines, but then any subsequent reads just return 0 bytes.

I was able to manually craft a gzip file with no filename in the header via gzip --no-filename, but this was handled fine so I think it is something else specific to the input file. The file has sensitive information in it or I'd post it here.

In case this is helpful:

➜  Downloads gzip -v -l -t test.log.gz
method  crc     date  time    compressed uncompressed  ratio uncompressed_name
defla 80219a5c Apr  9 11:58     17380300        47715 -99.9% test.log
test.log.gz:      NOT OK
➜  Downloads gzip --version
Apple gzip 287.100.2

Oddly it says "NOT OK" when I use -l but just -v -t results in "OK". Regardless gunzip is able to decompress it fine. The ratio being negative is quite odd.

Interestingly, gzip on Debian gives me something different:

root@37e95f7c8a2f:/# gzip -l -v -t /tmp/test.log.gz
method  crc     date  time           compressed        uncompressed  ratio uncompressed_name
defla 80219a5c Apr  9 15:58            17380300               47715 -36325.2% /tmp/test.log
root@37e95f7c8a2f:/# gzip --version
gzip 1.6
Copyright (C) 2007, 2010, 2011 Free Software Foundation, Inc.
Copyright (C) 1993 Jean-loup Gailly.
This is free software.  You may redistribute copies of it under the terms of
the GNU General Public License <http://www.gnu.org/licenses/gpl.html>.
There is NO WARRANTY, to the extent permitted by law.

Written by Jean-loup Gailly.

Again the ratio makes no sense.
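
One plausible reading of the odd ratio, assuming these gzip versions take the "uncompressed" figure for -l from the ISIZE field in the last four bytes of the file rather than by decompressing: for a file made of several concatenated gzip members, that field only covers the final member, while the compressed size covers the whole file. The sketch below is std-only; the function names are made up, and the ratio formula is an approximation of what gzip prints (it ignores header bytes).

```rust
/// Reads ISIZE, the little-endian u32 stored in the last four bytes of a
/// gzip file. For a multi-member file this describes only the last member.
fn trailer_isize(file_bytes: &[u8]) -> Option<u32> {
    let start = file_bytes.len().checked_sub(4)?;
    let tail = file_bytes.get(start..)?;
    Some(u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]))
}

/// Approximate "ratio" column: the percentage of space saved, which goes
/// negative when the compressed size exceeds the reported uncompressed size.
fn ratio_percent(compressed: u64, uncompressed: u64) -> f64 {
    100.0 * (1.0 - compressed as f64 / uncompressed as f64)
}

fn main() {
    // Sizes taken from the listing above: 17380300 compressed, 47715 uncompressed.
    let ratio = ratio_percent(17_380_300, 47_715);
    println!("{:.1}%", ratio);
}
```

With the sizes from the listing, this comes out at roughly the -36325% that the Debian gzip reports, which is consistent with the uncompressed figure describing only a small part of the file.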

jszwedko commented 3 years ago

I was able to get a test file created by AWS's ALB logging that fails. The behavior I see is that it is able to read the first 8 lines, but there are 12 lines in the file.

071959437513_elasticloadbalancing_us-east-1_app.jesse-test-balancer.622bd5733e76cea4_20201021T2330Z_52.86.86.103_42ctvacv.log.gz

Source I used to test:

fn main() {
    let gzip_xml = "/Users/jesse.szwedko/Downloads/071959437513_elasticloadbalancing_us-east-1_app.jesse-test-balancer.622bd5733e76cea4_20210319T0010Z_54.161.48.252_3m12q1pi.log.gz";
    use std::io::BufRead;
    use std::io::BufReader;
    let file = std::fs::File::open(gzip_xml).unwrap();
    let gz = flate2::read::GzDecoder::new(BufReader::new(file));
    println!("header: {:?}", gz.header());
    let mut reader = BufReader::new(gz);

    let mut count = 0;
    loop {
        count += 1;
        let mut line = String::new();
        match reader.read_line(&mut line) {
            Ok(len) => {
                if len == 0 {
                    break;
                }
                print!("{} - {}", len, line);
            }
            Err(err) => {
                println!("{}", err);
                break;
            }
        }
    }
}

rtyler commented 3 years ago

@jszwedko Yes! That's the exact same behavior I saw: the first couple of lines read in and then null bytes! Thank you for taking the time to come up with a demonstration case that doesn't require a full wikipedia data dump :smile_cat:

alexcrichton commented 3 years ago

IIRC wikipedia entries need to be decoded with MultiGzDecoder, but I'm not sure if there's something else special about wikipedia entries.

jszwedko commented 3 years ago

@alexcrichton hmm, this seems to be the case for me too. MultiGzDecoder causes it to print all of the lines. Is it safe to just always use MultiGzDecoder if we aren't sure ahead of time if the file is multipart or not?

alexcrichton commented 3 years ago

I believe so, yes.

rtyler commented 3 years ago

Very interesting @alexcrichton, thanks for pointing this out. If it's safe to use MultiGzDecoder for any type of `.gz` file, why bother with two different decoder structs?

alexcrichton commented 3 years ago

It's a slight theoretical performance loss, in that the decoder must check, whenever a stream finishes, whether the next bytes start a new stream.

rtyler commented 3 years ago

I don't see a need for this issue any longer. Closing.

rust-lang / flate2-rs

Filename header missing and cannot read past the first couple lines of a gzip file #265