GzDecoder stops decoding file toward the start.

nschuessler commented 1 year ago

While trying to decode the Common Crawl index files, GzDecoder stops after about 1.8M of input from a 690M file. The file is too large to read into memory with .read_to_end.

If you download the file and use gzip -d cdx-00010.gz the whole file is expanded. How do you use GzDecoder to get the same behavior as gzip -d?

The code exits early because decoder.read returns 0 bytes, whereas reading directly from the underlying stream (input_stream.read) would continue. So I assume there is some format issue in the file that GzDecoder does not handle and gzip does. It prints 'Read 0 x' before exiting, so I assume there are no errors.

Thanks

Example input: https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2023-06/indexes/cdx-00010.gz

Example code:

 let mut file = File::open("cdx-00010.gz").expect("Could not open index file.");
 decode_to_stream(&mut file);

use std::fs::File;
use std::io::prelude::*;
use flate2::read::GzDecoder;

pub fn decode_to_stream(input_stream: &mut dyn Read)
{
    let mut output_file = File::create("decoded").expect("Could not create output file.");
    let mut decoder = GzDecoder::new(input_stream);
    let mut buffer = [0; 65536];
    let mut total_read = 0;
    // Note: `while let Ok(..)` also ends the loop silently on a read error.
    while let Ok(read_size) = decoder.read(&mut buffer[..])
    {
        println!("Read {} ({}).", read_size, total_read);
        // A read of 0 bytes means the decoder reports end of stream.
        if read_size == 0 {
            break;
        }

        // write_all writes the whole chunk; a plain write may write only part of it.
        output_file.write_all(&buffer[..read_size]).expect("Could not write output.");
        total_read += read_size;
    }
}

nschuessler commented 1 year ago

So it appears this is a multi-member gzip file, which requires MultiGzDecoder.

Byron commented 1 year ago

Sorry for the late reply, and thanks for sharing!

We are currently working on improving the documentation around the usage of GzDecoder and MultiGzDecoder in the hopes that this will be less of a problem in future.

Closing, as this issue is not directly actionable.

rust-lang / flate2-rs

GzDecoder stops decoding file toward the start. #339