rust-lang / flate2-rs

DEFLATE, gzip, and zlib bindings for Rust
https://docs.rs/flate2
Apache License 2.0

Only one line being read in a 100GB+ gzip file (Wikidata dump) #307

Closed dipstef closed 2 years ago

dipstef commented 2 years ago

Hi all!

Apologies if I missed something here and the error is on my side (although this is no different from any standard usage of this library). I am reading a Wikidata dump line by line. The dump is one giant JSON array, and only the first line, which contains the opening square bracket, is returned:

The dump is the following: https://dumps.wikimedia.org/wikidatawiki/entities/20220606/wikidata-20220606-all.json.gz

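use std::fs::File;
use std::io::{BufRead, BufReader};
use flate2::read::GzDecoder;

let path = ".../wikidata-20220606-all.json.gz";
let f = File::open(path).expect("file not found");
let reader = BufReader::new(GzDecoder::new(f));

// Print every decoded line; only the first line ("[") is ever printed.
reader.lines().for_each(|l| {
    println!("{}", l.unwrap());
});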

Switching to a loop-based version, the second call to read_line returns 0 bytes, which is consistent with the behaviour of the lines iterator.

let mut reader = BufReader::new(GzDecoder::new(f));

let mut buf = String::new();
// read_line returns Ok(0) at end of input; the loop also stops on the first Err.
while let Ok(n) = reader.read_line(&mut buf) {
    if n == 0 {
        break;
    }
    println!("{}", buf);
    buf.clear();
}

No issues when reading the same file with gzcat or a Python script.

Any idea on how to troubleshoot this?

Thanks in advance for your help!

alexcrichton commented 2 years ago

I believe for wikipedia dumps you need to use MultiGzDecoder

dipstef commented 2 years ago

Cheers, that did it!

My follow-up question seems to be already addressed in this issue:

https://github.com/rust-lang/flate2-rs/issues/178

So I will rely on MultiGzDecoder instead when reading arbitrary files.

Cheers,