rust-lang / flate2-rs

DEFLATE, gzip, and zlib bindings for Rust
https://docs.rs/flate2
Apache License 2.0

Unable to properly read Gzipped access.log file #301

Closed: dmilith closed this issue 1 year ago

dmilith commented 2 years ago

To make a long story short… I used the code examples to load text from an Nginx gzipped log file… but 3k lines of text are completely gone after loading it via GzDecoder.

So I wrote a test case for this issue… and it confirmed that the output is not complete (missing 3k lines). After removing the GzDecoder and loading the plaintext access.log, the whole input is fine.

#[test]
fn decode_file_test() {
    let access_log = Config::access_log();
    let decoded_log = File::open(&access_log).and_then(decode_file);
    let maybe_log = decoded_log
        .map(|input_contents| {
            String::from_utf8(input_contents)
                .unwrap_or_default()
                .split('\n')
                .filter_map(|line| {
                    if line.is_empty() || is_partial(line) {
                        None
                    } else {
                        Some(line.to_string())
                    }
                })
                .collect::<Vec<_>>()
        })
        .unwrap_or_default();

    let mut file = OpenOptions::new()
        .create(true)
        .write(true)
        .open("log1.log")
        .expect("log1.log has to be writable!");
    file.write_all(maybe_log.join("\n").as_bytes())
        .expect("Couldn't write log1.log file!!");

    assert_eq!(maybe_log.len(), 407166);
}

My decode file function basically does this:

fn decode_file(mut file: File) -> io::Result<Vec<u8>> {
    let mut buf = vec![];
    match file.read_to_end(&mut buf) {
        Ok(bytes_read) => {
            info!("Input file read bytes: {bytes_read}");
            let mut gzipper = GzDecoder::new(&*buf);
            let mut output_buf = vec![];
            gzipper.read_to_end(&mut output_buf)?;
            Ok(output_buf)
        }
        Err(err) => Err(err),
    }
}

The access.log is a standard Nginx access log with over 60 MiB of text inside.

tatsuya6502 commented 2 years ago

I ran into the same issue. I tried to decode a database dump from Wikipedia, but only the first line (7 bytes) was decoded by the read method of GzDecoder. The file is in XML format, has ~2 million lines, and is compressed with gzip.

Here is the reproducible code:

Cargo.toml

[dependencies]
flate2 = "1.0.24"
# flate2 = { version = "1.0.24", default-features = false, features = ["zlib-ng"] }

src/main.rs

use flate2::read::GzDecoder;
use std::{fs::File, io::prelude::*};

// Download the file from:
// https://dumps.wikimedia.org/enwiki/20220801/enwiki-20220801-abstract27.xml.gz
//
// The size of the file will be 20.4MiB.
//
const DATA_FILE: &str = "./data/enwiki-20220801-abstract27.xml.gz";

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open(DATA_FILE)?;
    let mut decoder = GzDecoder::new(file);
    let mut buf = vec![0u8; 1024];

    for _ in 0..5 {
        match decoder.read(&mut buf) {
            Ok(n) => println!("{} bytes: {:?}", n, String::from_utf8_lossy(&buf[..n])),
            Err(e) => {
                eprintln!("Error: {}", e);
                break;
            }
        }
    }

    Ok(())
}

Download the file from https://dumps.wikimedia.org/enwiki/20220801/enwiki-20220801-abstract27.xml.gz (20.4MiB), and place it in the data directory.

You can use the gunzip command to see the first few lines of the file:

$ gunzip -dkc data/enwiki-20220801-abstract27.xml.gz | head
<feed>
<doc>
<title>Wikipedia: Kalkandereh</title>
<url>https://en.wikipedia.org/wiki/Kalkandereh</url>
<abstract>Kalkandereh may refer to:</abstract>
<links>
<sublink linktype="nav"><anchor>All article disambiguation pages</anchor><link>https://en.wikipedia.org/wiki/Category:All_article_disambiguation_pages</link></sublink>
<sublink linktype="nav"><anchor>All disambiguation pages</anchor><link>https://en.wikipedia.org/wiki/Category:All_disambiguation_pages</link></sublink>
<sublink linktype="nav"><anchor>Disambiguation pages</anchor><link>https://en.wikipedia.org/wiki/Category:Disambiguation_pages</link></sublink>
<sublink linktype="nav"><anchor>Disambiguation pages with short descriptions</anchor><link>https://en.wikipedia.org/wiki/Category:Disambiguation_pages_with_short_descriptions</link></sublink>

When you run the program, you will get the following output, showing only the first line of the decompressed file:

7 bytes: "<feed>\n"
0 bytes: ""
0 bytes: ""
0 bytes: ""
0 bytes: ""

Note that the buffer can hold up to 1,024 bytes, so the read is not being limited to 7 bytes by the buffer size:

    let mut buf = vec![0u8; 1024];

    for _ in 0..5 {
        match decoder.read(&mut buf) {

I also tried the zlib-ng backend, but it did not solve the issue.

I found that the same program can decode other gzip files. For example, it can decode a gzip file I created from src/main.rs:

$ gzip -k src/main.rs
$ mv src/main.rs.gz data/
746 bytes: "use flate2::read::GzDecoder;\nuse std::{fs::File, io::prelude::*};\n\n// Download this file from:\n// https://dumps.wikimedia.org/enwiki/20220801/enwiki-20220801-abstract27.xml.gz\n//\n// The size of the file will be 20.4MiB.\n//\nconst DATA_FILE: &str = \"./data/enwiki-20220801-abstract27.xml.gz\";\n\nfn main() -> Result<(), Box<dyn std::error::Error>> {\n    let file = File::open(DATA_FILE)?;\n    let mut decoder = GzDecoder::new(file);\n    let mut buf = vec![0u8; 1024];\n\n    for _ in 0..5 {\n        match decoder.read(&mut buf) {\n            Ok(n) => println!(\"{} bytes: {:?}\", n, String::from_utf8_lossy(&buf[..n])),\n            Err(e) => {\n                eprintln!(\"Error: {}\", e);\n                break;\n            }\n        }\n    }\n\n    Ok(())\n}\n"
0 bytes: ""
0 bytes: ""
0 bytes: ""
0 bytes: ""

Environment

tatsuya6502 commented 2 years ago

Somebody told me that he was able to read all the contents of the Wikipedia database dump by replacing GzDecoder with MultiGzDecoder. I confirmed it myself.

The download page for the dumps does not say whether this .gz file contains multiple streams, but it does say that other .bz2 files have multiple streams. So it seems I should have used MultiGzDecoder, and my comment above is invalid.
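
For reference, here is a minimal sketch of the change against my repro program above; only the decoder type differs:

use flate2::read::MultiGzDecoder;
use std::{fs::File, io::prelude::*};

const DATA_FILE: &str = "./data/enwiki-20220801-abstract27.xml.gz";

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open(DATA_FILE)?;
    // MultiGzDecoder keeps decoding across gzip member boundaries,
    // while GzDecoder stops after the first member.
    let mut decoder = MultiGzDecoder::new(file);
    let mut out = Vec::new();
    decoder.read_to_end(&mut out)?;
    println!("total decoded bytes: {}", out.len());
    Ok(())
}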

@dmilith: If you get a chance, can you please check whether MultiGzDecoder can read your Nginx log files? Thanks!

oyvindln commented 2 years ago

Given that this seems to come up quite a bit, we might want to add a note about it to the GzDecoder docs.
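
As a rough sketch of what such a note might say (illustrative wording only, not the actual flate2 docs; the struct body is a placeholder):

use std::io::Read;

/// A gzip decoder.
///
/// Note: this decoder only decodes the first gzip member of the input.
/// Files that are a concatenation of several gzip members (for example
/// the Wikipedia dumps and rotated log files discussed above) will appear
/// truncated; use `MultiGzDecoder` to decode all members.
pub struct GzDecoder<R: Read> {
    inner: R, // placeholder field, just so this sketch compiles
}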