Closed dmilith closed 1 year ago
I ran into the same issue. I tried to decode a database dump from Wikipedia, but only the first line (7 bytes) was decoded by the read method of GzDecoder. The file is in XML format with ~2 million lines and is compressed with gzip.
Here is the reproducible code:
Cargo.toml
[dependencies]
flate2 = "1.0.24"
# flate2 = { version = "1.0.24", default-features = false, features = ["zlib-ng"] }
src/main.rs
use flate2::read::GzDecoder;
use std::{fs::File, io::prelude::*};

// Download the file from:
// https://dumps.wikimedia.org/enwiki/20220801/enwiki-20220801-abstract27.xml.gz
//
// The size of the file will be 20.4MiB.
const DATA_FILE: &str = "./data/enwiki-20220801-abstract27.xml.gz";

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open(DATA_FILE)?;
    let mut decoder = GzDecoder::new(file);
    let mut buf = vec![0u8; 1024];

    for _ in 0..5 {
        match decoder.read(&mut buf) {
            Ok(n) => println!("{} bytes: {:?}", n, String::from_utf8_lossy(&buf[..n])),
            Err(e) => {
                eprintln!("Error: {}", e);
                break;
            }
        }
    }

    Ok(())
}
Download the file from https://dumps.wikimedia.org/enwiki/20220801/enwiki-20220801-abstract27.xml.gz (20.4MiB), and place it in the data directory.
You can use the gunzip command to see the first few lines of the file:
$ gunzip -dkc data/enwiki-20220801-abstract27.xml.gz | head
<feed>
<doc>
<title>Wikipedia: Kalkandereh</title>
<url>https://en.wikipedia.org/wiki/Kalkandereh</url>
<abstract>Kalkandereh may refer to:</abstract>
<links>
<sublink linktype="nav"><anchor>All article disambiguation pages</anchor><link>https://en.wikipedia.org/wiki/Category:All_article_disambiguation_pages</link></sublink>
<sublink linktype="nav"><anchor>All disambiguation pages</anchor><link>https://en.wikipedia.org/wiki/Category:All_disambiguation_pages</link></sublink>
<sublink linktype="nav"><anchor>Disambiguation pages</anchor><link>https://en.wikipedia.org/wiki/Category:Disambiguation_pages</link></sublink>
<sublink linktype="nav"><anchor>Disambiguation pages with short descriptions</anchor><link>https://en.wikipedia.org/wiki/Category:Disambiguation_pages_with_short_descriptions</link></sublink>
When you run the program, you will get the following output, showing only the first line of the decompressed file:
7 bytes: "<feed>\n"
0 bytes: ""
0 bytes: ""
0 bytes: ""
0 bytes: ""
Note that the buffer can hold up to 1,024 bytes, so the short read is not caused by the buffer size:
let mut buf = vec![0u8; 1024];

for _ in 0..5 {
    match decoder.read(&mut buf) {
I also tried the zlib-ng backend, but it did not solve the issue.
I found that the same program can decode other gzip files. For example, it can decode a file I created from src/main.rs:
$ gzip -k src/main.rs
$ mv src/main.rs.gz data/
746 bytes: "use flate2::read::GzDecoder;\nuse std::{fs::File, io::prelude::*};\n\n// Download this file from:\n// https://dumps.wikimedia.org/enwiki/20220801/enwiki-20220801-abstract27.xml.gz\n//\n// The size of the file will be 20.4MiB.\n//\nconst DATA_FILE: &str = \"./data/enwiki-20220801-abstract27.xml.gz\";\n\nfn main() -> Result<(), Box<dyn std::error::Error>> {\n let file = File::open(DATA_FILE)?;\n let mut decoder = GzDecoder::new(file);\n let mut buf = vec![0u8; 1024];\n\n for _ in 0..5 {\n match decoder.read(&mut buf) {\n Ok(n) => println!(\"{} bytes: {:?}\", n, String::from_utf8_lossy(&buf[..n])),\n Err(e) => {\n eprintln!(\"Error: {}\", e);\n break;\n }\n }\n }\n\n Ok(())\n}\n"
0 bytes: ""
0 bytes: ""
0 bytes: ""
0 bytes: ""
Somebody told me that he was able to read all the contents of the Wikipedia database dump by replacing GzDecoder with MultiGzDecoder. I confirmed it myself.
The download page for the dumps does not say whether this .gz file has multiple streams, though it does note that some of the .bz2 files do. So it seems I should have used MultiGzDecoder, and my comment above is invalid.
@dmilith If you have a chance, can you please check whether MultiGzDecoder can read your Nginx log files? Thanks!
Given that this seems to come up quite a bit, we might want to add a note about it in the GzDecoder docs.
To make a long story short… I used the code examples to load text from a gzipped Nginx log file, but 3k lines of text were completely gone after loading it via GzDecoder. So I wrote a test case for this issue, and it confirmed that the output is incomplete (missing 3k lines). After removing GzDecoder and loading the plaintext access.log, the whole input is intact.
My decode file function basically does this:
The access.log is a standard Nginx access log, with over 60MiB of text inside.