Closed by wouterbeek 4 years ago
I managed to reproduce the error after about 13.6 GB in (the offending block starts at byte 13667225418). I'm still investigating the root cause.
Here's the block triggering the error, preceded by a bzip2 file header for faster debugging: reduced.zip
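For anyone who wants to build a similar reduced file from their own copy of the dump, here is a minimal sketch in Node.js. It is not the script used to produce reduced.zip. The helper name `extractBlock`, the read length, and the default compression level 9 are assumptions made for illustration. bzip2 blocks are bit-aligned in general, so the sketch only handles the case where the block happens to start on a byte boundary, and it throws otherwise.

```javascript
'use strict';
const fs = require('fs');

// 48-bit magic number that opens every bzip2 block (BCD digits of pi).
const BLOCK_MAGIC = Buffer.from([0x31, 0x41, 0x59, 0x26, 0x53, 0x59]);

// Hypothetical helper: copy up to `length` bytes starting at byte `offset`
// out of a large .bz2 file and prepend a fresh stream header ("BZh" + level)
// so that a decoder can start directly at the offending block.
function extractBlock(srcPath, destPath, offset, length, level = 9) {
  const fd = fs.openSync(srcPath, 'r');
  try {
    const chunk = Buffer.alloc(length);
    const bytesRead = fs.readSync(fd, chunk, 0, length, offset);
    const data = chunk.subarray(0, bytesRead);

    // Assumption: the block is byte-aligned. Refuse to continue if it is not.
    if (!data.subarray(0, BLOCK_MAGIC.length).equals(BLOCK_MAGIC)) {
      throw new Error('no byte-aligned block magic at offset ' + offset);
    }

    const header = Buffer.from('BZh' + level, 'latin1');
    fs.writeFileSync(destPath, Buffer.concat([header, data]));
    return header.length + data.length; // bytes written
  } finally {
    fs.closeSync(fd);
  }
}
```

A call such as `extractBlock('latest-all.ttl.bz2', 'reduced.bz2', 13667225418, 2 * 1024 * 1024)` would then produce a small file to feed to the decoder. Like the attached file, the result has no end-of-stream marker, so a decoder is expected to complain once it runs past the copied data.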
Thanks for making this easier to reproduce. When I run bunzip2 (http://www.bzip.org) on the command line, I also get an error:
$ bunzip2 ~/Downloads/reduced.bz2 ~/tmp/
bunzip2: Compressed file ends unexpectedly;
perhaps it is corrupted? *Possible* reason follows.
bunzip2: No such file or directory
Input file = /home/wouter/Downloads/reduced.bz2, output file = /home/wouter/Downloads/reduced
It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.
You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
bunzip2: Deleting output file /home/wouter/Downloads/reduced, if it exists.
bunzip2: WARNING: some files have not been processed:
bunzip2: 2 specified on command line, 1 not processed yet.
Could the Wikidata download be corrupted?
@wouterbeek The reduced file does not include a valid end-of-stream marker; it's just a test harness to trigger the error. I didn't run the whole 60 GB through the C implementation, but it decompressed fine there, well past the location that throws the dinosaur in unbzip2-stream.
I'm currently suspecting the problem is in the Huffman table construction.
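As background on what "table construction" involves, here is a minimal sketch of canonical Huffman code assignment from a list of code lengths, which is the general scheme bzip2 decoders rely on. It is not the code from unbzip2-stream or from the linked PR, and the function name `canonicalCodes` is made up for this illustration. It assumes every symbol has a length of at least 1, as is the case in bzip2.

```javascript
'use strict';

// Assign canonical Huffman codes: shorter codes first and, within one
// length, symbols in increasing index order. Returns the codes as bit strings.
function canonicalCodes(lengths) {
  const minLen = Math.min(...lengths);
  const maxLen = Math.max(...lengths);
  const codes = new Array(lengths.length);
  let code = 0;
  for (let len = minLen; len <= maxLen; len++) {
    for (let sym = 0; sym < lengths.length; sym++) {
      if (lengths[sym] === len) {
        codes[sym] = code.toString(2).padStart(len, '0');
        code++;
      }
    }
    // Moving to the next length appends one bit to every remaining code.
    code <<= 1;
  }
  return codes;
}

console.log(canonicalCodes([2, 1, 3, 3])); // [ '10', '0', '110', '111' ]
```

A decoder builds its lookup tables from exactly this kind of length list, so an off-by-one in the per-length limits or base values only shows up for blocks whose code lengths hit the affected case. That would be consistent with an error that appears only deep into a very large file.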
@wouterbeek the linked PR should fix the problem. Could you run it against the whole file if you have the time?
@sfriesel After having downloaded the Wikidata Bzip2 file locally, I ran into the following issue using the above code:
internal/streams/legacy.js:61
throw er; // Unhandled stream error in pipe.
^
Error
at new Bzip2Error (/home/t/forks/unbzip2-stream/lib/bzip2.js:31:19)
at Object.Error (/home/t/forks/unbzip2-stream/lib/bzip2.js:36:37)
at Object.bzip2.decompress (/home/t/forks/unbzip2-stream/lib/bzip2.js:300:46)
at decompressBlock (/home/t/forks/unbzip2-stream/index.js:30:29)
at decompressAndQueue (/home/t/forks/unbzip2-stream/index.js:47:20)
at Stream.write (/home/t/forks/unbzip2-stream/index.js:76:17)
at Stream.stream.write (/home/t/forks/unbzip2-stream/node_modules/through/index.js:26:11)
at ReadStream.ondata (_stream_readable.js:714:22)
at ReadStream.emit (events.js:311:20)
at ReadStream.Readable.read (_stream_readable.js:512:10) {
name: 'Bzip2Error',
message: 'Boom.',
stack: 'Error\n' +
' at new Bzip2Error (/home/t/forks/unbzip2-stream/lib/bzip2.js:31:19)\n' +
' at Object.Error (/home/t/forks/unbzip2-stream/lib/bzip2.js:36:37)\n' +
' at Object.bzip2.decompress (/home/t/forks/unbzip2-stream/lib/bzip2.js:300:46)\n' +
' at decompressBlock (/home/t/forks/unbzip2-stream/index.js:30:29)\n' +
' at decompressAndQueue (/home/t/forks/unbzip2-stream/index.js:47:20)\n' +
' at Stream.write (/home/t/forks/unbzip2-stream/index.js:76:17)\n' +
' at Stream.stream.write (/home/t/forks/unbzip2-stream/node_modules/through/index.js:26:11)\n' +
' at ReadStream.ondata (_stream_readable.js:714:22)\n' +
' at ReadStream.emit (events.js:311:20)\n' +
' at ReadStream.Readable.read (_stream_readable.js:512:10)'
}
@wouterbeek I bit the bullet and downloaded the whole thing too. But it decompressed successfully here. Did the file maybe change in the meantime? Mine has a SHA1 of c0e587926ff394c18c8082b33e489d29a8d8a99f.
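To compare downloads, the digest can be computed without loading the file into memory. The following is a minimal sketch using Node's built-in `crypto` and `fs` modules. The function name `sha1OfFile` and the 1 MiB chunk size are arbitrary choices for this example.

```javascript
'use strict';
const crypto = require('crypto');
const fs = require('fs');

// Compute the SHA-1 digest of a file by reading it in fixed-size chunks,
// so that a multi-gigabyte file never has to fit in memory.
function sha1OfFile(filePath, chunkSize = 1 << 20) {
  const hash = crypto.createHash('sha1');
  const buf = Buffer.alloc(chunkSize);
  const fd = fs.openSync(filePath, 'r');
  try {
    let n;
    // position = null reads from the current file position and advances it.
    while ((n = fs.readSync(fd, buf, 0, chunkSize, null)) > 0) {
      hash.update(buf.subarray(0, n));
    }
  } finally {
    fs.closeSync(fd);
  }
  return hash.digest('hex');
}
```

Running `sha1OfFile('latest-all.ttl.bz2')` and comparing the result with the digest quoted above shows whether two local copies are the same file. Since the URL points at the latest dump, a different digest may simply mean a newer dump was downloaded.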
@sfriesel Thanks, I tried again and the file now unpacks successfully with the above patch. Thanks for fixing!
Published as 1.4.1
When unpacking a large bzip2 file (https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2) I get the following error:
This is the same error that was earlier reported and fixed in #10.
Can somebody reproduce this error on their end? It's a big file and it may take a while before the error occurs. I'm using version 1.3.3 of the library.