regular / unbzip2-stream

streaming unbzip2 implementation in pure javascript for node and browsers

Dinosaur error when unpacking large bzip2 file #30

Closed wouterbeek closed 4 years ago

wouterbeek commented 4 years ago

When unpacking a large bzip2 file (https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2) I get the following error:

rawr i'm a dinosaur

This is the same error that was reported earlier and fixed in #10.

Can somebody reproduce this error on their end? It's a big file and it may take a while before the error occurs. I'm using version 1.3.3 of the library.

sfriesel commented 4 years ago

I managed to reproduce the error after about 13.6 GB in (the offending block starts at byte 13667225418). I'm still investigating the root cause.

Here's the block triggering the error, preceded by a bzip2 file header for faster debugging: reduced.zip

wouterbeek commented 4 years ago

Thanks for making this easier to reproduce. When I use bunzip2 on the command line (http://www.bzip.org) I also get an error:

$ bunzip2 ~/Downloads/reduced.bz2 ~/tmp/

bunzip2: Compressed file ends unexpectedly;
    perhaps it is corrupted?  *Possible* reason follows.
bunzip2: No such file or directory
    Input file = /home/wouter/Downloads/reduced.bz2, output file = /home/wouter/Downloads/reduced

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

bunzip2: Deleting output file /home/wouter/Downloads/reduced, if it exists.
bunzip2: WARNING: some files have not been processed:
bunzip2:    2 specified on command line, 1 not processed yet.

Could the Wikidata download be corrupted?

sfriesel commented 4 years ago

@wouterbeek The reduced file does not include a valid end-of-stream marker, it's just a testing harness to trigger the error. I didn't run the whole 60 GB through the C implementation but it decompressed fine there, well past the location that throws the dinosaur in unbzip2-stream.

I currently suspect the problem is in the Huffman table construction.
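For readers unfamiliar with that stage: bzip2 decoders build limit/base/perm tables from per-symbol code lengths, and an off-by-one in those tables only misdecodes rare blocks, which would match an error surfacing 13.6 GB into a file. Below is a generic sketch of canonical Huffman table construction, not the actual unbzip2-stream code; all names are hypothetical:

```javascript
// Build limit/base/perm decoding tables from per-symbol code lengths
// (canonical Huffman, as used by bzip2-style decoders).
function buildTables(lengths) {
  const maxLen = Math.max(...lengths);
  const minLen = Math.min(...lengths);
  const limit = [], base = [], perm = [];
  // Symbols ordered by code length, ties broken by symbol index.
  for (let len = minLen; len <= maxLen; len++)
    for (let sym = 0; sym < lengths.length; sym++)
      if (lengths[sym] === len) perm.push(sym);
  let code = 0, idx = 0;
  for (let len = minLen; len <= maxLen; len++) {
    const count = lengths.filter(l => l === len).length;
    base[len] = code - idx;     // first code of this length minus its perm index
    idx += count;
    code += count;
    limit[len] = code - 1;      // last valid code of this length
    code <<= 1;                 // next length's codes start here
  }
  return { limit, base, perm, minLen, maxLen };
}

// Decode one symbol from an array of bits, starting at pos.
function decodeSymbol(bits, pos, t) {
  let len = t.minLen, code = 0;
  for (let i = 0; i < len; i++) code = (code << 1) | bits[pos + i];
  while (code > t.limit[len]) {
    // No code of this length matches; pull in another bit and try longer codes.
    if (len === t.maxLen) throw new Error('invalid Huffman code'); // the error case
    code = (code << 1) | bits[pos + len];
    len++;
  }
  return { sym: t.perm[code - t.base[len]], len };
}
```

With lengths `[1, 2, 2]` the canonical codes are `0`, `10`, `11`, so `decodeSymbol([1, 0], 0, buildTables([1, 2, 2]))` yields symbol 1 with length 2.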

sfriesel commented 4 years ago

@wouterbeek the linked PR should fix the problem. Could you run it against the whole file if you have the time?

wouterbeek commented 4 years ago

@sfriesel After having downloaded the Wikidata Bzip2 file locally, I ran into the following issue using the above code:

internal/streams/legacy.js:61
      throw er; // Unhandled stream error in pipe.
      ^
Error
    at new Bzip2Error (/home/t/forks/unbzip2-stream/lib/bzip2.js:31:19)
    at Object.Error (/home/t/forks/unbzip2-stream/lib/bzip2.js:36:37)
    at Object.bzip2.decompress (/home/t/forks/unbzip2-stream/lib/bzip2.js:300:46)
    at decompressBlock (/home/t/forks/unbzip2-stream/index.js:30:29)
    at decompressAndQueue (/home/t/forks/unbzip2-stream/index.js:47:20)
    at Stream.write (/home/t/forks/unbzip2-stream/index.js:76:17)
    at Stream.stream.write (/home/t/forks/unbzip2-stream/node_modules/through/index.js:26:11)
    at ReadStream.ondata (_stream_readable.js:714:22)
    at ReadStream.emit (events.js:311:20)
    at ReadStream.Readable.read (_stream_readable.js:512:10) {
  name: 'Bzip2Error',
  message: 'Boom.',
  stack: 'Error\n' +
    '    at new Bzip2Error (/home/t/forks/unbzip2-stream/lib/bzip2.js:31:19)\n' +
    '    at Object.Error (/home/t/forks/unbzip2-stream/lib/bzip2.js:36:37)\n' +
    '    at Object.bzip2.decompress (/home/t/forks/unbzip2-stream/lib/bzip2.js:300:46)\n' +
    '    at decompressBlock (/home/t/forks/unbzip2-stream/index.js:30:29)\n' +
    '    at decompressAndQueue (/home/t/forks/unbzip2-stream/index.js:47:20)\n' +
    '    at Stream.write (/home/t/forks/unbzip2-stream/index.js:76:17)\n' +
    '    at Stream.stream.write (/home/t/forks/unbzip2-stream/node_modules/through/index.js:26:11)\n' +
    '    at ReadStream.ondata (_stream_readable.js:714:22)\n' +
    '    at ReadStream.emit (events.js:311:20)\n' +
    '    at ReadStream.Readable.read (_stream_readable.js:512:10)'
}

sfriesel commented 4 years ago

@wouterbeek I bit the bullet and downloaded the whole thing too. But it decompressed successfully here. Did the file maybe change in the meantime? Mine has a SHA1 of c0e587926ff394c18c8082b33e489d29a8d8a99f.

wouterbeek commented 4 years ago

@sfriesel Thanks, I tried again and the file now unpacks successfully with the above patch. Thanks for fixing!

regular commented 4 years ago

Published as 1.4.1