regular / unbzip2-stream

streaming unbzip2 implementation in pure JavaScript for Node and browsers

TypeError: Cannot read property '0' of undefined #7

Closed jviotti closed 8 years ago

jviotti commented 8 years ago

I get the following error when attempting to decompress various bzip2 files:

TypeError: Cannot read property '0' of undefined
    at f (/Users/jviotti/Projects/resin/etcher/node_modules/unbzip2-stream/lib/bit_iterator.js:24:34)
    at Object.bzip2.decompress (/Users/jviotti/Projects/resin/etcher/node_modules/unbzip2-stream/lib/bzip2.js:272:13)
    at decompressBlock (/Users/jviotti/Projects/resin/etcher/node_modules/unbzip2-stream/index.js:29:28)
    at decompressAndQueue (/Users/jviotti/Projects/resin/etcher/node_modules/unbzip2-stream/index.js:46:20)
    at Stream.write (/Users/jviotti/Projects/resin/etcher/node_modules/unbzip2-stream/index.js:75:36)
    at Stream.stream.write (/Users/jviotti/Projects/resin/etcher/node_modules/through/index.js:26:11)
    at Through2.ondata (_stream_readable.js:536:20)
    at emitOne (events.js:90:13)
    at Through2.emit (events.js:182:7)
    at readableAddChunk (_stream_readable.js:153:18)
    at Through2.Readable.push (_stream_readable.js:111:10)
    at Through2.Transform.push (_stream_transform.js:128:32)
    at afterTransform (_stream_transform.js:77:12)
    at TransformState.afterTransform (_stream_transform.js:54:12)
    at Through2.write [as _transform] (/Users/jviotti/Projects/resin/etcher/node_modules/progress-stream/index.js:46:3)
    at Through2.Transform._read (_stream_transform.js:167:10)
    at Through2.Transform._write (_stream_transform.js:155:12)
    at doWrite (_stream_writable.js:301:12)
    at writeOrBuffer (_stream_writable.js:287:5)
    at Through2.Writable.write (_stream_writable.js:215:11)
    at ReadStream.ondata (_stream_readable.js:536:20)
    at emitOne (events.js:90:13)
    at ReadStream.emit (events.js:182:7)
    at readableAddChunk (_stream_readable.js:153:18)
    at ReadStream.Readable.push (_stream_readable.js:111:10)
    at onread (fs.js:1818:12)
    at FSReqWrap.wrapper [as oncomplete] (fs.js:614:17)

The issue seems to come from the bit iterator. I was able to reproduce it with virtually every bzip2 file I created, at least on OS X.

regular commented 8 years ago

Looks like a duplicate of #3? Are you sure your stream is not truncated? Does it decompress without errors when you use bunzip2?

jviotti commented 8 years ago

@regular The stream is not truncated and it decompresses fine with bunzip2. I'm using version:

$ bzip2 -V
bzip2, a block-sorting file compressor.  Version 1.0.6, 6-Sept-2010.

I'm uploading a file that reproduces the issue to Dropbox, will ping you once it's done.

jviotti commented 8 years ago

@regular Here you go: https://dl.dropboxusercontent.com/u/18129309/foo.iso.bz2

I can reproduce the issue with the following snippet:

const fs = require('fs');
const unbzip2Stream = require('unbzip2-stream');
const FILE = '/Users/jviotti/Dropbox/Public/foo.iso.bz2';

fs.createReadStream(FILE)
  .pipe(unbzip2Stream())
  .pipe(fs.createWriteStream('/Users/jviotti/Desktop/foo.iso'))
  .on('error', function(error) {
    console.error(error);
    process.exit(1);
  });
regular commented 8 years ago

Ok, I can reproduce the problem. Obviously the stream thinks it has received enough bytes to decompress the next buffer, so it calls decompressAndQueue. However, decompressAndQueue turns out to try to consume more bytes than are in the bufferQueue. I am having a hard time figuring out why. I suspect a bug in either index.js or bit_iterator.js. Any help would be appreciated.

jviotti commented 8 years ago

Hi @regular,

I've been having a closer look at this myself, and indeed have no idea why the module tries to consume more bytes than are in the queue.

As a workaround, giving the while conditional in the through2 transformer a bit more room to breathe seems to be enough:

diff --git a/index.js b/index.js
index cdcd3fe..8556968 100644
--- a/index.js
+++ b/index.js
@@ -70,7 +70,7 @@ function unbzip2Stream() {
                     return bufferQueue.shift();
                 });
             }
-            while (hasBytes - bitReader.bytesRead + 1 >= (100000 * blockSize || 4)){
+            while (hasBytes - bitReader.bytesRead + 1 >= (100000 * blockSize * 1.2 || 4)){
                 //console.error('decompressing with', hasBytes - bitReader.bytesRead + 1, 'bytes in buffer');
                 if (!done) done = !decompressAndQueue(this);
                 if (done) break;

Of course, this is just taming the symptoms, but maybe we can merge this with a big TODO warning while we keep debugging the root cause.

I've tried the above patch with some bz2 files that previously failed and it seems to work fine. I also compared the resulting decompressed data byte by byte with the output of bunzip2.
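
For reference, here is a minimal sketch of that kind of byte-for-byte comparison (the file paths are placeholders, not the actual ones used above):

const fs = require('fs');
const assert = require('assert');

// Compare the module's output against a reference file produced by `bunzip2 -k`.
const actual = fs.readFileSync('/tmp/out.iso');
const expected = fs.readFileSync('/tmp/reference.iso');
assert.ok(actual.equals(expected), 'decompressed output differs from bunzip2');
console.log('outputs are identical');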

jviotti commented 8 years ago

@regular ping!

regular commented 8 years ago

I am really not a fan of this patch. 1.2 is an arbitrary value, as you are aware. Applying this "fix" would only obscure the root cause. What we need to do is understand what's really going on. Unfortunately I do not have the resources right now to do so. Maybe you can have a closer look (again)? There's a tool in the bzip2 package that outputs frame boundaries; maybe that's helpful.

regular commented 8 years ago

Oh, you need it for Etcher? That's an Electron-based app, right? Then there's no need to decompress using JavaScript at all. There are much faster options using native code out there. This module is really only useful if you need to decompress inside a browser.
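
For illustration, here is a minimal sketch of one such native-code route (an assumption, not a tool named in this thread): shell out to the system bzip2 binary and use its stdio as streams; the paths are placeholders:

const { spawn } = require('child_process');
const fs = require('fs');

// `bzip2 -dc` decompresses stdin to stdout, so its stdio can serve as a
// transform-like stage in a pipe chain.
const bunzip = spawn('bzip2', ['-dc']);
bunzip.on('error', console.error); // e.g. the bzip2 binary is not installed

fs.createReadStream('foo.iso.bz2').pipe(bunzip.stdin);
bunzip.stdout.pipe(fs.createWriteStream('foo.iso'));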

jviotti commented 8 years ago

@regular Yeah, fair enough.

> Oh, you need it for Etcher? That's an Electron-based app, right? Then there's no need to decompress using JavaScript at all. There are much faster options using native code out there. This module is really only useful if you need to decompress inside a browser.

I've done some research on that in the past (albeit very little, to be honest); however, I couldn't find a native bz2 decompression module that worked well as a transform stream. They either took the whole file as a buffer, or their fibers-based stream examples didn't work as expected (the resulting data was not correct).

Do you have any module you can particularly recommend?

regular commented 8 years ago

Other frustrated users of unbzip2-stream recommend gzbz. It sounds like it'll serve you well and be much faster (although your bottleneck is probably writing to the SD card, so it doesn't really matter).

I am currently working on fixing unbzip2-stream. It will change its API though and become a pull-stream, which is a better match: you pull data out of the decoder instead of the file pushing data through the decoder. That way the decoder decides when to read how many bits, which will resolve the current issue.
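
As a rough illustration of that pull-based wiring (a sketch only: the decoder stage below is a plain pass-through placeholder from the pull-stream package, not the actual API of any unbzip2-stream release), using pull-stream and stream-to-pull-stream:

const fs = require('fs');
const pull = require('pull-stream');
const toPull = require('stream-to-pull-stream');

// Placeholder "decoder": a pass-through stage. A real pull-based decompressor
// would request exactly the bytes it needs for each block here, instead of
// guessing how much buffered input is enough.
const decoderPlaceholder = pull.through();

pull(
  toPull.source(fs.createReadStream('foo.iso.bz2')), // push stream wrapped as a pull source
  decoderPlaceholder,
  pull.drain(function (chunk) {
    process.stdout.write(chunk);
  })
);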

jviotti commented 8 years ago

OK, cool, I'll give gzbz a go. As you correctly stated, the bottleneck is not the decompression, so I don't really mind using a JS implementation.

The pull-stream approach sounds good. I'll try gzbz, and will come back once unbzip2-stream is refactored if I encounter any issues with the other module.

jeromew commented 8 years ago

I encountered the same issue with unbzip2-stream before finding this issue. Whether I hit it on a given file seems to depend on random factors (I am streaming directly from the network). Can the bug depend on the size of the chunks that through writes?

I could not really understand how unbzip2 stops reading bits when the network is not fast enough to fill the parsing frontier. Could that be the problem? bits() always answers (with this bug) even when we are starving for bytes.

jeromew commented 8 years ago

After looking a bit more at the problem, it seems to me that it comes from the fact that the code thinks it has all the bytes necessary for a block as soon as it has more than 100kB * blockSize bytes.

You might think that 100kB * blockSize of poorly compressible data will lead to a bit less than 100kB * blockSize of compressed data, so requesting at least that much data should be sufficient. BUT this heuristic does not take into account the block headers.

According to Wikipedia (https://en.wikipedia.org/wiki/Bzip2), before the data we have:

.compressed_magic:48 = 0x314159265359 (BCD (pi))
.crc:32 = checksum for this block
.randomised:1 = 0=>normal, 1=>randomised (deprecated)
.origPtr:24 = starting pointer into BWT for after untransform
.huffman_used_map:16 = bitmap, of ranges of 16 bytes, present/not present
.huffman_used_bitmaps:0..256 = bitmap, of symbols used, present/not present (multiples of 16)
.huffman_groups:3 = 2..6 number of different Huffman tables in use
.selectors_used:15 = number of times that the Huffman tables are swapped (each 50 bytes)
.selector_list:1..6 = zero-terminated bit runs (0..62) of MTF'ed Huffman table (.selectors_used)
.start_huffman_length:5 = 0..20 starting bit length for Huffman deltas
.delta_bit_length:1..40 = 0=>next symbol; 1=>alter length { 1=>decrement length; 0=>increment length } ((symbols+2)*groups)

If I understand correctly, that makes a maximum of

48 + 32 + 1 + 24 + 16 + 256 + 3 + 6*2^15 + 40*20 bits ≈ 25000 bytes

so we should have a heuristic with

((25000 + 100000 * blockSize) || 4)

instead of

(100000 * blockSize || 4)

@jviotti could you test this with your file?

I am not a bzip2 expert, so I may have miscounted the maximum number of bytes in the block header.
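
As a quick sanity check of that estimate (purely illustrative, reusing the same terms as the sum above):

// Maximum block-header size, using the field widths quoted from Wikipedia.
const maxHeaderBits = 48 + 32 + 1 + 24 + 16 + 256 + 3 + 6 * Math.pow(2, 15) + 40 * 20;
console.log(Math.ceil(maxHeaderBits / 8)); // 24724, i.e. roughly 25000 bytes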

jeromew commented 8 years ago

@regular I can confirm that after applying #8 and #9 I could successfully run my script on a remote 200 GB file. It breaks without the patches.

#9 is, in spirit, very close to what @jviotti was suggesting (give more breathing room), but it is supported by a real reason: the maximum compressed block size necessary is maxBlockHeaderSize + blockSize * 100kB.

I hope you will have some time to test this, and maybe accept the PRs and publish a new version, even if you are currently preparing a better, pull-oriented version.

thanks!

jviotti commented 8 years ago

Hi @jeromew ,

Amazing progress! I'll give it a test here and report back how it went.

jviotti commented 8 years ago

Hi @jeromew , @regular

I can confirm the patches work fine and completely resolve the issue for me too (reverting the patches causes the bug to re-appear). I also checked that the decompressed result matches the original uncompressed file, which it does.

Thanks a lot for the amazing work @jeromew. Looking forward to having this merged!

regular commented 8 years ago

@jeromew I am very glad you found the reason for the buffer being too small! Thanks for the good work and sorry for the delay (I was in the wilderness).

regular commented 8 years ago

Published as 1.0.10

malikalimoekhamedov commented 5 years ago

Hello friends,

I'm afraid I should reopen this issue as I'm facing it again in version 6.9.0 of unbzip2-stream. Here's what I do:

axios(axiosConfig)
  .then((response) => {
    if (response.status === 200) {
      response.data
        .pipe(bz2())
        .pipe(tarExtract)
        .on('error', (err) => {
          console.error(err);
          process.exit(1);
        });
    }
  })
  .catch((err) => {
    console.log(err);
  });

And here's the traceback I'm getting:

internal/streams/legacy.js:59
      throw er; // Unhandled stream error in pipe.
      ^

TypeError: Cannot read property '0' of undefined
    at f (.../node_modules/unbzip2-stream/lib/bit_iterator.js:30:34)
    at Object.bzip2.decompress (.../node_modules/unbzip2-stream/lib/bzip2.js:278:13)
    at decompressBlock (.../node_modules/unbzip2-stream/index.js:30:29)
    at decompressAndQueue (.../node_modules/unbzip2-stream/index.js:47:20)
    at Stream.end (.../node_modules/unbzip2-stream/index.js:82:17)
    at _end (.../node_modules/through/index.js:65:9)
    at Stream.stream.end (.../node_modules/through/index.js:74:5)

It seems to be the same line as the one discussed previously.

I'm trying to download a bunch of large files from a remote Apache server over HTTPS with axios. I'm hitting this error randomly, sometimes sooner, sometimes later. Some files are decompressed correctly, but then I hit this wall and everything stops. Obviously, it mainly happens with the larger files.

Can anyone cast some light on this issue?
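
One side note on the snippet above (it does not fix the TypeError itself, only where it surfaces): .pipe() does not forward errors between streams, so an error emitted by the bz2() stage has no listener and gets thrown as "Unhandled stream error in pipe". Attaching a handler to each stage, reusing the same names as in the snippet, would at least make it a catchable event:

const decompress = bz2();
decompress.on('error', (err) => {
  // Errors from unbzip2-stream land here instead of being thrown.
  console.error('bz2 error:', err);
});

response.data
  .pipe(decompress)
  .pipe(tarExtract)
  .on('error', (err) => {
    console.error('tar error:', err);
  });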

sfriesel commented 5 years ago

@antikvarBE this is likely a different error, because it happens in Stream.end (when all input data is available), while the original bug was about reading too far ahead in Stream.write, in the middle of the stream. Again, the error could simply be caused by a truncated input stream; there's no explicit error message for that yet. Otherwise please open a new issue, ideally with a sample file that works in the bzip2 command-line tool.