mafintosh / tar-stream

tar-stream is a streaming tar parser and generator.

pack backpressure #114

Closed: missinglink closed this 4 years ago

missinglink commented 4 years ago

Heya,

I'm using this amazing lib to pack GBs of geographic data and then pipe it on to bzip2 for compression. Unfortunately bzip2 is sloooow, so those GBs I packed end up being buffered in Node.js memory waiting for bzip2 to ask for more.

The source of my memory woes seems to be that the return value from this.push() isn't being checked, which results in pack._readableState.buffer.length growing unbounded until I run out of RAM 😭

Looking at the source code, there is already a this._drain variable, which is perfect for implementing backpressure:

diff --git a/pack.js b/pack.js
index ba4eece..f1da3b7 100644
--- a/pack.js
+++ b/pack.js
@@ -128,9 +128,10 @@ Pack.prototype.entry = function (header, buffer, callback) {
   if (Buffer.isBuffer(buffer)) {
     header.size = buffer.length
     this._encode(header)
-    this.push(buffer)
+    var ok = this.push(buffer)
     overflow(self, header.size)
-    process.nextTick(callback)
+    if (ok) process.nextTick(callback)
+    else this._drain = callback
     return new Void()
   }

I've added a simple test case which I'm happy to clean up if this PR is acceptable?
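
For context, a rough sketch of the kind of behaviour the test exercises; the entry count, sizes and the slow Writable below are illustrative assumptions, not the actual test code:

var tar = require('tar-stream')
var { Writable } = require('stream')

var pack = tar.pack()

// a deliberately slow consumer, standing in for bzip2
var slow = new Writable({
  highWaterMark: 1024,
  write (chunk, enc, next) {
    setTimeout(next, 10)
  }
})
pack.pipe(slow)

var remaining = 1000

function writeNext () {
  if (remaining-- === 0) return pack.finalize()
  var buf = Buffer.alloc(1024 * 1024) // 1 MiB per entry
  // with the patch, this callback only fires once the readable side has drained,
  // so pack's internal buffer stays bounded instead of growing towards ~1 GB
  pack.entry({ name: 'entry-' + remaining + '.bin' }, buf, writeNext)
}

writeNext()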

The differences you'll notice in how the test displays are:

Please let me know if you think this is something you would consider including 🙇

mafintosh commented 4 years ago

PR makes sense. Will merge when I’m at my office. Curious why you aren’t using the streaming API instead of passing the full buffer?

missinglink commented 4 years ago

I'm doing this at the end of a series of object streams:

object_stream() -> object_stream() -> tar_stream() -> shell_stream()

When it comes time to pack files, I am extracting some fields from the object to generate the header and then calling JSON.stringify() on the whole object to serialize it to a JSON string (it's in GeoJSON format).

At this point I already have the data as a string so it seemed appropriate to use the buffer interface.
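
Roughly, the tail of that pipeline looks like the sketch below; toTarEntries, feature.id and the bzip2 invocation are placeholders for my real code:

var { Writable } = require('stream')
var { spawn } = require('child_process')
var fs = require('fs')
var tar = require('tar-stream')

var pack = tar.pack()

// shell_stream(): hand the tarball to bzip2 and write out the compressed result
var bzip2 = spawn('bzip2', ['-c'])
pack.pipe(bzip2.stdin)
bzip2.stdout.pipe(fs.createWriteStream('out.tar.bz2'))

// tar_stream(): one tar entry per GeoJSON feature, using the buffer interface
var toTarEntries = new Writable({
  objectMode: true,
  write (feature, enc, next) {
    var buf = Buffer.from(JSON.stringify(feature))
    // with the backpressure fix, `next` is held back until pack's buffer drains,
    // which paces the upstream object streams
    pack.entry({ name: feature.id + '.geojson', size: buf.length }, buf, next)
  },
  final (done) {
    pack.finalize()
    done()
  }
})

// objectStream() stands in for the upstream object streams:
// objectStream().pipe(toTarEntries)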

missinglink commented 4 years ago

I also think the requirement to specify a byte size up-front makes the streaming interface less attractive than the buffer interface.

I'm guessing that this is due to a requirement of the TAR format, presumably a leading byte which indicates the content length?

mafintosh commented 4 years ago

Yes, tar requires you to declare the size upfront. Use case makes sense.
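
For reference, the streaming entry API with an up-front size looks roughly like this (the file name is just an example):

var fs = require('fs')
var tar = require('tar-stream')

var pack = tar.pack()

// the tar header carries the byte size, so it has to be known before streaming
fs.stat('features.geojson', function (err, st) {
  if (err) throw err
  var entry = pack.entry({ name: 'features.geojson', size: st.size }, function (err) {
    if (err) throw err
    pack.finalize()
  })
  fs.createReadStream('features.geojson').pipe(entry)
})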

mafintosh commented 4 years ago

Out in 2.1.2, thanks!

piranna commented 4 years ago

I'm guessing that this is due to a requirement of the TAR format, presumably a leading byte which indicates the content length?

What about making it optional, and trying to find the actual length if it's not defined? Or would that lead to problems like excessive memory usage?

mafintosh commented 4 years ago

@piranna that would require buffering the entire stream, so in that case just use the api this PR improves
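
For illustration, auto-detecting the length would boil down to something like this hypothetical helper (not part of tar-stream), i.e. the buffer API with extra steps:

// hypothetical helper, not part of tar-stream: buffer the whole stream just to
// learn its length, then fall back to the existing buffer API
function entryFromStream (pack, header, stream, callback) {
  var chunks = []
  stream.on('data', function (chunk) { chunks.push(chunk) })
  stream.on('error', callback)
  stream.on('end', function () {
    // header.size is filled in from the buffer's length by pack.entry
    pack.entry(header, Buffer.concat(chunks), callback)
  })
}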

piranna commented 4 years ago

@piranna that would require buffering the entire stream, so in that case just use the api this PR improves

Yes, I suppose, but in that case, is it possible to do it automatically? :-)

mafintosh commented 4 years ago

You could magically do it on the stream, yes, but I think that'd lead to serious bugs for people not realising it buffers. Better docs for this are prob the way to go if we wanna improve it.

missinglink commented 4 years ago

I like the options that developers have; they cater to different use cases:

I have a stream-like thing and I know the length up-front (e.g. a file) -> use the streams API

I have some bytes in memory already -> use the buffer API

I have a stream-like thing and I don't know the length up-front (e.g. an object stream)

...and I am constrained by memory but am willing to spend CPU -> read through the stream once and count the bytes, then use the streams API (sketched below)

...and I am not worried about memory usage -> convert the stream to a string/buffer, use the buffer API
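
To make the "count the bytes" option concrete, a rough sketch; makeStream here is a hypothetical factory that can produce the same stream twice (for example fs.createReadStream bound to a path):

// rough sketch of the "count first, then stream" approach: pass one counts the
// bytes, pass two pipes the data through the streaming API with a known size
function packCounted (pack, name, makeStream, callback) {
  var size = 0
  makeStream()
    .on('data', function (chunk) { size += chunk.length })
    .on('end', function () {
      var entry = pack.entry({ name: name, size: size }, callback)
      makeStream().pipe(entry)
    })
}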

piranna commented 4 years ago

You could magically do it on the stream, yes, but I think that'd lead to serious bugs for people not realising it buffers. Better docs for this are prob the way to go if we wanna improve it.

My idea was to do under the hood something like:

I have a stream-like thing and I don't know the length up-front (eg. an object stream)

...and I am constrained by memory but am willing to spend CPU -> read through the stream once and count the bytes, use the streams API

But you are right, explaining the use cases in the docs seems a better alternative; probably the developer knows it beforehand :-)