nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org

Use a buffer pool for fs.ReadStream to boost performance? #14136

Closed venkatperi closed 7 years ago

venkatperi commented 7 years ago

REQUEST: Would you consider moving to a pool scheme since it'll boost performance without changing the API?

fs.ReadStream allocates a fresh 64 KiB (default, configurable via highWaterMark) Buffer for each chunk as it reads through a file, and each Buffer is released downstream when no longer needed. With larger files, this can result in a lot of time spent allocating Buffer objects only to have them picked up by the GC, which by itself can add up.

Reducing allocations with a buffer pool should boost performance across all file sizes without foreseeable changes to the public API.
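The pooling idea can be sketched roughly as follows (a minimal illustration only; the `BufferPool` class and its method names are hypothetical, not the proposed patch):

```javascript
// Minimal buffer-pool sketch (hypothetical API, not Node core code).
// Buffers come from a free list and are returned via release()
// instead of being dropped for the GC to collect.
class BufferPool {
  constructor(size = 64 * 1024, capacity = 16) {
    this.size = size;         // every pooled buffer has this length
    this.capacity = capacity; // max buffers kept on the free list
    this.free = [];
  }
  acquire() {
    // Reuse a pooled buffer if one is available, else allocate.
    return this.free.pop() || Buffer.allocUnsafe(this.size);
  }
  release(buf) {
    // Return the buffer for reuse; drop it if the pool is full.
    if (buf.length === this.size && this.free.length < this.capacity) {
      this.free.push(buf);
    }
  }
}

const pool = new BufferPool();
const a = pool.acquire();
pool.release(a);
const b = pool.acquire(); // same underlying buffer, no new allocation
console.log(a === b);     // true
```

The catch, as discussed below, is that release() has to be called by whoever consumed the chunk, which is exactly the refcounting burden the objections focus on.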

See post for more details.

[Chart: Performance vs File Size — higher is better]

[Chart: Percentage Boost in Performance — average 67% improvement across all sizes]

mscdex commented 7 years ago

I'm not sure what the question/request is here, but you can pass your own highWaterMark to fs.createReadStream(), which controls the size of allocated Buffer pools for fs.ReadStream.

venkatperi commented 7 years ago

The issue isn't the size of the buffer (yes, its size can be changed by setting highWaterMark), but rather the cost of allocating and releasing buffers.

TimothyGu commented 7 years ago

Unfortunately, while this test may yield promising results in an in vitro setting, actual adoption of pooling is much more difficult. It would require the consumer to inform core when it is done with a specific buffer. This not only detracts from the JS paradigm of relying on GC, but can also reduce the overall security of the platform, as refcounting can be very difficult to get right in more complex projects. And from a compatibility perspective, it would have to be opt-in, which diminishes the gains in actual deployment.

To demonstrate how easy it is to get refcounting wrong, the post has:

Writable.prototype.write = function (chunk, encoding, cb) {
  // ...
  if (!(chunk && typeof chunk._unref === 'function')) {
    return this._unref();
  }
  // ...
};

This simply does not work, as many asynchronous streams (including HTTP streams) assume the chunk passed into a writable stream will never be modified afterwards. Issues have in fact been filed against Node.js about this behavior (I am unable to locate them right now), but it was resolved that the status quo provides the best performance and was therefore kept.

jorangreef commented 7 years ago

I think this issue was prematurely closed and should be reopened. Node would benefit from more discussion on the subject.

This [...] detracts from the JS paradigm of relying on GC

The question of GC vs static allocation is not a binary decision, and there are ways to design APIs which avoid the problems @TimothyGu describes. Javascript itself does not force anyone to "rely on GC" and Node already provides for static allocation when it comes to file system read() and write() operations. If anything, the long-term trend in Node's API design has been more and more towards providing APIs which accept not just buffers, but also offsets into buffers to avoid pressuring the GC through needless allocations.

If some aspects of Node's APIs currently do not support pulling data into a user-provided buffer (which would facilitate static allocation), then perhaps this is more an indication of a problem in the design of the API in question than with the principle of static allocation.

One can definitely write Javascript to use static allocation for the heavy "data plane" work, and to use GC for the light "control plane" work, and I think Node needs to move in this direction to remain relevant. It doesn't have to be xor.

Encouraging users to rely on GC for data plane work makes no sense from a throughput point of view, and is going to lead to more issues relating to memory fragmentation and allocator choices (see https://github.com/nodejs/node/issues/6078, https://github.com/nodejs/node/issues/12805 and https://github.com/nodejs/node/issues/8871 for some recent examples).

jorangreef commented 7 years ago

See https://github.com/nodejs/node/pull/6923 for convincing raw numbers.

TimothyGu commented 7 years ago

@jorangreef My objections were based on the OP's approach to buffer pooling. I'd be happy if you could file a new issue about pulling into user-controlled buffers (instead of the Node.js-owned buffers discussed in this issue) – which I agree can be very helpful for throughput when done right – and it would be even better if you could offer some implementable proposals. That would also offer a path to reconcile with the Web Streams API's BYOB (bring your own buffer) readers in the future, if that happens.
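For context, a BYOB read lets the consumer supply the destination buffer. A sketch using the standard Web Streams API (this landed in Node.js well after this thread, as globals in Node 18+, and is shown only to illustrate the pattern):

```javascript
// A bytes stream that produces three bytes and closes.
const rs = new ReadableStream({
  type: 'bytes',
  start(controller) {
    controller.enqueue(new Uint8Array([1, 2, 3]));
    controller.close();
  },
});

// The consumer brings its own buffer; the stream fills it and hands
// back a view over the bytes actually written.
const reader = rs.getReader({ mode: 'byob' });
const result = reader.read(new Uint8Array(16));
result.then(({ value }) => {
  console.log(Array.from(value)); // the bytes land in our buffer: [1, 2, 3]
});
```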

TimothyGu commented 7 years ago

@jorangreef I just reopened #3403.

jorangreef commented 7 years ago

I'd be happy if you could file a new issue about pulling into user-controlled buffers (instead of Node.js-owned ones in this issue) ... @jorangreef I just reopened #3403.

Thanks very much @TimothyGu; #3403 is the issue for pulling into user-controlled buffers, and I appreciate you reopening it.

asynchronous streams (including HTTP streams) assume the chunk that's passed into a writable stream will never be changed

I agree with you on this.