nodejs / node

buffer: implement pooling mechanism for Buffer.allocUnsafe #30683

Open puzpuzpuz opened 4 years ago

puzpuzpuz commented 4 years ago

The problem

At the moment Buffer.allocUnsafe uses a pre-allocated internal Buffer which is then sliced for newly allocated buffers, when possible. The size of the pool may be configured by setting Buffer.poolSize property (8KB by default). Once the current internal buffer fills up, a new one is allocated by the pool.

Here is the related fragment of the official documentation:

The Buffer module pre-allocates an internal Buffer instance of size Buffer.poolSize that is used as a pool for the fast allocation of new Buffer instances created using Buffer.allocUnsafe() and the deprecated new Buffer(size) constructor only when size is less than or equal to Buffer.poolSize >> 1 (floor of Buffer.poolSize divided by two).
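
For illustration (not part of the original issue), here is a short snippet demonstrating this behavior: small allocations are views over the shared pool's backing ArrayBuffer, while requests larger than Buffer.poolSize >> 1 get their own allocation.

    // Minimal illustration of the existing pooling behaviour; the exact
    // output may vary depending on how much of the current pool is in use.
    const a = Buffer.allocUnsafe(256);
    const b = Buffer.allocUnsafe(256);
    // Both slices are typically views over the same pre-allocated pool.
    console.log(a.buffer === b.buffer); // usually true
    // Requests larger than Buffer.poolSize >> 1 bypass the pool entirely.
    const c = Buffer.allocUnsafe(Buffer.poolSize); // 8KB by default
    console.log(c.buffer === a.buffer); // false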

While this mechanism certainly improves performance, it doesn't implement actual pooling, i.e. reclamation and reuse of buffers. Thus it doesn't decrease the allocation rate in scenarios where Buffer.allocUnsafe is on the hot path (a common situation for network client libraries, e.g. DB drivers).

This issue is aimed at discussing the value of such a pooling mechanism and providing a good starting point for initial design feedback.

There are two main options for pooling in Node.js:

  1. Provide an API for manual reference counting. This approach is unsafe and requires a lot of discipline from users, so it doesn't sound like a good candidate.
  2. Integrate with the GC for buffer reclamation. This one is more complicated, as it requires hooking into garbage collection.

It appears that option 2 is feasible with experimental APIs (FinalizationGroup and, maybe, WeakRefs).

The solution

The implementation idea was described by @addaleax (see https://github.com/nodejs/node/issues/30611#issuecomment-557841480)

FinalizationGroup API could be enough to implement a pool that would reuse internal buffers once all slices pointing at them were GCed. When the pool uses an active internal buffer (let's call it a source buffer) to create slices, it could register those slices (as "the object whose lifetime we're concerned with") with their source buffer (as "holdings") in a FinalizationGroup. Once a finalizer function is called and it reports that all slices were GCed (it could be based on a simple counter), the pool could reclaim the corresponding source buffer and reuse it.
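
To make the idea more concrete, here is a rough, hypothetical sketch of such a pool. It is not the actual implementation: it uses the modern FinalizationRegistry name (the successor of FinalizationGroup), and the SlicePool name and structure are made up for illustration.

    // Hypothetical sketch only. Each source buffer gets a bookkeeping record
    // with a counter of live slices; once the counter drops to zero, the
    // source buffer goes back to a free list and can be reused.
    class SlicePool {
      constructor(sourceSize = 2 * 1024 * 1024) {
        this.sourceSize = sourceSize;
        this.freeSources = [];
        this.registry = new FinalizationRegistry((record) => {
          // Called after a slice registered with this record has been GCed.
          if (--record.live === 0 && record.retired) {
            this.freeSources.push(record.source);
          }
        });
        this._openSource();
      }

      _openSource() {
        const source = this.freeSources.pop() || Buffer.allocUnsafeSlow(this.sourceSize);
        this.record = { source, live: 0, retired: false };
        this.offset = 0;
      }

      // Assumes size <= sourceSize; a real implementation would fall back to
      // a plain allocation for larger requests.
      alloc(size) {
        if (this.offset + size > this.sourceSize) {
          // Retire the current source; it becomes reusable once all of its
          // slices have been collected.
          if (this.record.live === 0) {
            this.freeSources.push(this.record.source);
          } else {
            this.record.retired = true;
          }
          this._openSource();
        }
        const slice = this.record.source.subarray(this.offset, this.offset + size);
        this.offset += size;
        this.record.live++;
        // Track the slice's lifetime; the record identifies its source buffer.
        this.registry.register(slice, this.record);
        return slice;
      }
    }

A real implementation would also need to cap the free list and decide what to do with requests the pool cannot serve efficiently.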

There are some concerns with this approach.

First, this pooling mechanism may potentially lead to performance degradation under certain conditions. For large source buffers the cost of bookkeeping all slices in the FinalizationGroup may be too high. For small source buffers (especially 8KB ones, which is the default for the standard API's global pool) it's probably not worth it at all. This could be mitigated by a hybrid approach: keep the current mechanism as a fallback for small buffer allocations, as sketched below. So, some PoC experiments and benchmarking have to be done.
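
A rough sketch of what that hybrid fallback could look like (the threshold and names are hypothetical and would need to be tuned with benchmarks):

    // Hypothetical hybrid allocation: small requests keep using the existing
    // slicing mechanism; only larger ones go through the reclaiming pool.
    const POOLING_THRESHOLD = 16 * 1024; // assumption, to be tuned

    function hybridAlloc(pool, size) {
      if (size < POOLING_THRESHOLD) {
        return Buffer.allocUnsafe(size); // current fast path (8KB global pool)
      }
      return pool.alloc(size); // GC-integrated pool for larger slices
    }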

Another concern is that FinalizationGroup is an experimental API, so I'm not sure whether it is available internally without the --harmony-weak-refs flag (or planned to become available), as is done for the internal WeakReference API.

Proposed plan:

  1. Gather initial feedback
  2. Implement a PoC and benchmark it
  3. Discuss next steps (potentially, support in node core)

Alternatives

Such a pool could certainly be implemented as an npm module. However, support in node core could bring performance improvements to a much broader audience.

addaleax commented 4 years ago

The pool could store weak refs to internal buffers (let's call them source buffers) and register all slices (as "holdings") with their source buffer (as "the object whose lifetime we're concerned with") in a separate FinalizationGroup per source buffer.

Don’t you mean the other way around? The slices are what should be tracked, not the source buffers, right?

Another concern is that FinalizationGroup is an experimental API, so I'm not sure whether it is available internally without the --harmony-weak-refs flag (or planned to become available), as is done for the internal WeakReference API.

I don’t know what the V8 team’s plan on this is but I guess whatever we implement could be behind that flag for now? Weakrefs are a stage 3 proposal, so I think we can be relatively certain that they’re going to be standardized and implemented without requiring flags in the future.

lin7sh commented 4 years ago

The WeakReference implementation in V8 seems ready; it's only blocked by Chromium integration: https://bugs.chromium.org/p/v8/issues/detail?id=8179

puzpuzpuz commented 4 years ago

Don’t you mean the other way around? The slices are what should be tracked, not the source buffers, right?

Sure. We need to track the slices and register their source buffer as the "holdings". The FinalizationGroup API on its own may be sufficient to achieve this.

Thanks for noticing this incorrectness. I'll update the description.

I don’t know what the V8 team’s plan on this is but I guess whatever we implement could be behind that flag for now? Weakrefs are a stage 3 proposal, so I think we can be relatively certain that they’re going to be standardized and implemented without requiring flags in the future.

Yeah, nothing prevents activating this mechanism under the flag.

puzpuzpuz commented 4 years ago

As a follow-up, I have created a PoC repo: https://github.com/puzpuzpuz/nbufpool

It's in a very early draft stage at the moment. But at least the pool seems to be working (not covered with tests yet).

I have a couple of questions so far:

  1. Apart from --harmony-weak-refs, I had to add the --noincremental-marking flag. Otherwise node crashes:
    $ node --harmony-weak-refs benchmark/benchmark.js
    FATAL ERROR: v8::HandleScope::CreateHandle() Cannot create a handle without a HandleScope
    1: 0x9f0390 node::Abort() [node]
    2: 0x9f25e2 node::OnFatalError(char const*, char const*) [node]
    3: 0xb4be1a v8::Utils::ReportApiFailure(char const*, char const*) [node]
    4: 0xcc470a v8::internal::HandleScope::Extend(v8::internal::Isolate*) [node]
    5: 0xb4d5b1 v8::HandleScope::CreateHandle(v8::internal::Isolate*, unsigned long) [node]
    6: 0x9939ac node::Environment::RunWeakRefCleanup() [node]
    7: 0x95fc47 node::InternalCallbackScope::~InternalCallbackScope() [node]
    8: 0xa58bfe node::PerIsolatePlatformData::RunForegroundTask(std::unique_ptr<v8::Task, std::default_delete<v8::Task> >) [node]
    9: 0xa59652 node::PerIsolatePlatformData::FlushForegroundTasksInternal() [node]
    10: 0x130eb11  [node]
    11: 0x1321118  [node]
    12: 0x130f49b uv_run [node]
    13: 0xa2d493 node::NodeMainInstance::Run() [node]
    14: 0x9c1311 node::Start(int, char**) [node]
    15: 0x7f04c6165b97 __libc_start_main [/lib/x86_64-linux-gnu/libc.so.6]
    16: 0x95ed25  [node]
    Aborted (core dumped)

I see that all tests for FinalizationGroup in node are assumed to be run with --noincremental-marking. Is it a known issue?

  2. Could you point me at existing benchmarks for Buffer.allocUnsafe? I'd like to understand their scenario, especially the allocation rate that is sufficient for performance testing. The benchmark in the PoC repo is very primitive and I'd like to choose the right direction for improving it.

addaleax commented 4 years ago

Is it a known issue?

Yes, https://github.com/nodejs/node/pull/30616 should address that :)

Could you point me at existing benchmarks for Buffer.allocUnsafe?

I think benchmark/buffers/buffer-creation.js would be the main one?

puzpuzpuz commented 4 years ago

Yes, #30616 should address that :)

That's great. I'll be watching this PR.

I think benchmark/buffers/buffer-creation.js would be the main one?

Thanks. This one is even more primitive than mine, but it's still valuable, as I can look at the buffer sizes and the number of ops.

addaleax commented 4 years ago

@puzpuzpuz I’ve merged #30616, that crash should go away now :)

puzpuzpuz commented 4 years ago

@addaleax I tested both the v13 and v14 nightly builds. v14 works fine, but v13 still crashes. I guess the commit just wasn't included in the latest nightly and I should wait one more day, right?

addaleax commented 4 years ago

@puzpuzpuz We don’t always update release branches regularly, usually only when preparing a release. I’ve pushed that commit to v13.x-staging now, though.

puzpuzpuz commented 4 years ago

@addaleax Thanks for the clarification and for the push. I'm going to use v13 nightly for further experiments.

puzpuzpuz commented 4 years ago

I improved the benchmark and spent some time on experiments.

Here are the results for the simplest pool implementation (no constraints on pool size or throttling):

// Buffer.allocUnsafe with Buffer.poolSize = 2MB
Starting benchmark: type=no-pool-2mb, iterations=1024, ops per iteration=1024
Benchmark run finished: size=8192, time=1.6633010910000001, rate=630418.6329665553
Benchmark run finished: size=16384, time=1.796971164, rate=583524.1104625762
Benchmark run finished: size=32768, time=2.04675067, rate=512312.5231467494
Benchmark run finished: size=65536, time=1.504311737, rate=697047.0110744074
Benchmark run finished: size=131072, time=2.06939198, rate=506707.28896900435
Benchmark run finished: size=262144, time=4.269595161, rate=245591.4344240564
Benchmark run finished: size=524288, time=9.601368324, rate=109211.10039898519
Benchmark finished

// Simplest pool implementation
Starting benchmark: type=pool-2mb, iterations=1024, ops per iteration=1024
Benchmark run finished: size=8192, time=2.145832006, rate=488657.07896427007
Pool stats:  reclaimedCnt: 3880, reusedCnt: 3648, allocatedCnt: 448, sourcePoolSize: 232
Benchmark run finished: size=16384, time=2.160708108, rate=485292.75940496445
Pool stats:  reclaimedCnt: 7679, reusedCnt: 7480, allocatedCnt: 712, sourcePoolSize: 199
Benchmark run finished: size=32768, time=2.203816723, rate=475800.00145048357
Pool stats:  reclaimedCnt: 15680, reusedCnt: 15440, allocatedCnt: 944, sourcePoolSize: 240
Benchmark run finished: size=65536, time=2.249286621, rate=466181.5840676714
Pool stats:  reclaimedCnt: 32064, reusedCnt: 31200, allocatedCnt: 1568, sourcePoolSize: 864
Benchmark run finished: size=131072, time=2.496499574, rate=420018.4974676066
Pool stats:  reclaimedCnt: 64320, reusedCnt: 63040, allocatedCnt: 2496, sourcePoolSize: 1280
Benchmark run finished: size=262144, time=2.773130518, rate=378119.95980493556
Pool stats:  reclaimedCnt: 126464, reusedCnt: 125696, allocatedCnt: 5376, sourcePoolSize: 768
Benchmark run finished: size=524288, time=3.580863628, rate=292827.68318816304
Pool stats:  reclaimedCnt: 261888, reusedCnt: 251648, allocatedCnt: 10496, sourcePoolSize: 10240
Benchmark finished

Considerations:

Further plans:

  1. Come up with a better benchmark. I'm thinking of scheduling setImmediate callbacks in a loop within each iteration and keeping strong references to allocated buffers until the next iteration (see the sketch after this list). This should be close enough to what happens in network apps.
  2. Try to restrict the size of the pool, while keeping the proportion of new allocations to reuses as low as possible.
  3. Implement automatic shrinking of the pool on idle.
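
A hypothetical shape for such a benchmark (the names and sizes are illustrative; this is not the actual nbufpool benchmark):

    // Each iteration schedules its allocations via setImmediate and holds
    // strong references to the previous iteration's buffers, roughly
    // mimicking request/response buffers in a network app.
    const ITERATIONS = 1024;
    const OPS_PER_ITERATION = 1024;
    const SIZE = 16 * 1024;

    function runIteration() {
      return new Promise((resolve) => {
        setImmediate(() => {
          const buffers = [];
          for (let i = 0; i < OPS_PER_ITERATION; i++) {
            buffers.push(Buffer.allocUnsafe(SIZE)); // or pool.alloc(SIZE)
          }
          resolve(buffers);
        });
      });
    }

    async function main() {
      let previous = null; // keeps the last batch alive until it is replaced
      const start = process.hrtime.bigint();
      for (let i = 0; i < ITERATIONS; i++) {
        previous = await runIteration();
      }
      const seconds = Number(process.hrtime.bigint() - start) / 1e9;
      console.log(`rate: ${(ITERATIONS * OPS_PER_ITERATION / seconds).toFixed(0)} ops/sec`);
    }

    main();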

Any feedback is highly welcome.

puzpuzpuz commented 4 years ago

A small update.

I just realized that there is a way to significantly reduce the amount of FG bookkeeping by introducing another level of indirection. :)

Going to implement it and measure the impact.

puzpuzpuz commented 4 years ago

@addaleax As planned, I have reduced the amount of FG bookkeeping and it improved performance. I also implemented primitive shrinking of the pooled buffers by storing the pool's contents in a WeakRef.

However, there is one issue that blocks me. It looks like FG keeps strong references to all "holdings" forever. Even after the finalizer callback has been triggered and the "holdings" of GCed objects have been iterated, references to them are still kept internally in the FG (or at least that's how it looks from the outside).

Since I was using source buffers as "holdings" in the FinalizationGroup.register call (with an intermediate slice as the key object, i.e. the first argument), this becomes really critical: the off-heap memory keeps growing during the benchmark and is never freed.

You may see it if you try to run node --harmony-weak-refs benchmark/benchmark.js in the experimental library. On my machine memory consumption of the Node.js process goes up to almost 22GB.

I tried to introduce an intermediate map with <int-seq-num, source-buffer> pairs and it solved the memory problem, but that approach is obviously slower.
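
A hypothetical sketch of that workaround, using the modern FinalizationRegistry name (the helper names are made up):

    // Instead of passing the source buffer itself as "holdings", register a
    // small integer key and keep the source buffer in an ordinary Map that
    // the pool controls, so nothing large is retained inside the registry.
    const records = new Map(); // seq -> { source, live }

    const registry = new FinalizationRegistry((seq) => {
      const record = records.get(seq);
      if (record !== undefined && --record.live === 0) {
        records.delete(seq);
        // ...hand record.source back to the pool's free list here...
      }
    });

    // Called for every slice handed out from the source buffer behind `seq`.
    function trackSlice(slice, seq, source) {
      let record = records.get(seq);
      if (record === undefined) {
        record = { source, live: 0 };
        records.set(seq, record);
      }
      record.live++;
      registry.register(slice, seq); // only the primitive key is held as "holdings"
      return slice;
    }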

Does it sound like expected behavior of FG API or a bug to you? If that's expected, does it mean that the API can only accept on-heap objects and primitives?

addaleax commented 4 years ago

@puzpuzpuz I’m not sure – it looks like a bug in V8 to me, yes. Which I guess is fair for a feature behind an experimental flag.

I’m seeing the behaviour with the latest V8 master too – can you maybe try to break this down into a smaller reproduction?

puzpuzpuz commented 4 years ago

@addaleax I wasn't expecting a feature under an experimental flag to be rock solid. I'm going to write a simple reproducer and create a separate GH issue.

puzpuzpuz commented 4 years ago

Created #30853 to track the memory consumption issue.

dball commented 5 months ago

I find myself wanting this very feature and am curious whether this idea ran into technical issues, and, indeed, what folks are doing instead?