puzpuzpuz opened this issue 4 years ago (status: Open).
The pool could store weak refs to internal buffers (let's call them source buffers) and register all slices (as "holdings") with their source buffer (as "the object whose lifetime we're concerned with") in a separate FinalizationGroup per source buffer.
Don’t you mean the other way around? The slices are what should be tracked, not the source buffers, right?
Another concern is that it's an experimental API, so I'm not sure if FinalizationGroups are available internally without the --harmony-weak-refs flag (or planned to become available), like it's done for the internal WeakReference API.
I don’t know what the V8 team’s plan on this is but I guess whatever we implement could be behind that flag for now? Weakrefs are a stage 3 proposal, so I think we can be relatively certain that they’re going to be standardized and implemented without requiring flags in the future.
The WeakReference implementation in V8 seems ready; it's only blocked by Chromium integration: https://bugs.chromium.org/p/v8/issues/detail?id=8179
Don’t you mean the other way around? The slices are what should be tracked, not the source buffers, right?
Sure. We need to track the slices and use the source buffer as the "holdings". The FinalizationGroup API on its own may be sufficient to achieve this.
Thanks for noticing this incorrectness. I'll update the description.
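In API terms the correction boils down to which argument goes where in register(). A minimal sketch of the corrected direction, assuming the FinalizationGroup shape available behind --harmony-weak-refs (cleanup callback receives an iterator of holdings):

```js
// Minimal sketch: the slice is the tracked object (first argument to
// register()), the source buffer travels as the "holdings" (second
// argument) -- not the other way around.
const fg = new FinalizationGroup((iterator) => {
  for (const source of iterator) {
    // A slice of `source` has been GCed; update the per-source
    // bookkeeping here (e.g. a live-slice counter).
  }
});

const sourceBuffer = Buffer.allocUnsafeSlow(64 * 1024);
const slice = Buffer.from(sourceBuffer.buffer, 0, 8 * 1024);
fg.register(slice, sourceBuffer);
```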
I don’t know what the V8 team’s plan on this is but I guess whatever we implement could be behind that flag for now? Weakrefs are a stage 3 proposal, so I think we can be relatively certain that they’re going to be standardized and implemented without requiring flags in the future.
Yeah, nothing prevents activating this mechanism under the flag.
As a follow-up, I have created a PoC repo: https://github.com/puzpuzpuz/nbufpool
It's in a very early draft stage at the moment, but at least the pool seems to be working (not covered with tests yet).
I have a couple of questions so far:
In addition to --harmony-weak-refs, I had to add the --noincremental-marking flag. Otherwise node crashes:
$ node --harmony-weak-refs benchmark/benchmark.js
FATAL ERROR: v8::HandleScope::CreateHandle() Cannot create a handle without a HandleScope
1: 0x9f0390 node::Abort() [node]
2: 0x9f25e2 node::OnFatalError(char const*, char const*) [node]
3: 0xb4be1a v8::Utils::ReportApiFailure(char const*, char const*) [node]
4: 0xcc470a v8::internal::HandleScope::Extend(v8::internal::Isolate*) [node]
5: 0xb4d5b1 v8::HandleScope::CreateHandle(v8::internal::Isolate*, unsigned long) [node]
6: 0x9939ac node::Environment::RunWeakRefCleanup() [node]
7: 0x95fc47 node::InternalCallbackScope::~InternalCallbackScope() [node]
8: 0xa58bfe node::PerIsolatePlatformData::RunForegroundTask(std::unique_ptr<v8::Task, std::default_delete<v8::Task> >) [node]
9: 0xa59652 node::PerIsolatePlatformData::FlushForegroundTasksInternal() [node]
10: 0x130eb11 [node]
11: 0x1321118 [node]
12: 0x130f49b uv_run [node]
13: 0xa2d493 node::NodeMainInstance::Run() [node]
14: 0x9c1311 node::Start(int, char**) [node]
15: 0x7f04c6165b97 __libc_start_main [/lib/x86_64-linux-gnu/libc.so.6]
16: 0x95ed25 [node]
Aborted (core dumped)
I see that all tests for FinalizationGroup in node are assumed to be run with --noincremental-marking. Is it a known issue?
Could you point me at existing benchmarks for Buffer.allocUnsafe? I'd like to understand their scenario, especially the allocation rate that is sufficient for performance testing. The benchmark in the PoC repo is very primitive and I'd like to choose the right direction for improving it.
Is it a known issue?
Yes, https://github.com/nodejs/node/pull/30616 should address that :)
Could you point me at existing benchmarks for Buffer.allocUnsafe?
I think benchmark/buffers/buffer-creation.js would be the main one?
Yes, #30616 should address that :)
That's great. I'll be watching this PR.
I think benchmark/buffers/buffer-creation.js would be the main one?
Thanks. This one is even more primitive than mine, but it's still valuable as I can look at buf sizes and number of ops.
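For context, that kind of benchmark essentially measures raw allocation throughput. A self-contained sketch in the same spirit (sizes and iteration counts are arbitrary, this is not the actual core harness):

```js
// Standalone sketch of an allocation-rate micro-benchmark for
// Buffer.allocUnsafe; numbers are arbitrary, not from the core harness.
const sizes = [8192, 65536, 524288];
const ops = 1024 * 1024;

for (const size of sizes) {
  const start = process.hrtime.bigint();
  for (let i = 0; i < ops; i++) {
    Buffer.allocUnsafe(size);
  }
  const seconds = Number(process.hrtime.bigint() - start) / 1e9;
  console.log(`size=${size} rate=${Math.round(ops / seconds)} ops/sec`);
}
```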
@puzpuzpuz I’ve merged #30616, that crash should go away now :)
@addaleax I tested both v13 and v14 nightly builds. v14 works fine, but v13 still crashes. I guess that the commit just wasn't included in the latest nightly and I should wait one more day, right?
@puzpuzpuz We don’t always update release branches regularly, usually only when preparing a release. I’ve pushed that commit to v13.x-staging now, though.
@addaleax Thanks for the clarification and for the push. I'm going to use v13 nightly for further experiments.
I improved the benchmark and spent some time on experiments.
Here are the results for the simplest pool implementation (no constraints on pool size or throttling):
// Buffer.allocUnsafe with Buffer.poolSize = 2MB
Starting benchmark: type=no-pool-2mb, iterations=1024, ops per iteration=1024
Benchmark run finished: size=8192, time=1.6633010910000001, rate=630418.6329665553
Benchmark run finished: size=16384, time=1.796971164, rate=583524.1104625762
Benchmark run finished: size=32768, time=2.04675067, rate=512312.5231467494
Benchmark run finished: size=65536, time=1.504311737, rate=697047.0110744074
Benchmark run finished: size=131072, time=2.06939198, rate=506707.28896900435
Benchmark run finished: size=262144, time=4.269595161, rate=245591.4344240564
Benchmark run finished: size=524288, time=9.601368324, rate=109211.10039898519
Benchmark finished
// Simplest pool implementation
Starting benchmark: type=pool-2mb, iterations=1024, ops per iteration=1024
Benchmark run finished: size=8192, time=2.145832006, rate=488657.07896427007
Pool stats: reclaimedCnt: 3880, reusedCnt: 3648, allocatedCnt: 448, sourcePoolSize: 232
Benchmark run finished: size=16384, time=2.160708108, rate=485292.75940496445
Pool stats: reclaimedCnt: 7679, reusedCnt: 7480, allocatedCnt: 712, sourcePoolSize: 199
Benchmark run finished: size=32768, time=2.203816723, rate=475800.00145048357
Pool stats: reclaimedCnt: 15680, reusedCnt: 15440, allocatedCnt: 944, sourcePoolSize: 240
Benchmark run finished: size=65536, time=2.249286621, rate=466181.5840676714
Pool stats: reclaimedCnt: 32064, reusedCnt: 31200, allocatedCnt: 1568, sourcePoolSize: 864
Benchmark run finished: size=131072, time=2.496499574, rate=420018.4974676066
Pool stats: reclaimedCnt: 64320, reusedCnt: 63040, allocatedCnt: 2496, sourcePoolSize: 1280
Benchmark run finished: size=262144, time=2.773130518, rate=378119.95980493556
Pool stats: reclaimedCnt: 126464, reusedCnt: 125696, allocatedCnt: 5376, sourcePoolSize: 768
Benchmark run finished: size=524288, time=3.580863628, rate=292827.68318816304
Pool stats: reclaimedCnt: 261888, reusedCnt: 251648, allocatedCnt: 10496, sourcePoolSize: 10240
Benchmark finished
Considerations:
For larger buffer sizes (256KB and 512KB in this benchmark) the pool shows better rates than plain Buffer.allocUnsafe. So, the pool has some potential.
For smaller buffer sizes the internal pooling mechanism (used by Buffer.allocUnsafe) does a better job than FG-based buffer reclamation. FG has to do bookkeeping for tens of thousands of slices and its overhead outweighs the value of buffer reuse. I also profiled these scenarios and FG calls are the most "heavy" calls in the profiler report.
Pool stats show the ratio between allocations of new source buffers (done with Buffer.allocUnsafeSlow) and reuse of buffers from the pool. Say, in the 512KB scenario it's 10496 allocations vs. 251648 reuses (~1/25).
Further plans:
Improve the benchmark by scheduling setImmediate callbacks in a loop within each iteration and keeping strong references to the allocated buffers until the next iteration (see the sketch below). This should be close enough to what happens in network apps.
Any feedback is highly welcome.
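A rough shape of such an iteration could look like the following sketch; `allocate` is a stand-in for either Buffer.allocUnsafe or the pool under test, and the counts are arbitrary:

```js
// Sketch: each iteration allocates buffers from setImmediate callbacks and
// keeps strong references to them until the next iteration, roughly
// mimicking buffer lifetimes in a network application.
async function run(allocate, size, iterations, opsPerIteration) {
  let previous = [];                    // buffers from the previous iteration
  for (let iter = 0; iter < iterations; iter++) {
    const current = [];
    await new Promise((resolve) => {
      let pending = opsPerIteration;
      for (let i = 0; i < opsPerIteration; i++) {
        setImmediate(() => {
          current.push(allocate(size));
          if (--pending === 0) resolve();
        });
      }
    });
    previous = current;                 // old buffers become collectable here
  }
  return previous.length;               // keep the last batch observable
}

run(Buffer.allocUnsafe, 8192, 1024, 1024).then(console.log);
```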
A small update.
I just realized that there is a way to significantly reduce the amount of FG bookkeeping by introducing another level of indirection. :)
Going to implement it and measure the impact.
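I can only guess at the shape of that indirection, but one way to cut the bookkeeping is to register a single small tracker object per source buffer instead of every slice, and let each slice keep a strong reference to the tracker. A hypothetical sketch, not necessarily what the PoC ended up doing (reclaim(), newSource() and makeSlice() are made-up names):

```js
// Hypothetical sketch: one FinalizationGroup registration per source buffer
// instead of one per slice. All slices keep the shared tracker alive; when
// the last slice dies, the tracker dies, and the source can be reclaimed.
const fg = new FinalizationGroup((iterator) => {
  for (const source of iterator) {
    reclaim(source);                    // hypothetical pool-internal helper
  }
});

function reclaim(source) {
  // put `source` back into the free list of the pool (details omitted)
}

function newSource(size) {
  const source = Buffer.allocUnsafeSlow(size);
  const tracker = {};
  fg.register(tracker, source);         // a single registration per source
  return { source, tracker, offset: 0 };
}

function makeSlice(state, size) {
  const slice = Buffer.from(state.source.buffer, state.offset, size);
  state.offset += size;
  slice.tracker = state.tracker;        // strong ref keeps the tracker alive
  return slice;
}
```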
@addaleax As planned, I have reduced the amount of FG bookkeeping and it improved the performance. I also implemented primitive shrinking of the pooled buffers by storing the pool's contents in a WeakRef.
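The shrinking part could be as simple as parking the free list behind a WeakRef so the GC may drop the pooled source buffers when nothing else needs them. A sketch under that assumption (acquire()/release() are made-up entry points, not the PoC's API):

```js
// Sketch: the free list of source buffers lives behind a WeakRef, so the
// whole set can be collected under memory pressure ("shrinking" the pool).
let freeRef = new WeakRef([]);

function release(source) {
  let free = freeRef.deref();
  if (free === undefined) {
    free = [];                          // the old free list was GCed
    freeRef = new WeakRef(free);
  }
  free.push(source);
}

function acquire(size) {
  const free = freeRef.deref();
  if (free !== undefined && free.length > 0) {
    return free.pop();                  // reuse a pooled source buffer
  }
  return Buffer.allocUnsafeSlow(size);  // pool is empty or was collected
}
```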
However, there is one issue that blocks me. It looks like FG stores strong references to all "holdings" forever. Even after the finalizer callback has been triggered and the "holdings" of GCed objects have been iterated, references to them are kept internally in the FG (or at least that's how it looks from the outside).
As I'm trying to use source buffers as "holdings" in the FinalizationGroup.register call (with an intermediate slice as the key object, i.e. the first argument), this becomes really critical: the off-heap memory keeps growing during the benchmark and is never freed.
You may see it if you try to run node --harmony-weak-refs benchmark/benchmark.js in the experimental library. On my machine memory consumption of the Node.js process goes up to almost 22GB.
I tried to introduce an intermediate map with <int-seq-num, source-buffer> pairs and it solved the memory problem, but that approach is obviously slower.
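That workaround looks roughly like this: only a small integer travels through the FinalizationGroup as holdings, while the heavy source buffers stay in a regular Map keyed by that integer (sketch only; reclaimSource() and track() are hypothetical helpers):

```js
// Sketch of the intermediate-map workaround: the FG only ever holds small
// integers, while the heavy source buffers stay in a Map we control.
let nextId = 0;
const sources = new Map();              // id -> source buffer

const fg = new FinalizationGroup((iterator) => {
  for (const id of iterator) {
    const source = sources.get(id);
    sources.delete(id);                 // the FG never retains the buffer itself
    reclaimSource(source);              // hypothetical: update counters / reuse
  }
});

function reclaimSource(source) {
  // per-source bookkeeping, e.g. decrement a live-slice counter (omitted)
}

function track(slice, source) {
  const id = nextId++;
  sources.set(id, source);
  fg.register(slice, id);               // holdings is just a number now
  return slice;
}
```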
Does it sound like expected behavior of the FG API or a bug to you? If that's expected, does it mean that the API can only accept on-heap objects and primitives as holdings?
@puzpuzpuz I’m not sure – it looks like a bug in V8 to me, yes. Which I guess is fair for a feature behind an experimental flag.
I’m seeing the behaviour with the latest V8 master too – can you maybe try to break this down into a smaller reproduction?
@addaleax I wasn't expecting a feature behind an experimental flag to be rock solid. I'm going to write a simple reproducer and create a separate GH issue.
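Such a reproducer would probably look something like the following (a guess at the shape, not the actual script): large off-heap buffers passed as holdings keep accumulating even after the cleanup callback has iterated them.

```js
// Run with: node --harmony-weak-refs repro.js
// Guess at a minimal reproducer: if the FinalizationGroup keeps strong
// references to iterated holdings, RSS grows without bound even though
// every 1MB buffer should be unreachable after cleanup.
const fg = new FinalizationGroup((iterator) => {
  for (const holding of iterator) {
    // `holding` is a 1MB buffer; after this loop nothing should reference it.
  }
});

setInterval(() => {
  for (let i = 0; i < 100; i++) {
    fg.register({}, Buffer.allocUnsafeSlow(1024 * 1024));
  }
  const rss = Math.round(process.memoryUsage().rss / 1024 / 1024);
  console.log(`rss=${rss}MB`);
}, 100);
```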
Created #30853 to track the memory consumption issue.
I find myself wanting this very feature and am curious if this idea ran into technical issues, and, indeed, what folk are doing instead?
The problem
At the moment Buffer.allocUnsafe uses a pre-allocated internal Buffer which is then sliced for newly allocated buffers, when possible. The size of the pool may be configured by setting the Buffer.poolSize property (8KB by default). Once the current internal buffer fills up, a new one is allocated by the pool. This behavior is described in the official documentation for Buffer.allocUnsafe.
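A quick illustration of that documented behavior (sizes are arbitrary):

```js
// Allocations no larger than half of Buffer.poolSize (8KB by default,
// configurable) are sliced out of a shared pre-allocated internal buffer.
const a = Buffer.allocUnsafe(256);
const b = Buffer.allocUnsafe(256);
console.log(a.buffer === b.buffer);     // true while both fit into the same slab

// Buffer.allocUnsafeSlow always gets its own, un-pooled allocation.
const c = Buffer.allocUnsafeSlow(256);
console.log(c.buffer === a.buffer);     // false

// The slab size can be tuned, e.g.:
Buffer.poolSize = 2 * 1024 * 1024;      // newly created slabs will be 2MB
```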
While this mechanism certainly improves performance, it doesn't implement actual pooling, i.e. reclamation and reuse of buffers. Thus it doesn't decrease the allocation rate in scenarios where Buffer.allocUnsafe is on the hot path (a common situation for network client libraries, e.g. DB drivers).
This issue is aimed at discussing the value of such a pooling mechanism and providing a good starting point for initial design feedback.
There are two main options for pooling in Node.js: expose an explicit pool API and require users to release buffers back to the pool manually, or reclaim and reuse buffers automatically once the GC has collected all slices pointing at them.
It appears that option 2 (automatic reclamation) is feasible with experimental APIs (FinalizationGroup and, maybe, WeakRefs).
The solution
The implementation idea was described by @addaleax (see https://github.com/nodejs/node/issues/30611#issuecomment-557841480)
The FinalizationGroup API could be enough to implement a pool that would reuse internal buffers once all slices pointing at them were GCed. When the pool uses an active internal buffer (let's call it a source buffer) to create slices, it could register those slices (as "the object whose lifetime we're concerned with") with their source buffer (as "holdings") in a FinalizationGroup. Once a finalizer function is called and it reports that all slices were GCed (it could be based on a simple counter), the pool could reclaim the corresponding source buffer and reuse it.
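A sketch of that counter-based scheme, with hypothetical reuseSource() and createSlice() helpers standing in for the pool internals:

```js
// Sketch: count live slices per source buffer; when the counter drops to
// zero in the cleanup callback, the source buffer can be reused.
const liveSlices = new Map();           // source buffer -> live slice count

const fg = new FinalizationGroup((iterator) => {
  for (const source of iterator) {
    const left = liveSlices.get(source) - 1;
    if (left === 0) {
      liveSlices.delete(source);
      reuseSource(source);              // hypothetical: back into the pool
    } else {
      liveSlices.set(source, left);
    }
  }
});

function reuseSource(source) {
  // reset the source buffer's offset and make it available again (omitted)
}

function createSlice(source, offset, size) {
  const slice = Buffer.from(source.buffer, offset, size);
  liveSlices.set(source, (liveSlices.get(source) || 0) + 1);
  fg.register(slice, source);           // slice tracked, source as holdings
  return slice;
}
```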
There are some concerns with this approach.
First, this pooling mechanism may potentially lead to performance degradation under certain conditions. For large source buffers the cost of bookkeeping for all slices in the FinalizationGroup may be too high. For small source buffers (especially the 8KB source buffers used by default in the standard API's global pool) it's probably not worth it at all. This could be mitigated by a hybrid approach: keep the current mechanism as a fallback for small buffer allocations. So, some PoC experiments and benchmarking have to be done.
Another concern is that it's an experimental API, so I'm not sure if FinalizationGroups are available internally without the --harmony-weak-refs flag (or planned to become available), like it's done for the internal WeakReference API.
Proposed plan:
Alternatives
Such a pool could certainly be implemented as an npm module. However, support in node core could bring performance improvements to a much broader audience.