mscdex opened this issue 4 years ago
@mscdex Can you provide the output of `uname -a` (what the issue template asks for under Platform)? Is this on arm/arm64/x86/x64/something else? That’s probably relevant here.
Added.
Additionally, I'm using gcc 7.4.0 to compile if that matters.
Can you try #32116? If this is some regression on V8, it might already be fixed (or it might be worse on 8.1, but let's hope not).
The prof output makes it seem like the reason is that `std::shared_ptr`, as we compile it on x64, does not use atomic instructions but rather mutexes. I don’t quite know why it does that, but it’s probably something that can be addressed by passing compile flags. (That might be ABI-breaking, but we could do it for v14.x; see https://github.com/nodejs/node/issues/30786 for a related ARM issue.) Or it’s unrelated to that, but that’s not what the prof output says.
@mmarchini The V8 8.1 branch has the same performance as master.
`(__gnu_cxx::_Lock_policy)2` is `_S_atomic`, though, so … not quite sure that that’s actually responsible for the mutex lock/unlock calls.
I've also now tried:

- Compiling v12.16.1 locally (was using the binary from the website before)
- Compiling master with gcc 9.2.1
- Compiling master with `-march=native`
- Removing calls to `GetBackingStore()` where possible
- Reverting to `GetContents()` in places

All changes made little to no difference.
@mscdex do you see similar results with any of the benchmarks from benchmark/buffers? Maybe some benchmark similar to your private code?
@mmarchini Yes, for example:
```
confidence improvement accuracy (*) (**) (***)
buffers/buffer-tostring.js n=1000000 len=1024 args=0 encoding='utf8' *** -31.66 % ±0.42% ±0.55% ±0.72%
```
The `--prof-process` output for that `buffer-tostring` benchmark shows basically the same top results as I showed originally.
If it helps, I can also see the same performance degradation on `4.15.0-88-generic #88-Ubuntu SMP Tue Feb 11 20:11:34 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux`, gcc 7.5.0:
```
$ node benchmark/compare.js --old ../node-v12.x/node --new ./node --filter buffer-tostring --runs 10 --set len=1024 --set encoding=utf8 buffers | Rscript benchmark/compare.R
[00:00:25|% 100| 1/1 files | 20/20 runs | 3/3 configs]: Done
confidence improvement accuracy (*) (**) (***)
buffers/buffer-tostring.js n=1000000 len=1024 args=0 encoding='utf8' *** -34.37 % ±1.24% ±1.70% ±2.32%
buffers/buffer-tostring.js n=1000000 len=1024 args=1 encoding='utf8' *** -31.29 % ±0.59% ±0.82% ±1.12%
buffers/buffer-tostring.js n=1000000 len=1024 args=3 encoding='utf8' *** -32.47 % ±0.98% ±1.34% ±1.83%
```
@addaleax I think there's a mix of atomics and mutex locks going on:
And looking at V8 source code, they explicitly use mutexes (backing-store.cc#L518, mutex.h#L292, mutex.cc#L14-L32).
Looking at git blame and git history, the perf regression was likely introduced in v8/v8@b6b7de0d60b1dc4b72e953828e1d8d7333d1f6ce, but I'm not entirely sure. If this is a V8 issue, it would be good to have a more isolated repro available.
Edit: The third time the commit above was reverted, it was because V8 identified a performance regression. Maybe the regression was not fixed yet?
I'm not sure if there's a way to trigger it without using any node-specific API, but here is a more minimal example that should still reproduce the issue:
```js
'use strict';
const buf = Buffer.alloc(100, 'a');
console.time('slice');
for (let i = 0; i < 1e7; ++i)
  buf.utf8Slice(0, 32);
console.timeEnd('slice');
```
Sorry, should've been more specific, a V8-only repro.
Or maybe a small Node.js repro is enough in this case? cc @nodejs/v8
@mmarchini Right, like I said I don't know that that's possible, because looking through the V8 source, there are not many places that call `GetBackingStore()` directly or indirectly, and those instances that do exist are more part of the C++ API, not methods that would be triggered from JS land.
The problem is not specific to `Buffer`; `ArrayBuffer` uses the new `GetBackingStore()` too and has the same problem. Reverting to `GetContents()` won't help, because the new `GetContents()` has been built around the new `GlobalBackingStoreRegistry` and is essentially the same function as `GetBackingStore()`. If you short-circuit most of `GlobalBackingStoreRegistry::Register()`, you will recover most of the lost performance.
Same example as @mscdex but with ArrayBuffer; still the same performance drop:

```js
'use strict';
const len = 512;
const buf = new ArrayBuffer(len);
const view = new Int32Array(buf);
console.time('toString ' + len);
for (let i = 0; i < 1e6; i++) {
  view.slice(0, 32).toString();
}
console.timeEnd('toString ' + len);
```
Is it possible Node could avoid at least some of these regressions by safely caching the result of `GetBackingStore()`?
I don't think it would be a good idea. Besides, even if you solve the problem for this artificial benchmark, it won't apply to real-world apps.
@mmomtchev I was thinking more along the lines of caching it at the time of Buffer construction (or lazily caching it), which would apply to everything. However, I'm not sure whether the value returned by `GetBackingStore()` can change during the lifetime of the Buffer.
@mscdex, yes, I was thinking about caching the returned value
There is a Google document on the new BackingStore framework: https://docs.google.com/document/d/1sTc_jRL87Fu175Holm5SV0kajkseGl2r8ifGY76G35k/edit#heading=h.5irk4csrpu0y
and a discussion here:
https://bugs.chromium.org/p/v8/issues/detail?id=9908
Keeping a separate copy of the pointers would amount to re-implementing their work (with fewer features, for better performance) in Node. I think the V8 team should solve this on their side, especially since ArrayBuffer is impacted too.
The problem is not that severe; this is probably the worst-case scenario:

- For Buffers up to 16 bytes, the function call mechanism is the leading CPU hog, so there is almost no difference between Node 12 and Node 14.
- For Buffers of 32 to 64 bytes, the difference is about 30%.
- As you go further up, the `GetBackingStore()` part matters less and less.

In fact, thinking about it again, there is no reason why very small Buffers would not be impacted. I am looking at the numbers and wonder if it has something to do with the JIT; I don't really have an explanation.
@mmomtchev just ran into this performance issue too, and it seems like there are two issues at play:

1. `Buffer.write` call overhead has increased (due to `GetBackingStore`, during 12.x), impacting short writes more. Handwriting short strings to a Buffer can be up to 20x faster (see this gist for more).
2. `Buffer.write` performance seems to have improved for strings >100 bytes, which might cause some of the benchmarks to miss the perf degradation.

Now, for cases where the buffer is allocated using `Buffer.alloc(N)`, which ultimately ends up calling `AllocatedBuffer::AllocateManaged`: wouldn't it make sense for the `Buffer` instance to hold a reference to the `BackingStore`? That way all the Buffer methods (there are quite a few of them) that need access to it won't pay the `GetBackingStore` overhead. (cc @addaleax)
> wouldn't it make sense for the `Buffer` instance to hold a reference to the `BackingStore`? That way all the Buffer methods (there are quite a few of them) that need access to it won't pay the `GetBackingStore` overhead. (cc @addaleax)

I guess the question there would be … how would we do that in a way that’s more efficient than calling `GetBackingStore`?
@addaleax - is my understanding below correct?

> - Buffer is effectively a thin wrapper around `Uint8Array`

Not really – Buffer objects are `Uint8Array`s, only with a different (subclassed) prototype.

> - All methods of Buffer are called while holding a strong reference to the underlying `Uint8Array` object

There is no underlying object in that sense, but yes, in order to be able to call a method on an object, you need to hold a reference to it.

> - thus the backing store is guaranteed to not be going out of scope while a Buffer method is still running

Yes, that’s always true.
Got it, so would it make sense to store the backing store pointer at c'tor time (via `SetPrivate` or the like) and use that instead of `GetBackingStore`?
@ledbit That’s something one could try out, yeah. Two things to keep in mind:
- Getting the pointer that way might still turn out to be slow.
@addaleax you're right. It turns out that storing and accessing fields using `SetPrivate`/`GetPrivate` is just as expensive as using `GetBackingStore`. Adding an `ArrayBuffer::BackingStoreUnsafe` achieves the goal, but that's V8 land ... back to square one.
@ledbit Just so you know, V8 is generally receptive to performance-related feature requests from our side. If we can be specific about what exactly we want, this is not a total blocker.
@addaleax the ask in this case would be to provide a method for accessing the backing store in an explicitly unsafe manner, for use in the `Buffer` methods. I'd be happy to take the lead and work with the V8 team if you can provide some guidance.
```cpp
void* v8::ArrayBuffer::BackingStoreUnsafe() const {
  i::Handle<i::JSArrayBuffer> obj = Utils::OpenHandle(this);
  return obj->backing_store();
}
```
which could then be used as part of `SPREAD_BUFFER_ARG` (diff below in context):

```diff
- SPREAD_BUFFER_ARG(args.This(), ts_obj);
+ v8::Local<v8::ArrayBufferView> name = args.This().As<v8::ArrayBufferView>();
+ const size_t ts_obj_offset = name->ByteOffset();
+ const size_t ts_obj_length = name->ByteLength();
+ char* const ts_obj_data = static_cast<char*>(name->Buffer()->BackingStoreUnsafe()) + ts_obj_offset;
```
Perf observations (Node 14.15.1, time in ms for 1e7 ops) for `Buffer.write` from this gist; the same should be applicable to these methods: `Fill`, `StringWrite`, `Copy`, `SwapXX`, `StringSlice`, `CompareOffset`, `IndexOfString`, `IndexOfBuffer`, `IndexOfNumber`, `EncodeInto`.

| StrLen | BackingStoreUnsafe | SPREAD_BUFFER_ARG |
| --- | --- | --- |
| 1 | 795 | 1443 |
| 4 | 945 | 1603 |
| 8 | 913 | 1555 |
| 16 | 957 | 1671 |
| 32 | 971 | 1671 |
| 64 | 999 | 1636 |
| 128 | 1031 | 1657 |
| 256 | 1086 | 1583 |
| 512 | 1231 | 1796 |
I assume that this also reduced the perf of js-bson.
Just FYI, we're hitting a similar issue with our debugger, which is based on Node and does a lot of small reads through the C API using small array buffers. The original code used libffi; then we moved to N-API directly, then to V8. All three implementations show strong slowdowns, which we were able to pinpoint to this backing store access when switching Node versions. We experience a 60 ms lag when querying the backing store, which makes it very impractical, to the point that we can't upgrade our system to the newest Node.js/V8. See https://github.com/microsoft/vscode/issues/138467 for our original issue.
I'm also experiencing this with node 17... Are there any updates? https://bugs.chromium.org/p/v8/issues/detail?id=10343 seems inactive
Sorry for the delay!
I added a new V8 API for accessing the raw data pointer more efficiently: https://chromium-review.googlesource.com/c/v8/v8/+/3764341
(This should be included in V8 10.5 when it ships.)
If I modify ArrayBufferViewContents::Read to use the new API, the timing for the example below goes from 1000 ms -> 650 ms in local measurements:
```js
'use strict';
const buf = Buffer.alloc(100, 'a');
console.time('slice');
for (let i = 0; i < 1e7; ++i)
  buf.utf8Slice(0, 32);
console.timeEnd('slice');
```
I didn't read through all the comments in this bug so I might be missing some nuance; please let me know if there are other regressions we should address on the V8 side.
Since that change is so minimal, it might make a nice backport to existing supported node branches?
For those following, I am working on the backport in #43921 and then will put up a series of followup diffs to replace various usages of `GetBackingStore()->Data()` with `Data()`: https://github.com/nodejs/node/compare/main...kvakil:node:remove-backingstores
`Linux foo 5.0.0-36-generic #39~18.04.1-Ubuntu SMP Tue Nov 12 11:09:50 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux`
I was running some benchmarks (for private code) and noticed a significant slowdown with some Buffer methods. Here is a comparison of the C++ portion of `--prof` between v12.16.1 and master.

v12.16.1:

master:

As you will see, master has these additional items at the top of the list:
Is there some way we can avoid this slowdown?