nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org

buffer: ~2x slowdown in master compared to v12.x #32226

mscdex opened this issue 4 years ago

mscdex commented 4 years ago

I was running some benchmarks (for private code) and noticed a significant slowdown with some Buffer methods. Here is a comparison of the C++ portion of --prof between v12.16.1 and master:

v12.16.1:

```
 [C++]:
   ticks  total  nonlib   name
      66    4.2%    8.4%  void node::Buffer::(anonymous namespace)::StringSlice<(node::encoding)1>(v8::FunctionCallbackInfo const&)
      54    3.4%    6.9%  __libc_read
      48    3.0%    6.1%  node::Buffer::(anonymous namespace)::ParseArrayIndex(node::Environment*, v8::Local, unsigned long, unsigned long*)
      35    2.2%    4.5%  v8::ArrayBuffer::GetContents()
      35    2.2%    4.5%  epoll_pwait
      27    1.7%    3.4%  v8::internal::Builtin_TypedArrayPrototypeBuffer(int, unsigned long*, v8::internal::Isolate*)
      26    1.6%    3.3%  v8::String::NewFromUtf8(v8::Isolate*, char const*, v8::NewStringType, int)
      24    1.5%    3.1%  node::native_module::NativeModuleEnv::CompileFunction(v8::FunctionCallbackInfo const&)
      22    1.4%    2.8%  __pthread_cond_signal
      20    1.3%    2.5%  node::StringBytes::Encode(v8::Isolate*, char const*, unsigned long, node::encoding, v8::Local*)
      16    1.0%    2.0%  v8::Value::IsArrayBufferView() const
      14    0.9%    1.8%  v8::ArrayBufferView::Buffer()
       9    0.6%    1.1%  v8::internal::FixedArray::set(int, v8::internal::Object)
       8    0.5%    1.0%  v8::ArrayBufferView::ByteLength()
       7    0.4%    0.9%  v8::Value::IntegerValue(v8::Local) const
       6    0.4%    0.8%  v8::ArrayBufferView::ByteOffset()
       5    0.3%    0.6%  __libc_malloc
       4    0.3%    0.5%  write
       4    0.3%    0.5%  v8::internal::libc_memmove(void*, void const*, unsigned long)
       4    0.3%    0.5%  node::binding::GetInternalBinding(v8::FunctionCallbackInfo const&)
       3    0.2%    0.4%  v8::internal::libc_memset(void*, int, unsigned long)
       3    0.2%    0.4%  __lll_unlock_wake
       2    0.1%    0.3%  void node::StreamBase::JSMethod<&node::StreamBase::WriteBuffer>(v8::FunctionCallbackInfo const&)
       2    0.1%    0.3%  fwrite
       2    0.1%    0.3%  do_futex_wait.constprop.1
       2    0.1%    0.3%  __clock_gettime
       2    0.1%    0.3%  __GI___pthread_mutex_unlock
       1    0.1%    0.1%  void node::StreamBase::JSMethod<&(int node::StreamBase::WriteString<(node::encoding)1>(v8::FunctionCallbackInfo const&))>(v8::FunctionCallbackInfo const&)
       1    0.1%    0.1%  v8::internal::Scope::DeserializeScopeChain(v8::internal::Isolate*, v8::internal::Zone*, v8::internal::ScopeInfo, v8::internal::DeclarationScope*, v8::internal::AstValueFactory*, v8::internal::Scope::DeserializationMode)
       1    0.1%    0.1%  v8::internal::RuntimeCallTimerScope::RuntimeCallTimerScope(v8::internal::Isolate*, v8::internal::RuntimeCallCounterId)
       1    0.1%    0.1%  v8::EscapableHandleScope::Escape(unsigned long*)
       1    0.1%    0.1%  std::ostreambuf_iterator > std::num_put > >::_M_insert_int(std::ostreambuf_iterator >, std::ios_base&, char, long) const
       1    0.1%    0.1%  std::ostream::sentry::sentry(std::ostream&)
       1    0.1%    0.1%  std::basic_ostream >& std::__ostream_insert >(std::basic_ostream >&, char const*, long)
       1    0.1%    0.1%  std::__detail::_Prime_rehash_policy::_M_need_rehash(unsigned long, unsigned long, unsigned long) const
       1    0.1%    0.1%  non-virtual thunk to node::LibuvStreamWrap::GetAsyncWrap()
       1    0.1%    0.1%  node::LibuvStreamWrap::ReadStart()::{lambda(uv_stream_s*, long, uv_buf_t const*)#2}::_FUN(uv_stream_s*, long, uv_buf_t const*)
       1    0.1%    0.1%  node::CustomBufferJSListener::OnStreamRead(long, uv_buf_t const&)
       1    0.1%    0.1%  node::AsyncWrap::EmitTraceEventBefore()
       1    0.1%    0.1%  mprotect
       1    0.1%    0.1%  getpid
       1    0.1%    0.1%  cfree
       1    0.1%    0.1%  __lll_lock_wait
```

master:

```
 [C++]:
   ticks  total  nonlib   name
     155    6.5%   14.0%  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()
      88    3.7%    7.9%  __GI___pthread_mutex_lock
      87    3.7%    7.8%  v8::ArrayBuffer::GetBackingStore()
      71    3.0%    6.4%  __GI___pthread_mutex_unlock
      54    2.3%    4.9%  void node::Buffer::(anonymous namespace)::StringSlice<(node::encoding)1>(v8::FunctionCallbackInfo const&)
      36    1.5%    3.2%  node::Buffer::(anonymous namespace)::ParseArrayIndex(node::Environment*, v8::Local, unsigned long, unsigned long*)
      32    1.4%    2.9%  v8::internal::Builtin_TypedArrayPrototypeBuffer(int, unsigned long*, v8::internal::Isolate*)
      31    1.3%    2.8%  v8::Value::IntegerValue(v8::Local) const
      27    1.1%    2.4%  node::native_module::NativeModuleEnv::CompileFunction(v8::FunctionCallbackInfo const&)
      27    1.1%    2.4%  epoll_pwait
      26    1.1%    2.3%  v8::String::NewFromUtf8(v8::Isolate*, char const*, v8::NewStringType, int)
      20    0.8%    1.8%  v8::Value::IsArrayBufferView() const
      20    0.8%    1.8%  __libc_read
      19    0.8%    1.7%  __pthread_cond_signal
      16    0.7%    1.4%  node::StringBytes::Encode(v8::Isolate*, char const*, unsigned long, node::encoding, v8::Local*)
      13    0.5%    1.2%  v8::ArrayBufferView::Buffer()
      10    0.4%    0.9%  v8::internal::FixedArray::set(int, v8::internal::Object)
       6    0.3%    0.5%  v8::internal::libc_memset(void*, int, unsigned long)
       6    0.3%    0.5%  v8::internal::RuntimeCallTimerScope::RuntimeCallTimerScope(v8::internal::Isolate*, v8::internal::RuntimeCallCounterId)
       5    0.2%    0.5%  v8::internal::libc_memmove(void*, void const*, unsigned long)
       5    0.2%    0.5%  v8::ArrayBufferView::ByteLength()
       4    0.2%    0.4%  write
       4    0.2%    0.4%  void node::StreamBase::JSMethod<&node::StreamBase::WriteBuffer>(v8::FunctionCallbackInfo const&)
       4    0.2%    0.4%  do_futex_wait.constprop.1
       4    0.2%    0.4%  __lll_lock_wait
       3    0.1%    0.3%  node::binding::GetInternalBinding(v8::FunctionCallbackInfo const&)
       3    0.1%    0.3%  _init
       2    0.1%    0.2%  v8::BackingStore::Data() const
       2    0.1%    0.2%  fwrite
       2    0.1%    0.2%  cfree
       2    0.1%    0.2%  __lll_unlock_wake
       2    0.1%    0.2%  __libc_malloc
       1    0.0%    0.1%  v8::internal::DeclarationScope::DeclareDefaultFunctionVariables(v8::internal::AstValueFactory*)
       1    0.0%    0.1%  v8::internal::Builtin_HandleApiCall(int, unsigned long*, v8::internal::Isolate*)
       1    0.0%    0.1%  v8::internal::AstValueFactory::GetOneByteStringInternal(v8::internal::Vector)
       1    0.0%    0.1%  std::num_put > >::do_put(std::ostreambuf_iterator >, std::ios_base&, char, long) const
       1    0.0%    0.1%  operator new[](unsigned long)
       1    0.0%    0.1%  node::contextify::ContextifyContext::CompileFunction(v8::FunctionCallbackInfo const&)
       1    0.0%    0.1%  node::TCPWrap::Connect(v8::FunctionCallbackInfo const&)
       1    0.0%    0.1%  node::Environment::RunAndClearNativeImmediates(bool)
       1    0.0%    0.1%  node::AsyncWrap::EmitTraceEventBefore()
       1    0.0%    0.1%  mprotect
       1    0.0%    0.1%  __pthread_cond_timedwait
       1    0.0%    0.1%  __fxstat
       1    0.0%    0.1%  __GI___pthread_getspecific
```

As you can see, master has these additional items at the top of the list:

    155    6.5%   14.0%  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()
     88    3.7%    7.9%  __GI___pthread_mutex_lock
     87    3.7%    7.8%  v8::ArrayBuffer::GetBackingStore()
     71    3.0%    6.4%  __GI___pthread_mutex_unlock

Is there some way we can avoid this slowdown?

addaleax commented 4 years ago

@mscdex Can you provide the output of `uname -a` (what the issue template asks for under Platform:)? Is this on arm/arm64/x86/x64/something else? That’s probably relevant here.

mscdex commented 4 years ago

Added.

mscdex commented 4 years ago

Additionally, I'm using gcc 7.4.0 to compile if that matters.

mmarchini commented 4 years ago

Can you try #32116? If this is some regression in V8, it might already be fixed (or it might be worse on 8.1, but let's hope not).

addaleax commented 4 years ago

The prof output suggests that the reason is that std::shared_ptr, as we compile it on x64, does not use atomic instructions but rather mutexes. I don’t quite know why it does that, but it’s probably something that can be addressed by passing compile flags. (That might be ABI-breaking, but we could do it for v14.x; see https://github.com/nodejs/node/issues/30786 for a related ARM issue.)

Or it’s unrelated to that, but that’s not what the prof output says.
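
For context, a minimal sketch of why shared_ptr traffic shows up in the profile: `v8::ArrayBuffer::GetBackingStore()` returns a `std::shared_ptr<v8::BackingStore>`, so every call bumps the reference count and every destruction of the returned handle releases it (the `_Sp_counted_base::_M_release` frames above). The function name below is illustrative, not Node's actual code:

```cpp
#include <memory>
#include "v8.h"

// Illustrative only: the hot path copies/releases a shared_ptr on every call.
size_t ByteLengthViaBackingStore(v8::Local<v8::ArrayBuffer> ab) {
  // The refcount increment happens inside GetBackingStore()...
  std::shared_ptr<v8::BackingStore> store = ab->GetBackingStore();
  size_t len = store->ByteLength();
  return len;
  // ...and _Sp_counted_base::_M_release runs when `store` goes out of scope.
}
```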

mscdex commented 4 years ago

@mmarchini The V8 8.1 branch has the same performance as master.

addaleax commented 4 years ago

(__gnu_cxx::_Lock_policy)2 is _S_atomic, though, so … not quite sure that that’s actually responsible for the mutex lock/unlock calls.
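
For reference, a quick check of how that enum value decodes, assuming GCC's libstdc++ (`<ext/concurrence.h>`):

```cpp
#include <ext/concurrence.h>

// libstdc++ defines: enum _Lock_policy { _S_single, _S_mutex, _S_atomic }; // 0, 1, 2
static_assert(static_cast<int>(__gnu_cxx::_S_atomic) == 2,
              "(__gnu_cxx::_Lock_policy)2 in the profile is _S_atomic");
```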

mscdex commented 4 years ago

I've also now tried:

All changes made little to no difference.

mmarchini commented 4 years ago

@mscdex do you see similar results with any of the benchmarks from benchmark/buffers? Maybe some benchmark similar to your private code?

mscdex commented 4 years ago

@mmarchini Yes, for example:

                                                                       confidence improvement accuracy (*)   (**)  (***)
 buffers/buffer-tostring.js n=1000000 len=1024 args=0 encoding='utf8'        ***    -31.66 %       ±0.42% ±0.55% ±0.72%

mscdex commented 4 years ago

The --prof-process output for that buffer-tostring benchmark shows basically the same top results as I showed originally.

puzpuzpuz commented 4 years ago

If it helps, I can also see the same performance degradation on `4.15.0-88-generic #88-Ubuntu SMP Tue Feb 11 20:11:34 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux`, gcc 7.5.0:

$ node benchmark/compare.js --old ../node-v12.x/node --new ./node --filter buffer-tostring --runs 10 --set len=1024 --set encoding=utf8 buffers | Rscript benchmark/compare.R
[00:00:25|% 100| 1/1 files | 20/20 runs | 3/3 configs]: Done
                                                                       confidence improvement accuracy (*)   (**)  (***)
 buffers/buffer-tostring.js n=1000000 len=1024 args=0 encoding='utf8'        ***    -34.37 %       ±1.24% ±1.70% ±2.32%
 buffers/buffer-tostring.js n=1000000 len=1024 args=1 encoding='utf8'        ***    -31.29 %       ±0.59% ±0.82% ±1.12%
 buffers/buffer-tostring.js n=1000000 len=1024 args=3 encoding='utf8'        ***    -32.47 %       ±0.98% ±1.34% ±1.83%

mmarchini commented 4 years ago

@addaleax I think there's a mix of atomic and mutex locks going on:

[screenshot of profiler output]

And looking at V8 source code, they explicitly use mutexes (backing-store.cc#L518, mutex.h#L292, mutex.cc#L14-L32).

Looking at git blame and git history, the perf regression was likely introduced in v8/v8@b6b7de0d60b1dc4b72e953828e1d8d7333d1f6ce, but I'm not entirely sure. If this is a V8 issue, it would be good to have a more isolated repro available.

Edit: the third time the commit above was reverted, it was because V8 identified a performance regression. Maybe that regression still hasn't been fixed?
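
A rough sketch of the pattern those links point at (not V8's actual code, just the shape of it): registration goes through a process-wide registry guarded by a mutex, which would account for the `pthread_mutex_lock`/`unlock` frames in the profile.

```cpp
#include <memory>
#include <mutex>
#include <unordered_map>

struct BackingStore;  // stand-in for V8's internal BackingStore type

// Not V8's code: just the shape of a globally locked registry.
class BackingStoreRegistrySketch {
 public:
  void Register(void* buffer_start, const std::shared_ptr<BackingStore>& store) {
    std::lock_guard<std::mutex> guard(mutex_);  // taken on every registration
    map_.emplace(buffer_start, store);          // registry keeps a weak reference
  }

 private:
  std::mutex mutex_;
  std::unordered_map<void*, std::weak_ptr<BackingStore>> map_;
};
```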

mscdex commented 4 years ago

I'm not sure if there's a way to trigger it without using any node-specific API, but here is a more minimal example that should still reproduce the issue:

'use strict';

const buf = Buffer.alloc(100, 'a');
console.time('slice');
for (let i = 0; i < 1e7; ++i)
  buf.utf8Slice(0, 32);
console.timeEnd('slice');
mmarchini commented 4 years ago

Sorry, should've been more specific, a V8-only repro.

Or maybe a small Node.js repro is enough in this case? cc @nodejs/v8

mscdex commented 4 years ago

@mmarchini Right, like I said, I don't know that that's possible: looking through the V8 source, there are not many places that call GetBackingStore() directly or indirectly, and the instances that do exist are mostly part of the C++ API rather than methods that would be triggered from JS land.

mmomtchev commented 3 years ago

The problem is not specific to Buffer; ArrayBuffer uses the new GetBackingStore() too and has the same problem. Reverting to GetContents() won't help, because the new GetContents() has been built around the new GlobalBackingStoreRegistry and is essentially the same function as GetBackingStore(). If you short-circuit most of GlobalBackingStoreRegistry::Register(), you recover most of the lost performance.

Same example as @mscdex but with ArrayBuffer; still the same performance drop:

'use strict';

const len = 512;
const buf = new ArrayBuffer(len);
const view = new Int32Array(buf);
console.time('toString ' + len);
for (let i = 0; i < 1e6; i++) {
    view.slice(0, 32).toString();
}
console.timeEnd('toString ' + len);
mscdex commented 3 years ago

Is it possible Node could avoid at least some of these regressions by safely caching the result of GetBackingStore()?

mmomtchev commented 3 years ago

I don't think it would be a good idea. Besides, even if you solve the problem for this artificial benchmark, it won't apply to real-world apps.

mscdex commented 3 years ago

@mmomtchev I was thinking more along the lines of caching it at the time of Buffer construction (or lazily caching it), which would apply to everything. However, I'm not sure whether the value returned by GetBackingStore() can change during the lifetime of the Buffer.
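
A hypothetical sketch of that caching idea (not actual Node.js code; the wrapper type and names are made up): pin the shared_ptr once when the buffer is created and reuse the raw pointer afterwards, instead of calling GetBackingStore() on every method call.

```cpp
#include <memory>
#include "v8.h"

// Hypothetical wrapper: pays the GetBackingStore() cost once, at construction.
struct CachedBuffer {
  v8::Global<v8::ArrayBuffer> ab;
  std::shared_ptr<v8::BackingStore> store;  // cached for the buffer's lifetime

  static CachedBuffer Create(v8::Isolate* isolate, size_t len) {
    CachedBuffer buf;
    v8::Local<v8::ArrayBuffer> local = v8::ArrayBuffer::New(isolate, len);
    buf.ab.Reset(isolate, local);
    buf.store = local->GetBackingStore();
    return buf;
  }

  // Hot paths read the cached pointer; no shared_ptr or registry traffic.
  // Caveat (as discussed above): if the backing store can change, e.g. on
  // detach, the cached pointer would go stale.
  char* data() const { return static_cast<char*>(store->Data()); }
};
```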

mmomtchev commented 3 years ago

@mscdex, yes, I was thinking about caching the returned value

There is a Google document on the new BackingStore framework: https://docs.google.com/document/d/1sTc_jRL87Fu175Holm5SV0kajkseGl2r8ifGY76G35k/edit#heading=h.5irk4csrpu0y and a discussion here: https://bugs.chromium.org/p/v8/issues/detail?id=9908

Keeping a separate copy of the pointers would amount to re-implementing their work in Node (with fewer features, for faster performance). I think the V8 team should solve this on their side, especially since ArrayBuffer is impacted too.

The problem is not that severe; this is probably the worst-case scenario. For Buffers up to 16 bytes, the function call mechanism is the leading CPU hog, so there is almost no difference between Node 12 and Node 14. For Buffers of 32 to 64 bytes the difference is about 30%. Then, as you go further up, the GetBackingStore() part matters less and less.

mmomtchev commented 3 years ago

In fact, thinking about it again, there is no reason why very small Buffers should not be impacted. I am looking at the numbers and I wonder if it has something to do with the JIT; I don't really have an explanation.

ledbit commented 3 years ago

@mmomtchev I just ran into this performance issue too, and it seems like there are two issues at play:

  1. Buffer.write call overhead has increased (due to GetBackingStore during 12.x), impacting short writes more. Hand-writing short strings to a Buffer can be up to 20x faster (see this gist for more).
  2. Buffer.write performance seems to have improved for strings >100 bytes, which might cause some of the benchmarks to miss the perf degradation.

Now, for cases where the buffer is allocated using Buffer.alloc(N), which ultimately ends up calling AllocatedBuffer::AllocateManaged, wouldn't it make sense for the Buffer instance to hold a reference to the BackingStore? That way all the Buffer methods (there are quite a few of them) that need access to it wouldn't pay the GetBackingStore overhead. (cc @addaleax)

addaleax commented 3 years ago

wouldn't it make sense for the Buffer instance to hold a reference to the BackingStore? That way all the Buffer methods (there are quite a few of them) that need access to it wouldn't pay the GetBackingStore overhead. (cc @addaleax)

I guess the question there would be … how would we do that in a way that’s more efficient than calling GetBackingStore?

ledbit commented 3 years ago

@addaleax - is my understanding below correct?

  1. Buffer is effectively a thin wrapper around Uint8Array
  2. All methods of Buffer are called while holding a strong reference to the underlying Uint8Array object
  3. thus the backing store is guaranteed not to go out of scope while a Buffer method is still running
addaleax commented 3 years ago

@addaleax - is my understanding below correct?

  1. Buffer is effectively a thin wrapper around Uint8Array

Not really – Buffer objects are Uint8Arrays, only with a different (subclassed) prototype.

  2. All methods of Buffer are called while holding a strong reference to the underlying Uint8Array object

There is no underlying object in that sense, but yes, in order to be able to call a method on an object, you need to hold a reference to it.

  3. thus the backing store is guaranteed not to go out of scope while a Buffer method is still running

Yes, that’s always true.

ledbit commented 3 years ago

Got it, so would it make sense to store the backing store pointer at c'tor time (via SetPrivate or the like) and use that instead of GetBackingStore?
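
A hypothetical sketch of that approach (not what Node ships; the private key name and helper functions are made up): stash the raw pointer on the object once via a v8::Private at construction, then read it back on the hot path instead of calling GetBackingStore(). (As noted further down, the private-property lookup itself turned out to be about as expensive.)

```cpp
#include "v8.h"

// Hypothetical helper: the key name is made up.
static v8::Local<v8::Private> BackingStoreKey(v8::Isolate* isolate) {
  return v8::Private::ForApi(
      isolate,
      v8::String::NewFromUtf8(isolate, "node:cached_backing_store")
          .ToLocalChecked());
}

// Called once at construction time: pay the GetBackingStore() cost here.
void StashDataPointer(v8::Local<v8::Context> ctx, v8::Local<v8::Object> obj,
                      v8::Local<v8::ArrayBuffer> ab) {
  v8::Isolate* isolate = ctx->GetIsolate();
  void* data = ab->GetBackingStore()->Data();
  obj->SetPrivate(ctx, BackingStoreKey(isolate),
                  v8::External::New(isolate, data))
      .Check();
}

// Called on the hot path instead of GetBackingStore().
char* LoadDataPointer(v8::Local<v8::Context> ctx, v8::Local<v8::Object> obj) {
  v8::Local<v8::Value> val =
      obj->GetPrivate(ctx, BackingStoreKey(ctx->GetIsolate())).ToLocalChecked();
  return static_cast<char*>(val.As<v8::External>()->Value());
}
```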

addaleax commented 3 years ago

@ledbit That’s something one could try out, yeah. Two things to keep in mind:

ledbit commented 3 years ago

Getting the pointer that way might still turn out to be slow.

@addaleax you're right. It turns out that storing and accessing fields using SetPrivate/GetPrivate is just as expensive as using GetBackingStore. Adding an ArrayBuffer::BackingStoreUnsafe achieves the goal, but that's V8 land ... back to square 1.

addaleax commented 3 years ago

@ledbit Just so you know, V8 is generally receptive to performance-related feature requests from our side. If we can be specific about what exactly we want, this is not a total blocker.

ledbit commented 3 years ago

@addaleax the ask in this case would be to provide a method for accessing the backing store in an explicitly unsafe manner, for use in the Buffer methods. I'd be happy to take the lead and work with the V8 team if you can provide some guidance.

void* v8::ArrayBuffer::BackingStoreUnsafe() const {
  i::Handle<i::JSArrayBuffer> obj = Utils::OpenHandle(this);
  return obj->backing_store();
}

which could then be used as part of SPREAD_BUFFER_ARG (diff below, in context):

-  SPREAD_BUFFER_ARG(args.This(), ts_obj);
+  v8::Local<v8::ArrayBufferView> name = args.This().As<v8::ArrayBufferView>();
+  const size_t ts_obj_offset = name->ByteOffset();                            
+  const size_t ts_obj_length = name->ByteLength();                            
+  char* const ts_obj_data = static_cast<char*>(name->Buffer()->BackingStoreUnsafe()) + ts_obj_offset; 

Perf observations (Node 14.15.1, time in ms for 1e7 ops) for Buffer.write from this gist; the same should be applicable to these methods: Fill, StringWrite, Copy, SwapXX, StringSlice, CompareOffset, IndexOfString, IndexOfBuffer, IndexOfNumber, EncodeInto

| StrLen | BackingStoreUnsafe (ms) | SPREAD_BUFFER_ARG (ms) |
|-------:|------------------------:|-----------------------:|
| 1      | 795                     | 1443                    |
| 4      | 945                     | 1603                    |
| 8      | 913                     | 1555                    |
| 16     | 957                     | 1671                    |
| 32     | 971                     | 1671                    |
| 64     | 999                     | 1636                    |
| 128    | 1031                    | 1657                    |
| 256    | 1086                    | 1583                    |
| 512    | 1231                    | 1796                    |

Uzlopak commented 2 years ago

I assume that this also reduced the perf of js-bson.

ncannasse commented 2 years ago

Just FYI, we're hitting a similar issue with our debugger, which is based on Node and does a lot of small reads through the C API using small array buffers. The original code was using libffi, then we moved directly to napi, then to V8. All three implementations show strong slowdowns, which we were able to pinpoint to this backing store access when switching Node versions.

We experience a 60 ms lag when querying the backing store, which makes it very impractical, to the point that we can't upgrade our system to the newest Node.js/V8. See https://github.com/microsoft/vscode/issues/138467 for our original issue.

dacevedo12 commented 2 years ago

I'm also experiencing this with node 17... Are there any updates? https://bugs.chromium.org/p/v8/issues/detail?id=10343 seems inactive

marjakh commented 1 year ago

Sorry for the delay!

I added a new V8 API for accessing the raw data pointer more efficiently: https://chromium-review.googlesource.com/c/v8/v8/+/3764341

(This should be included in V8 10.5 when it ships.)

If I modify ArrayBufferViewContents::Read to use the new API, the timing for the example below goes from 1000 ms -> 650 ms in local measurements:

'use strict';
const buf = Buffer.alloc(100, 'a');
console.time('slice');
for (let i = 0; i < 1e7; ++i)
  buf.utf8Slice(0, 32);
console.timeEnd('slice');

I didn't read through all the comments in this bug so I might be missing some nuance; please let me know if there are other regressions we should address on the V8 side.

mscdex commented 1 year ago

Since that change is so minimal, it might make a nice backport to existing supported node branches?

kvakil commented 1 year ago

For those following, I am working on the backport in #43921 and then will put up a series of followup diffs to replace various usages of GetBackingStore()->Data() with Data(): https://github.com/nodejs/node/compare/main...kvakil:node:remove-backingstores
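
For illustration, this is the kind of mechanical replacement those follow-ups make, assuming the v8::ArrayBuffer::Data() API added by the CL linked above:

```cpp
#include "v8.h"

// Before: materializes a std::shared_ptr<v8::BackingStore> on every call.
char* DataViaBackingStore(v8::Local<v8::ArrayBuffer> ab) {
  return static_cast<char*>(ab->GetBackingStore()->Data());
}

// After: returns the raw data pointer directly, with no shared_ptr or lock traffic.
char* DataDirect(v8::Local<v8::ArrayBuffer> ab) {
  return static_cast<char*>(ab->Data());
}
```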