jerive / bipf-napi

C++ port of ssbc/bipf
MIT License

Performance studies #1

Open staltz opened 2 years ago

staltz commented 2 years ago

Hey @jerive I was about to write some BIPF improvements to this module, but then I started thinking about the overhead of crossing from JS to C++, so I made some simple benchmarks.

I created a simple "increment by two" function in C++ like this:

NAPI_METHOD(inc) {
  NAPI_ARGV(1)
  NAPI_ARGV_INT32(numb, 0)
  numb += 2;
  NAPI_RETURN_UINT32(numb)
}

Then ran these benchmarks:

let bipf = require('node-gyp-build')(__dirname);

const obj = {
  inc(x) {
    return x + 2;
  },
};

let res = 0;
console.time('increment JS');
for (let i = 0; i < 1000000; i++) {
  res = obj.inc(i);
}
console.timeEnd('increment JS');

console.time('increment CPP');
for (let i = 0; i < 1000000; i++) {
  res = bipf.inc(i);
}
console.timeEnd('increment CPP');

And the numbers came out as:

increment JS: 2.988ms
increment CPP: 38.592ms

Let's try to get those CPP numbers down to something competitive with JS. Most likely what's going on here is that N-API is doing some copying. I want to discover whether we can have zero-copy, somehow. The next thing I'll try is to use the V8 APIs directly.
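For scale, the fixed per-call cost of the JS-to-native boundary implied by those numbers can be estimated with some quick arithmetic (using the measurements above):

```javascript
// Back-of-envelope estimate of the per-call JS -> C++ boundary cost,
// from the benchmark numbers above (1e6 iterations each).
const jsTotalMs = 2.988;   // measured: pure-JS loop
const cppTotalMs = 38.592; // measured: N-API loop
const iterations = 1e6;

// Extra cost per native call, converted from milliseconds to nanoseconds.
const overheadNs = ((cppTotalMs - jsTotalMs) / iterations) * 1e6;
console.log(overheadNs.toFixed(1)); // ≈ 35.6 ns per call
```

Roughly 35 ns per crossing is small in absolute terms, but it dwarfs the cost of the addition itself, which is why a per-element native call loses to inlined JS.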

jerive commented 2 years ago

I read Dominic's comment. Is it conceivable to get jitdb in Rust within a timeframe that would make the maintenance cost of a V8-specific implementation too high, compared to something solid though perhaps more future-proof? If I'm making myself clear.

staltz commented 2 years ago

I think JITDB in Rust or C++ or Zig is going to be a huge project, and getting it right (fixing bugs and benchmarking it) is going to take a lot of work. I think it should eventually be built, but realistically we're talking about at least 3 months of full time work. That's what it took us to build JITDB (and it includes async-append-only-log).

So I recommend not trying that, unless or until we get budget/resources for many months of full time work, and I'm assuming this doesn't fit into anyone's hobby time.

staltz commented 2 years ago

PS: I rewrote the above "inc" function in V8 C++, and it looks like this:

#include <node.h>

namespace demo {

using v8::FunctionCallbackInfo;
using v8::Isolate;
using v8::Local;
using v8::Number;
using v8::Object;
using v8::Value;

void Inc(const FunctionCallbackInfo<Value>& args) {
  Isolate* isolate = args.GetIsolate();

  double value = args[0].As<Number>()->Value() + 2;
  Local<Number> num = Number::New(isolate, value);

  args.GetReturnValue().Set(num);
}

void Init(Local<Object> exports) {
  NODE_SET_METHOD(exports, "inc", Inc);
}

NODE_MODULE(NODE_GYP_MODULE_NAME, Init)

}  // namespace demo

Benchmark results are:

increment JS: 3.700ms
increment CPP: 28.621ms

A bit better, but still very bad.

staltz commented 2 years ago

I did some profiling on the inc functions and V8::Number::New takes a big chunk of the time budget. I don't know exactly what JS is doing, but it might be that the JIT is optimizing and inlining the function, and perhaps also compiling the number operations directly down to lower levels, like machine code.

I think we are close to saying we can quit this experiment, and maybe we should try to benchmark/profile BIPF (JS) and see if we can do V8 tricks.

jerive commented 2 years ago

So I recommend not trying that, unless or until we get budget/resources for many months of full time work, and I'm assuming this doesn't fit into anyone's hobby time.

:rofl:

jerive commented 2 years ago

I think we are close to saying we can quit this experiment, and maybe we should try to benchmark/profile BIPF (JS) and see if we can do V8 tricks.

Were you thinking of any particular code paths? Considering that we don't have any JSON schema to map to, it feels like the object/array optimizations described in the article you mentioned can't really be applied.

I tried the same switch/case optimization that is done in encodingLengthers for decode, but it makes no difference at all in performance.
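For context, the decode dispatch being discussed switches on BIPF's 3-bit type tag. A minimal sketch of that tag layout, following the ssbc/bipf format (the helper name `splitTag` is mine, not from the module):

```javascript
// BIPF prefixes every value with a varint "tag": (byteLength << 3) | type.
// The 3-bit type codes below follow the ssbc/bipf format.
const TAG_SIZE = 3;
const TAG_MASK = 0b111;
const TYPES = ['string', 'buffer', 'int', 'double', 'array', 'object', 'boolnull', 'extended'];

// Hypothetical helper: split a decoded tag varint into its two fields.
function splitTag(tag) {
  return { type: TYPES[tag & TAG_MASK], length: tag >> TAG_SIZE };
}

console.log(splitTag((5 << TAG_SIZE) | 2)); // { type: 'int', length: 5 }
```

Since the type code is always in the low three bits, a switch over `tag & TAG_MASK` is already a dense jump table, which may be why adding more switch/case restructuring gains nothing.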

staltz commented 2 years ago

Timely comment. I just sat down to try some V8 tricks. I think the first step is to get debug information on which functions are being inlined/optimized and which functions are not being inlined, and then step-by-step try to make them all optimizable. So I don't know yet which tricks to try but getting the information out is the 1st step.

staltz commented 2 years ago

In bipf repo: node --trace-opt --trace-opt-stats --trace-deopt test/perf.js

jerive commented 2 years ago

This is interesting:

[bailout (kind: deopt-eager, reason: out of bounds): begin. deoptimizing 0x0a5c6b2d7b61 <JSFunction decode (sfi = 0x2eadd490a709)>, opt id 15, bytecode offset 21, deopt exit 16, FP to SP delta 104, caller SP 0x7ffcc61e89d0, pc 0x7f3592872206]

No, it happens after the test.

staltz commented 2 years ago

I came here to say the same thing. :sweat_smile:

Mine was

[deoptimizing (DEOPT eager): begin 0x0b5b1b0a10d9 <JSFunction decode (sfi = 0x3f7f9f7a7f09)> (opt #64) @4, FP to SP delta: 96, caller sp: 0x7fff0e1c1068]
            ;;; deoptimize at </home/staltz/oss/bipf/node_modules/.pnpm/varint@5.0.2/node_modules/varint/decode.js:19:12> inlined at </home/staltz/oss/bipf/index.js:228:20>, out of bounds

What I think it means is that it inlined varint/decode inside BIPF's decode function and then for some reason it deoptimized varint/decode. Maybe there's a way to prevent that deopt from happening. Maybe if we tweak the varint code a bit.
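One possible shape for that tweak: the "out of bounds" deopt suggests a read past the end of the buffer, so an explicit bounds check keeps the read in range. This is a sketch of an unsigned varint (LEB128) decoder, not the actual varint package code:

```javascript
// Hedged sketch: unsigned varint (LEB128) decoder with an explicit
// bounds check, so the hot path never indexes past the buffer end
// (the access pattern V8 reported as an "out of bounds" deopt reason).
function decodeVarint(buf, offset = 0) {
  let result = 0;
  let shift = 0;
  let pos = offset;
  while (true) {
    if (pos >= buf.length) throw new RangeError('varint: truncated input');
    const byte = buf[pos++];
    // Stay in int32 territory while shifts are safe, then fall back to
    // float math for the high groups (JS bitwise ops are 32-bit).
    result += shift < 28
      ? (byte & 0x7f) << shift
      : (byte & 0x7f) * Math.pow(2, shift);
    if (byte < 0x80) break; // high bit clear: last byte of the varint
    shift += 7;
  }
  return { value: result, bytes: pos - offset };
}

console.log(decodeVarint(Uint8Array.from([0xac, 0x02]))); // { value: 300, bytes: 2 }
```

If the deopt really is triggered by a speculative out-of-range load, hoisting a check like this might keep the inlined copy optimized; it would need re-profiling to confirm.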

jerive commented 2 years ago

https://github.com/P0lip/v8-deoptimize-reasons

staltz commented 2 years ago

I keep seeing Smi and I have no idea what that means. :sweat_smile:

jerive commented 2 years ago

--print-opt-code says a lot of things, for example:

Inlined functions (count = 4)
 0x1ec3e348af79 <SharedFunctionInfo decode_string>
 0x1ec3e348ef71 <SharedFunctionInfo read>
 0x30907c666b39 <SharedFunctionInfo toString>
 0x30907c665359 <SharedFunctionInfo slice>

I keep seeing Smi and I have no idea what that means. :sweat_smile:

Me neither, but I found: https://stackoverflow.com/questions/57348783/how-does-v8-store-integers-like-5
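For the thread: "Smi" is V8's "small integer" representation. An integer that fits in the tagged value word (commonly 31 bits, depending on pointer compression and platform) is stored inline with no heap allocation; anything larger, or any non-integer, becomes a heap-allocated HeapNumber. A quick sketch of the boundary (exact cutoffs are build-dependent):

```javascript
// "Smi" = small integer: V8 keeps integers that fit the tag payload
// inline in the value word, with no heap allocation. Larger or
// non-integer numbers are boxed as HeapNumber objects. The cutoff is
// build-dependent (commonly 2**30 - 1 or 2**31 - 1).
const likelySmi = (1 << 30) - 1; // 1073741823: within typical Smi ranges
const likelyHeap = 2 ** 53;      // far outside any Smi range
console.log(likelySmi, Number.isInteger(likelyHeap)); // 1073741823 true
```

This is why number-heavy hot paths that stay in Smi range tend to optimize well, while values that bounce between Smi and HeapNumber representations can trigger deopts.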