ocsigen / js_of_ocaml

Compiler from OCaml to JavaScript.
http://ocsigen.org/js_of_ocaml/

Poor performance when trying to build Node.js integration for Httpaf #835

Closed: Lupus closed this issue 5 years ago

Lupus commented 5 years ago

Hey guys,

We're building some internal framework that's going to be based on Httpaf (/cc @seliopou), and I'm trying to make this framework run in a Node.js environment. So far my approach is to integrate vanilla Httpaf using its public API on top of a Node.js socket (using gen_js_api for writing the Node bindings). I've got it working, js_of_ocaml is really awesome! But I ran into a performance issue: a simple "hello world" style HTTP handler can only do 5k requests per second on my Core i7 laptop. An analogous pure Node.js server shows 12k rps (I understand that the js_of_ocaml version with pure-OCaml HTTP processing can't compete with the optimized node parser, but getting less than 50% of the original performance looks like way too big a gap to me).
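
For context, the handler under test is just a hello-world responder written against Httpaf's public API, roughly along these lines (an illustrative sketch, not the actual benchmark code; the body and headers here are made up):

(* Minimal hello-world request handler against Httpaf's public API; this is an
   illustrative sketch, not the actual benchmark code from the framework. *)
let request_handler (reqd : Httpaf.Reqd.t) =
  let body = "Hello, world!" in
  let headers =
    Httpaf.Headers.of_list
      [ "content-length", string_of_int (String.length body)
      ; "content-type", "text/plain"
      ]
  in
  let response = Httpaf.Response.create ~headers `OK in
  Httpaf.Reqd.respond_with_string reqd response body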

I've tried profiling my app; the results are in this screenshot:

[screenshot]

The first item corresponds to this source location within the Faraday library:

[screenshot]

The third line has less time spent in it:

[screenshot]

Is there something particular in that Array.unsafe_get that makes it so expensive?

hhugo commented 5 years ago

Array.unsafe_get should not be expensive. Can you describe how you've built the JavaScript file? There are two compilation modes: whole-program compilation and separate compilation.

Lupus commented 5 years ago

I've tried with the below dune file:

(executable
  (name node_sample)
  (public_name node_sample)
  (flags (:standard -g))
  (libraries core_kernel)
  (preprocess (pps lwt_ppx))
  (js_of_ocaml
    (javascript_files ../js_runtime/bigstring_node.js)
    (flags (:standard
      +gen_js_api/ojs_runtime.js
      --pretty
      --no-inline
      --debug-info
      --source-map
      --disable shortvar)))
  (modes byte))

I've also tried replacing all of these options with just --opt 3, but did not see much difference.

Lupus commented 5 years ago

Looking more at the profile, it seems that dequeue_exn is being run in a large recursive chain of calls; maybe that is the culprit of the performance issues. AFAIK Node.js is not great at optimizing recursion.

Oh, and the magic bigstring_node.js overrides caml_ba_init_views so that Node's Buffer is used instead of Uint8Array. It also includes helpers to convert between Bigstring.t and Node's Buffer objects. In my understanding that should eliminate some memory copies when casting between Uint8Array and Buffer.
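
For illustration, the OCaml side of such helpers could be declared roughly like this (a sketch only; the function and primitive names are placeholders, not the actual bigstring_node.js API):

(* Hypothetical OCaml externals for the Buffer <-> Bigstring conversion helpers
   mentioned above; the primitive names are placeholders and must match
   whatever the JS runtime file actually provides. *)
external buffer_to_bigstring : Ojs.t -> Core_kernel.Bigstring.t
  = "caml_bigstring_of_node_buffer"
external bigstring_to_buffer : Core_kernel.Bigstring.t -> Ojs.t
  = "caml_node_buffer_of_bigstring"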

Lupus commented 5 years ago

In my understanding that should eliminate some memory copies when casting between Uint8Array and Buffer.

FYI: according to a quick test on this workload, the Buffer version gives 4% better performance than the Uint8Array version.

hhugo commented 5 years ago

I've tried with the below dune file:

That doesn't give enough information about whether or not you use separate compilation. I think dune will use separate compilation when using the dev profile (which is the default).

Lupus commented 5 years ago

I think dune will use separate compilation when using the dev profile (which is the default)

Dune's documentation agrees with this statement.
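
For reference, here is a sketch of how the compilation mode could be pinned down explicitly instead of relying on the profile; this assumes a dune version whose env stanza accepts a js_of_ocaml field with a compilation_mode setting, so check dune's documentation before relying on it:

; Hypothetical env stanza forcing whole-program compilation even in the dev
; profile; assumes the dune version in use supports the js_of_ocaml and
; compilation_mode fields shown here.
(env
 (dev
  (js_of_ocaml
   (compilation_mode whole_program))))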

I've tried building with the following command:

dune build node/node_sample.bc.js --profile release -j 1 --verbose

Full build log is attached in case it can help reveal some details:

build.log

Building this way does not show any significant changes in performance compared to --profile development (2309.61 rps in release vs 2267.17 rps in dev).

I'm also attaching build log from second build command I tried:

dune build node/node_sample.bc.js --profile development -j 1 --verbose

build_dev.log

How can I further troubleshoot this?

Lupus commented 5 years ago

Updating node from 8 to 10 gave a 30% improvement to perf, and going from 10 to 12 gave another 18%.

The profiler is still unhappy with the dequeue_exn function. In chart mode the whole "tower" of request processing ends up in multiple dequeue_exn calls.

[screenshot]

Newer node now attributes all of this time to the whole dequeue_exn function in the source code view:

[screenshot]

Looking at the JS code for this function, there's nothing particularly interesting inside, aside from the exception, which is probably the suspect for the low performance...

function dequeue_exn(t)
 { /*<<lib/faraday.ml 98 7>>*/ if
   ( /*<<lib/faraday.ml 98 7>>*/ is_empty(t))
   throw Dequeue_empty;
   /*<<lib/faraday.ml 101 6>>*/  /*<<lib/faraday.ml 101 6>>*/ var
   result=
    t[1][t[2] + 1];
   /*<<lib/faraday.ml 103 6>>*/ t[1][t[2] + 1] = sentinel;
   /*<<lib/faraday.ml 103 6>>*/ t[2] = t[2] + 1 | 0;
   /*<<lib/faraday.ml 103 6>>*/ return result /*<<lib/faraday.ml 104 12>>*/ }
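
For readability, that compiled output corresponds to OCaml roughly along these lines (a sketch; the type and field names are guessed from the t[1]/t[2] indices in the JS and need not match Faraday's actual source):

(* Rough OCaml equivalent of the compiled dequeue_exn above; the record type,
   field names and the sentinel-as-field choice are guesses, not Faraday's
   real definitions. *)
exception Dequeue_empty

type 'a t =
  { mutable elements : 'a array
  ; mutable front    : int
  ; mutable back     : int
  ; sentinel         : 'a  (* dummy value used to clear dequeued slots *)
  }

let is_empty t = t.front = t.back

let dequeue_exn t =
  if is_empty t then raise Dequeue_empty;
  let result = t.elements.(t.front) in
  (* overwrite the slot so the dequeued value is not retained *)
  t.elements.(t.front) <- t.sentinel;
  t.front <- t.front + 1;
  result
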
Lupus commented 5 years ago

/cc @seliopou

The changes in the above pull requests to Httpaf and Faraday remove the functions in question from the top of the profiler.

[screenshot]

The performance increase in my test is not that impressive though, +5%.

I'll dig into this further; meanwhile this issue can be closed. Thanks @hhugo for your help!