nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org

Slow start of Node application (ARM) #8259

Closed neophob closed 7 years ago

neophob commented 7 years ago

I measured the startup time of my application using https://github.com/henvic/require-time. The application runs on armv7l hardware; there is no big difference between Node v4.4.7 and the latest v6. npm version is 3.9.6.

Here are the results:

# ./node_modules/.bin/require-time
1146ms  express
75ms    file-type
2038ms  loggly
4541ms  mailgun-js
1ms     mustache
1ms     q
389ms   request
374ms   winston

I'm not entirely sure if this is an npm or a Node issue. What I do know is that the issue is NOT related to Node searching for the files, as I measure the same long startup time when starting my application using https://github.com/wix/fast-boot.

Here's an strace output when starting my application:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000043           0      2259      1101 stat64
  0.00    0.000000           0       479           read
  0.00    0.000000           0         9           write
  0.00    0.000000           0       640       103 open
  0.00    0.000000           0       534           close
  0.00    0.000000           0         1           unlink
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         8         8 access
  0.00    0.000000           0        50           brk
  0.00    0.000000           0        26         9 ioctl
  0.00    0.000000           0         2           umask
  0.00    0.000000           0         1           readlink
  0.00    0.000000           0        54           munmap
  0.00    0.000000           0         1           ftruncate
  0.00    0.000000           0         4           clone
  0.00    0.000000           0         3           uname
  0.00    0.000000           0        68           mprotect
  0.00    0.000000           0         1           poll
  0.00    0.000000           0        38           rt_sigaction
  0.00    0.000000           0        11           rt_sigprocmask
  0.00    0.000000           0       220           pread64
  0.00    0.000000           0         3           getcwd
  0.00    0.000000           0         1           ugetrlimit
  0.00    0.000000           0       171           mmap2
  0.00    0.000000           0       544         1 lstat64
  0.00    0.000000           0       424           fstat64
  0.00    0.000000           0         2           getuid32
  0.00    0.000000           0         1           getgid32
  0.00    0.000000           0         1           geteuid32
  0.00    0.000000           0         1           getegid32
  0.00    0.000000           0         8           getdents64
  0.00    0.000000           0       461        10 futex
  0.00    0.000000           0         3           epoll_ctl
  0.00    0.000000           0         7           epoll_wait
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           clock_getres
  0.00    0.000000           0         1           socket
  0.00    0.000000           0         1           bind
  0.00    0.000000           0         1           listen
  0.00    0.000000           0         1           setsockopt
  0.00    0.000000           0         1           set_robust_list
  0.00    0.000000           0         1           eventfd2
  0.00    0.000000           0         1           epoll_create1
  0.00    0.000000           0         2           dup3
  0.00    0.000000           0         2           pipe2
  0.00    0.000000           0         2           prlimit64
  0.00    0.000000           0      4893           cacheflush
  0.00    0.000000           0         1           set_tls
------ ----------- ----------- --------- --------- ----------------
100.00    0.000043                 10946      1232 total

any hints?

yjhjstz commented 7 years ago

Maybe the time is spent in V8 compiling the JavaScript functions.

mscdex commented 7 years ago

What are you comparing to? For example, an ARM device running off an SD card with limited memory isn't really comparable to a modern x86-64 processor with a traditional hard drive or SSD and a larger amount of memory.

neophob commented 7 years ago

Thanks @mscdex, sure the machine is a constrained device, but for example 4s just for the require of the mailgun-js package seems a bit too long. Looking at iostat, it does not look like the machine is waiting on any I/O, so @yjhjstz's comment might be a good lead.

As the require times depend on the package itself, I would like to find out why it is so slow. Currently my application needs more than 10s just for the require part.

I know I can't compare those numbers with my MacBook Pro, but even there, require times show some differences between packages:

131ms   express
2ms     file-type
180ms   loggly
292ms   mailgun-js
0ms     mustache
0ms     q
36ms    request
29ms    winston

So long story short:

bnoordhuis commented 7 years ago

Try profiling your applications like this:

$ perf record -e cpu-clock -gi node --perf_basic_prof script.js
$ perf report

That should tell you where CPU time is spent. My guess is that it's going to be V8's parser and code generator.

Experiment with passing -c <num> to perf with a suitably high number if the default sample frequency is too coarse to capture meaningful data. You can try substituting or extending -e cpu-clock with other events if your hardware has PMUs; see perf evlist.

neophob commented 7 years ago

Thanks @bnoordhuis - I did that; here are the top entries of the perf report:

+   87.66%     0.14%  node  perf-16278.map       [.] Builtin:ArgumentsAdaptorTrampoline
+   84.84%     0.03%  node  perf-16278.map       [.] LazyCompile:Module._load module.js:271
+   84.36%     0.00%  node  perf-16278.map       [.] LazyCompile:~Module.load module.js:334
+   83.89%     0.05%  node  perf-16278.map       [.] LazyCompile:~Module._compile module.js:366
+   83.78%     0.04%  node  perf-16278.map       [.] Builtin:FunctionApply
+   81.71%     0.00%  node  perf-16278.map       [.] LazyCompile:~require internal/module.js:11
+   81.55%     0.03%  node  perf-16278.map       [.] LazyCompile:~Module._extensions..js module.js:414
+   77.11%     0.27%  node  perf-16278.map       [.] Stub:CEntryStub
+   73.30%     0.00%  node  perf-16278.map       [.] Function:~ node.js:10
+   73.18%     0.02%  node  perf-16278.map       [.] Builtin:JSEntryTrampoline
+   73.16%     0.00%  node  perf-16278.map       [.] LazyCompile:~startup node.js:13
+   72.77%     0.01%  node  perf-16278.map       [.] Stub:JSEntryStub
+   72.29%     0.02%  node  nodejs               [.] v8::internal::Execution::Call(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, int, v8::internal::Handle<v8::internal::Ob
+   71.79%     0.00%  node  nodejs               [.] node::Start(int, char**)
+   71.46%     0.00%  node  nodejs               [.] v8::Function::Call(v8::Local<v8::Context>, v8::Local<v8::Value>, int, v8::Local<v8::Value>*)
+   71.28%     0.00%  node  libc-2.19.so         [.] __libc_start_main

The problem is that packages with huge dependency trees like request (see http://npm.anvaka.com/#/view/2d/request) are responsible for this huge loading time. What I find a bit fishy is that Node should load each module only once. However, in my example I have request as a direct dependency of my project, so why does Node need so much time reloading the request module when it is used by another dependency? To illustrate my question:

# npm ls

├─┬ request@2.74.0                 -> initial load of the request package, take some time
│ ├── aws-sign2@0.6.0
│ ├── aws4@1.4.1
<snip>
│ │   │   ├─┬ request@2.73.0   -> here request should return the cached module
│ │   │   │ ├── aws-sign2@0.6.0
│ │   │   │ ├── aws4@1.4.1

Especially since npm 3 flattens the module tree. Or am I seeing something wrong here?

molszanski commented 7 years ago

Can we safely say that disk I/O wait time is not the only thing to blame? Does anybody know how to reliably increase/decrease disk/CPU performance and observe the impact?

neophob commented 7 years ago

@molszanski yes, disk I/O is not an issue; I checked with iostat and there is no disk wait. So, as @yjhjstz suspected, it's compiling the JS that takes the time.

yjhjstz commented 7 years ago

@neophob maybe you can try custom startup snapshots; see http://v8project.blogspot.de/2015/09/custom-startup-snapshots.html

bnoordhuis commented 7 years ago

why does Node need so much time reloading the request module when it is used by another dependency?

Your example has two different versions of request, 2.73.0 and 2.74.0.

Does anybody know how to reliably increase/decrease disk/CPU performance and observe the impact?

Run echo 3 > /proc/sys/vm/drop_caches as root before running the benchmark; that removes the flakiness of maybe/maybe-not having the files cached in memory.

neophob commented 7 years ago

Just a quick recap

@yjhjstz your comment looks interesting, but I assume this feature is not accessible from Node, right?

As the perf output shows, most of the loading time is spent compiling the JS code. Are there any options like a "load unoptimized code now and optimize later" mode? Or storing the optimized code (this goes in the direction of the "custom startup snapshots" option)?

bnoordhuis commented 7 years ago

are there any other options like a "load unoptimized code now and optimise later" option?

That is unoptimized code you're looking at. :-)

The optimizing compiler doesn't start up until the unoptimized code has been executing for some time (and it does most of its work on a separate thread - not a useful option for the baseline compiler).

or store the optimized code (this goes into the "custom startup snapshots" option)

That's not really an option for most JS code. Snapshots run in an environment that is very different from the normal runtime.

The vm module has a produceCachedData option that is less restrictive than the snapshotting mechanism (but less restrictions != no restrictions.) YMMV.

bnoordhuis commented 7 years ago

I think there is nothing left to do or discuss here so I'll go ahead and close the issue. Reopen if anyone disagrees.