nodejs / node-v0.x-archive

Moved to https://github.com/nodejs/node

node.js uses up 100% cpu and doesn't respond #2157

Closed sadtaco closed 12 years ago

sadtaco commented 12 years ago

I'm using node v0.6.2. http://pastebin.com/sdV8qBDQ is what strace gives me.

http://pastebin.com/H4Sf87Kt is my code. I don't think there is anything in there that could be doing an infinite loop that'd cause something like this, so it seems like a node issue to me.

No one on IRC was able to help; they suggested I report it here.

ry commented 12 years ago

Thank you, this is very helpful. @bnoordhuis, I think we should remove those EIO loops.

bnoordhuis commented 12 years ago

@ry: I don't know what EIO loops you mean.

@sadtaco: Node is trying to log an error to stderr and it fails. Are you redirecting stderr and if so, to what? Can you hook it up to gdb?

gdb --args node script.js
break src/unix/error.c:85
run
# wait until the breakpoint is hit
backtrace
print sys_errno
ry commented 12 years ago

Sorry, I confused EIO with EINTR. I meant https://github.com/joyent/node/blob/3f862cf1795fb69378489e24ea6f69cbe9897b72/deps/uv/src/unix/stream.c#L398-405

sadtaco commented 12 years ago

http://pastebin.com/jFi7XAFt is what gdb gave me.

It doesn't look like I'm misusing the 'url' API in any way, but maybe I'm missing something.

bnoordhuis commented 12 years ago

TypeError: Parameter 'url' must be a string, not undefined
    at Object.urlParse [as parse] (url.js:92:11)
    at SocketNamespace.<anonymous> (/home/user/livebb.js:109:20)

That looks like a bug in your code:

var ref = nodeUrl.parse(socket.handshake.headers.referer).hostname +
  nodePath.dirname(nodeUrl.parse(socket.handshake.headers.referer).pathname);

Clients don't necessarily set the Referer header.

But that's unrelated to the EIO errors you were seeing.
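
For illustration, a defensive version of that lookup could look something like the sketch below (the refererDir helper name is made up; it only assumes the socket.handshake.headers object your code already uses):

var nodeUrl = require('url');
var nodePath = require('path');

// Hypothetical helper: only parse the Referer if the client actually sent one.
function refererDir(socket) {
  var referer = socket.handshake.headers.referer;
  if (!referer) return null; // clients are free to omit the Referer header
  var parsed = nodeUrl.parse(referer);
  return parsed.hostname + nodePath.dirname(parsed.pathname || '/');
}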

sadtaco commented 12 years ago

I had it running under gdb for a few weeks with no issue. I turned that off and started it via "sudo /node/node file.js > /dev/null &", and within a day it goes into the EIO loop and uses up all my server's CPU. :/

isaacs commented 12 years ago

@sadtaco @bnoordhuis Is this still an issue?

sadtaco commented 12 years ago

I have a strong feeling it comes from starting it with "sudo node.js path &"; when it can no longer output anything, it produces that error.

Running it with the console kept open works fine, so I think I just need to implement a better "run as daemon" setup.

asaarinen commented 12 years ago

Not absolutely sure if this is related, but I am seeing similar 100% CPU usage symptoms.

I am running node v0.6.10 on an EC2 micro instance (Ubuntu Maverick server 10.10) and downloading a 100MB file from S3 to that instance; each chunk is decrypted and written to another HTTP stream that I am downloading to my local machine.

Because the EC2 - S3 connection is fast, my node process had downloaded and written the whole file downstream when only 5% of it had arrived at my local machine, so I believe all of that data resides somewhere in node or OS buffers. Meanwhile node is eating 100% CPU and not responding to any other incoming HTTP requests at all.

When the download was completed to my local machine, node became responsive again.
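
For concreteness, a minimal sketch of that relay shape is shown below (hypothetical host and path, decryption omitted). Note that it writes every chunk regardless of res.write()'s return value, which is the detail that comes up later in this thread:

var http = require('http');

// Hypothetical relay: stream a large object from a fast upstream to a slow client.
// Whatever res.write() cannot flush immediately is buffered inside node.
http.createServer(function(req, res) {
  http.get({ host: 's3.example.com', path: '/bigfile' }, function(s3res) {
    s3res.on('data', function(chunk) {
      res.write(chunk); // return value ignored
    });
    s3res.on('end', function() {
      res.end();
    });
  });
}).listen(8000);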

bnoordhuis commented 12 years ago

@asaarinen: Can you trace your app with strace -o trace.log node app.js and post the log? I've had reports similar to what you describe but I haven't been able to pinpoint it (though I have a gut feeling that it's related to how Amazon schedules VMs).

asaarinen commented 12 years ago

@bnoordhuis: here goes http://pastebin.com/ei7sVHR4

I cut just this piece of the log. While it was running, CPU was at 100%; by the time 70% of the file had been received from S3, only 100k of it had come downstream, at 10k/sec (i.e. much slower than the connection I have).

Regarding Amazon VM scheduling: with micro instances they slash the CPU if it is constantly spiking at 100%. However, I don't think that is causing this, because the CPU goes to 100% exactly when I start the download and not much else is going on there.

bnoordhuis commented 12 years ago

I don't see anything exceptional in the trace (the part you posted at least). Can you post the output of time node app.js and strace -cf node app.js? Knowing how much time is spent in kernel mode vis-a-vis user mode and the syscall count may give some insights.

asaarinen commented 12 years ago

It spends most of the time in userland, see http://pastebin.com/6CbfeVGJ

I am also wondering whether the node process has to buffer a lot and the buffers get swapped to disk or something. However, I would assume that would show a lot more time spent in the kernel, and there should be plenty of memory available even if the whole 100MB file were buffered.

bnoordhuis commented 12 years ago

I'm reproducing the results here for posterity.

time node app.js:

49.93user 3.50system 0:59.39elapsed 89%CPU (0avgtext+0avgdata 719456maxresident)k
0inputs+0outputs (0major+121658minor)pagefaults 0swaps

strace -o trace.log -cf node app.js:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 63.25   13.745354        6142      2238           nanosleep
 23.97    5.209725        3926      1327         1 futex
  5.97    1.296700        1293      1003           read
  2.14    0.464029        9667        48           munmap
  1.86    0.404026        2479       163        16 open
  1.75    0.380024       34548        11           sendto
  0.85    0.184084         856       215        15 write
  <snip>

It's kind of worrisome that nearly 64% of syscall time is spent in nanosleep(). Can you rerun the test with this small patch applied? (That's probably not it but let's be thorough anyway).

diff --git a/deps/uv/src/unix/ev/ev.c b/deps/uv/src/unix/ev/ev.c
index a432bfb..334c7c0 100644
--- a/deps/uv/src/unix/ev/ev.c
+++ b/deps/uv/src/unix/ev/ev.c
@@ -2488,7 +2488,6 @@ ev_run (EV_P_ int flags)

                 if (expect_true (sleeptime > 0.))
                   {
-                    ev_sleep (sleeptime);
                     waittime -= sleeptime;
                   }
               }

If that doesn't fix the throughput, can you please revert the patch and benchmark your app with callgrind?

$ valgrind --tool=callgrind  --callgrind-out-file=callgrind.log node app.js
$ cg_annotate --auto callgrind.log

Note that valgrind/callgrind basically runs your program in a virtual machine, so it's really, really slow. Grab another cup of coffee, you'll have plenty of time for it. :-)

asaarinen commented 12 years ago

Will try the patch; in the meantime I tried running with node --prof.

It gives me the following:

Statistical profiling result from v8.log, (247570 ticks, 41 unaccounted, 0 excluded).

 [Shared libraries]:
   ticks  total  nonlib   name
  194785   78.7%    0.0%  b775a000-b775b000
  44965   18.2%    0.0%  /lib/libcrypto.so.0.9.8
    537    0.2%    0.0%  /lib/libc-2.12.1.so
      2    0.0%    0.0%  /usr/lib/libstdc++.so.6.0.14
      1    0.0%    0.0%  /lib/libssl.so.0.9.8

Not exactly sure what to make of it, but some googling leads me to believe that the shared library taking the time may be linux-gate.so, which does virtual system calls. If that is the case, I'm again not sure what to think of it, since the time output indicated most of the time was spent in user space :-/

bnoordhuis commented 12 years ago

Yes, that's the vDSO. Its ranking so high is probably the result of calling gettimeofday() or clock_gettime() an awful lot; there's not much else it does.

asaarinen commented 12 years ago

After some investigation I think I've found the culprit, and it was 80% me: at a certain point in the app the streams were not correctly paused and resumed. The app kept writing to a stream that was returning false from write().

However, IMHO it is a bit aggressive behaviour for node to start consuming 100% CPU when it could just buffer the excess data, since now the application has to do that anyway. Maybe there's a reason for that?

bnoordhuis commented 12 years ago

It's part of the contract of the streams API. Buffering up is almost always a bad idea because it's very easy to chew up lots of memory. stdout in node used to buffer; search the bug tracker to see what kinds of problems that caused.
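
In code, that contract looks roughly like the sketch below (made-up stream names): stop writing when write() returns false and resume only once the writable side emits 'drain'. This is essentially what readable.pipe(writable) does for you.

// Sketch of the streams contract: back off when the writable buffer is full,
// continue once it has drained.
function forward(readable, writable) {
  readable.on('data', function(chunk) {
    if (!writable.write(chunk)) {
      readable.pause(); // write() returned false: downstream buffer is full
    }
  });
  writable.on('drain', function() {
    readable.resume(); // buffer flushed: keep reading
  });
  readable.on('end', function() {
    writable.end();
  });
}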