Thank you, this is very helpful. @bnoordhuis, I think we should remove those EIO loops.
@ry: I don't know what EIO loops you mean.
@sadtaco: Node is trying to log an error to stderr and it fails. Are you redirecting stderr and if so, to what? Can you hook it up to gdb?
gdb --args node script.js
break src/unix/error.c:85
run
# wait until the breakpoint is hit
backtrace
print sys_errno
Sorry, I confused EIO with EINTR. I meant https://github.com/joyent/node/blob/3f862cf1795fb69378489e24ea6f69cbe9897b72/deps/uv/src/unix/stream.c#L398-405
http://pastebin.com/jFi7XAFt is what gdb gave me.
It doesn't seem like I'm using the 'url' API incorrectly in any way, but maybe I'm wrong there.
TypeError: Parameter 'url' must be a string, not undefined
at Object.urlParse [as parse] (url.js:92:11)
at SocketNamespace.<anonymous> (/home/user/livebb.js:109:20)
That looks like a bug in your code:
var ref = nodeUrl.parse(socket.handshake.headers.referer).hostname +
          nodePath.dirname(nodeUrl.parse(socket.handshake.headers.referer).pathname);
Clients don't necessarily set the Referer header.
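(A minimal guard for that case might look something like the sketch below; it assumes the same nodeUrl/nodePath requires and socket.io handler context as the snippet above, and the empty-string fallback is purely illustrative.)
var nodeUrl = require('url');
var nodePath = require('path');

// Hypothetical guard: only parse the Referer header when the client actually sent one.
var headers = socket.handshake.headers;
var ref = ''; // illustrative fallback for requests without a Referer
if (headers.referer) {
  var parsed = nodeUrl.parse(headers.referer);
  ref = parsed.hostname + nodePath.dirname(parsed.pathname);
}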
But that's unrelated to the EIO errors you were seeing.
I had that running with gdb for a few weeks with no issue. Then I turned gdb off and started it via "sudo /node/node file.js > /dev/null &", and within a day it goes into the EIO loop and uses up all my server's CPU. :/
@sadtaco @bnoordhuis Is this still an issue?
I have a strong feeling it comes from starting it with "sudo node.js path &": once it can no longer output anything, it produces that error.
Running it with the console kept open works fine, so I think I just need to implement a better "run as daemon" setup.
Not absolutely sure if this is related, but I am seeing similar 100% CPU usage symptoms.
I am running node v0.6.10 on an EC2 micro instance (Ubuntu Maverick Server 10.10), downloading a 100 MB file from S3 to that instance; each chunk is decrypted and written to another HTTP stream that I am downloading to my local machine.
What happens is that because the EC2-to-S3 connection is fast, my node process has downloaded and written the entire file downstream by the time only 5% of it has reached my local machine, so I believe all of that data resides somewhere in node or OS buffers. Meanwhile node is eating 100% CPU and not responding to any other incoming HTTP requests at all.
Once the download completed on my local machine, node became responsive again.
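(For context, a stripped-down sketch of that kind of relay, with a hypothetical host and port and the decryption step elided, shows where the data would queue up: every chunk is handed straight to res.write(), so when the client reads far more slowly than S3 delivers, the rest sits in Node's write buffer.)
var http = require('http');

// Hypothetical relay standing in for the S3-to-client pipeline described above
// (decryption elided; host, path and port are placeholders).
http.createServer(function (req, res) {
  http.get({ host: 's3.example.com', path: '/file' }, function (upstream) {
    upstream.on('data', function (chunk) {
      // The fast S3 side delivers chunks much faster than the slow client
      // can accept them, so everything queues up in Node's write buffer.
      res.write(chunk);
    });
    upstream.on('end', function () {
      res.end();
    });
  });
}).listen(8080);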
@asaarinen: Can you trace your app with strace -o trace.log node app.js
and post the log? I've had reports similar to what you describe but I haven't been able to pinpoint it (though I have a gut feeling that it's related to how Amazon schedules VMs).
@bnoordhuis: here goes http://pastebin.com/ei7sVHR4
I cut out just this piece of the log. While it was running, CPU was at 100%, and by the time 70% of the file had been received from S3, only 100 KB of it had come downstream, at about 10 KB/s (i.e. much slower than my connection).
Regarding Amazon VM scheduling: with micro instances they slash the CPU if it is constantly spiking at 100%. However, I don't think that is causing this, because the CPU goes to 100% exactly when I start the download, and nothing much else is going on there.
I don't see anything exceptional in the trace (the part you posted, at least). Can you post the output of time node app.js and of strace -cf node app.js? Knowing how much time is spent in kernel mode versus user mode, plus the syscall counts, may give some insight.
It spends most of its time in userland; see http://pastebin.com/6CbfeVGJ
I am also wondering whether the node process has to buffer a lot and the buffers get swapped to disk or something. However, I would assume that would show a lot more time spent in the kernel, and there should be plenty of memory available even if the whole 100 MB file were buffered.
I'm reproducing the results here for posterity.
time node app.js:
49.93user 3.50system 0:59.39elapsed 89%CPU (0avgtext+0avgdata 719456maxresident)k
0inputs+0outputs (0major+121658minor)pagefaults 0swaps
strace -o trace.log -cf node app.js:
 % time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 63.25   13.745354        6142      2238           nanosleep
 23.97    5.209725        3926      1327         1 futex
  5.97    1.296700        1293      1003           read
  2.14    0.464029        9667        48           munmap
  1.86    0.404026        2479       163        16 open
  1.75    0.380024       34548        11           sendto
  0.85    0.184084         856       215        15 write
<snip>
It's kind of worrisome that nearly 64% of syscall time is spent in nanosleep(). Can you rerun the test with this small patch applied? (That's probably not it, but let's be thorough anyway.)
diff --git a/deps/uv/src/unix/ev/ev.c b/deps/uv/src/unix/ev/ev.c
index a432bfb..334c7c0 100644
--- a/deps/uv/src/unix/ev/ev.c
+++ b/deps/uv/src/unix/ev/ev.c
@@ -2488,7 +2488,6 @@ ev_run (EV_P_ int flags)
       if (expect_true (sleeptime > 0.))
         {
-          ev_sleep (sleeptime);
           waittime -= sleeptime;
         }
     }
If that doesn't fix the throughput, can you please revert the patch and benchmark your app with callgrind?
$ valgrind --tool=callgrind --callgrind-out-file=callgrind.log node app.js
$ cg_annotate --auto callgrind.log
Note that valgrind/callgrind basically runs your program in a virtual machine, so it's really, really slow. Grab another cup of coffee, you'll have plenty of time for it. :-)
Will try the patch; in the meantime I tried running with node --prof.
It gives me the following:
Statistical profiling result from v8.log, (247570 ticks, 41 unaccounted, 0 excluded).
[Shared libraries]:
   ticks    total  nonlib   name
  194785    78.7%    0.0%   b775a000-b775b000
   44965    18.2%    0.0%   /lib/libcrypto.so.0.9.8
     537     0.2%    0.0%   /lib/libc-2.12.1.so
       2     0.0%    0.0%   /usr/lib/libstdc++.so.6.0.14
       1     0.0%    0.0%   /lib/libssl.so.0.9.8
Not exactly sure what to make of it, but some googling leads me to believe that the shared library taking up the time may be linux-gate.so, which handles virtual system calls. If that is the case, I'm again not sure what to think, since the timing indicated most of the time was spent in user space. :-/
Yes, that's the vDSO. It ranking so high is probably the result of calling gettimeofday() or clock_gettime() an awful lot; there's not much else it does.
After some investigation I think I've found the culprit, and it was 80% me: there was a point in the app where the streams were not correctly paused and resumed. The app kept writing to a stream that was returning false from write().
However, IMHO it is a bit aggressive behaviour for node to then start consuming 100% CPU when it could just buffer the excess data, because now it is the application that needs to do that anyway. Maybe there's a reason for that?
It's part of the contract of the streams API. Buffering up is almost always a bad idea because it's very easy to chew up lots of memory. stdout in node used to buffer; search the bug tracker to see what kinds of problems that caused.
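(A minimal sketch of that contract, reusing the hypothetical relay from earlier: stop reading from the source when write() returns false and resume it on 'drain'.)
var http = require('http');

// Same hypothetical relay as before, but honouring the write() contract:
// pause the source when the destination's buffer is full, resume on 'drain'.
http.createServer(function (req, res) {
  http.get({ host: 's3.example.com', path: '/file' }, function (upstream) {
    upstream.on('data', function (chunk) {
      if (res.write(chunk) === false) {
        upstream.pause(); // the client can't keep up; stop reading for now
      }
    });
    res.on('drain', function () {
      upstream.resume(); // the buffer has flushed; safe to write again
    });
    upstream.on('end', function () {
      res.end();
    });
  });
}).listen(8080);
In practice upstream.pipe(res) implements the same pause/resume handling for you.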
Using node v0.6.2, http://pastebin.com/sdV8qBDQ is what strace gives me.
http://pastebin.com/H4Sf87Kt is my code. I don't think there is anything in there that could be causing an infinite loop like this, so it seems like a node issue to me.
No one on IRC was able to help, and they said I should report it here.