davepacheco closed this issue 11 years ago
you might as well provide a patch :dart: hehe
Dave, can you try this patch?
diff --git a/src/unix/pipe.c b/src/unix/pipe.c
index b28c8ef..1185b91 100644
--- a/src/unix/pipe.c
+++ b/src/unix/pipe.c
@@ -183,9 +183,6 @@ void uv_pipe_connect(uv_connect_t* req,
   uv_strlcpy(saddr.sun_path, name, sizeof(saddr.sun_path));
   saddr.sun_family = AF_UNIX;
 
-  /* We don't check for EINPROGRESS. Think about it: the socket
-   * is either there or not.
-   */
   do {
     r = connect(uv__stream_fd(handle),
                 (struct sockaddr*)&saddr, sizeof saddr);
@@ -193,7 +190,8 @@ void uv_pipe_connect(uv_connect_t* req,
   while (r == -1 && errno == EINTR);
 
   if (r == -1)
-    goto out;
+    if (errno != EINPROGRESS)
+      goto out;
 
   if (new_sock)
     if (uv__stream_open((uv_stream_t*)handle,
@@ -213,8 +211,9 @@ out:
   req->cb = cb;
   ngx_queue_init(&req->queue);
 
-  /* Run callback on next tick. */
-  uv__io_feed(handle->loop, &handle->io_watcher);
+  /* Force callback to run on next tick in case of error. */
+  if (err != 0)
+    uv__io_feed(handle->loop, &handle->io_watcher);
 
   /* Mimic the Windows pipe implementation, always
    * return 0 and let the callback handle errors.
Perfect: I applied the patch to node master, and I can't reproduce the problem. Thanks!
Thanks for testing, Dave. Landed in v0.8 and master.
In production, I found a Node v0.8.14 program hung trying to connect over a Unix domain socket. According to the kernel state, the socket was connected, but the Node program was not polling on it.
I was able to reproduce this with the following HTTP server:
and the following HTTP client:
The client makes requests to the server over a Unix Domain Socket in a continuous loop. There's an interval timer to detect the hang by noticing when the program has failed to complete any requests within 5 seconds. Since the HTTP server is local and services requests immediately, we should never see a request take anywhere near 5 seconds to complete, and the client should run indefinitely.
On a SmartOS server, I ran the HTTP server and one copy of the HTTP client, and the programs ran for tens of minutes without crashing. Then I started 5 more clients, and within a minute all six of them had dumped core.
I used MDB to examine the core file to see if I could find the pending request object. This core file has 643 requests, but only two of them either have no "response" object or have not emitted "end" on that object:
The second of these looks garbage-collected, but the first one, b5f87bd1, appears to be our hung request:
Notice that the "connecting" field on the socket is "true". Using the "handle" pointer in the socket, with a little work, we can get to the underlying PipeWrap object. Node stores this in the first internal field of the object, which is just after the "elements" field of the JSObject:
(Obviously, we should add an MDB dcmd to make it easier to go directly from the JS handle object to the underlying C++ PipeWrap object.) I don't have CTF (debug information) for Node, so we can't print the PipeWrap object easily, but it's not hard to find the libuv handle inside it by dumping the memory:
Thanks to @rmustacc's help, I was able to load CTF for libuv, allowing us to see the uv_pipe_t:
The important pieces are:
With the watchers uninitialized, it looks like either the program never called uv__stream_open, which calls uv__io_set to initialize the watchers, or something uninitialized them afterwards. Not finding any code path that would uninitialize them, I looked for a case where uv__stream_open is not called. I guessed that this might happen if the program saw EINPROGRESS from connect(2), and I ran this DTrace script while reproducing the problem to test that hypothesis:
Sure enough, the output looks like this:
and on this system errno 150 is:
So we saw thousands of successful connects while the program ran, followed by one that returned -1 with EINPROGRESS, after which the process exited.
Looking at the code, if the connect() call in uv_pipe_connect() returns -1 with EINPROGRESS, the uv__stream_open() call is skipped, and the program returns to the caller in pipe_wrap.c. As far as I can tell, that caller never adds the new socket to the poll set, so the program never finds out that the UDS has connected. That explains why, when I first saw this, the Node program wasn't polling on the socket and hung indefinitely.
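For illustration, the handling that is missing in this code path can be sketched outside libuv. This is a minimal sketch in plain POSIX C, not the libuv implementation: the helper names `uds_connect_start` and `uds_connect_finish` are hypothetical, and a real event loop would register the descriptor for writability in its poll set rather than block in poll(2).

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: start a non-blocking connect to a Unix domain socket.
 * Returns the descriptor if the connect succeeded or is still in progress
 * (*pending is set to 1 in the latter case), or -1 with errno set. */
int uds_connect_start(const char *path, int *pending) {
  struct sockaddr_un saddr;
  int fd, flags, r, saved;

  *pending = 0;
  if (strlen(path) >= sizeof(saddr.sun_path)) {
    errno = ENAMETOOLONG;
    return -1;
  }

  fd = socket(AF_UNIX, SOCK_STREAM, 0);
  if (fd == -1)
    return -1;

  flags = fcntl(fd, F_GETFL, 0);
  if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
    goto fail;

  memset(&saddr, 0, sizeof saddr);
  saddr.sun_family = AF_UNIX;
  strncpy(saddr.sun_path, path, sizeof(saddr.sun_path) - 1);

  do
    r = connect(fd, (struct sockaddr *)&saddr, sizeof saddr);
  while (r == -1 && errno == EINTR);

  if (r == -1) {
    if (errno != EINPROGRESS)
      goto fail;
    /* Not an error: the connect will complete later, so the caller
     * must keep watching this descriptor. */
    *pending = 1;
  }
  return fd;

fail:
  saved = errno; /* close() may clobber errno */
  close(fd);
  errno = saved;
  return -1;
}

/* Hypothetical helper: wait for a pending connect to finish by polling for
 * writability, then read the outcome from SO_ERROR.
 * Returns 0 on success, or -1 with errno set. */
int uds_connect_finish(int fd, int timeout_ms) {
  struct pollfd pfd;
  socklen_t len;
  int err, r;

  pfd.fd = fd;
  pfd.events = POLLOUT;
  pfd.revents = 0;

  do
    r = poll(&pfd, 1, timeout_ms);
  while (r == -1 && errno == EINTR);

  if (r == -1)
    return -1;
  if (r == 0) {
    errno = ETIMEDOUT;
    return -1;
  }

  err = 0;
  len = sizeof err;
  if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == -1)
    return -1;
  if (err != 0) {
    errno = err;
    return -1;
  }
  return 0;
}
```

A caller would invoke `uds_connect_start()` and, whenever `*pending` is set, either call `uds_connect_finish()` or add the descriptor to its own poll set for writability. Skipping that second step is the hang described above.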
Finally, there's a comment in the code addressing this situation: "We don't check for EINPROGRESS. Think about it: the socket is either there or not."
If I understand correctly, the comment is justifying not checking for EINPROGRESS by suggesting that connect() on a UDS cannot return that error code. But I cannot find that claim documented anywhere in POSIX or man pages, and obviously illumos-based systems do return EINPROGRESS on Unix domain sockets in some situations. (Even if that claim were true, an assertion would seem more appropriate here so that the failure mode could be crisp if a system were ever malfunctioning. But as far as I can tell, the kernel is well within its rights to return EINPROGRESS here.)
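To make the assertion idea concrete, here is a small sketch of classifying the result of connect(2) so that an "impossible" EINPROGRESS fails loudly instead of being silently ignored. The enum, the function name `classify_connect`, and the `einprogress_possible` flag are all hypothetical and are not part of libuv.

```c
#include <assert.h>
#include <errno.h>

enum connect_state { CONNECT_DONE, CONNECT_PENDING, CONNECT_FAILED };

/* Hypothetical helper: classify the return value `r` and saved errno `err`
 * of a connect(2) call that has already been retried on EINTR.
 * `einprogress_possible` encodes the caller's belief about the platform:
 * pass 0 only if EINPROGRESS is assumed never to occur, in which case
 * seeing it aborts the process rather than leaving a connection unwatched. */
enum connect_state classify_connect(int r, int err, int einprogress_possible) {
  if (r == 0)
    return CONNECT_DONE;

  if (err == EINPROGRESS) {
    /* Crisp failure mode if the assumption turns out to be wrong. */
    assert(einprogress_possible && "connect() returned EINPROGRESS unexpectedly");
    return CONNECT_PENDING;
  }

  return CONNECT_FAILED;
}
```

With the flag set to 1, the same helper simply reports a pending connect, which is the behavior the kernel appears to be entitled to expect from callers.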
The fix is that uv_pipe_connect() needs to handle EINPROGRESS from connect(2) appropriately.