newrelic / node-newrelic

New Relic Node.js agent code base. Developers are welcome to create pull requests here, please see our contributing guidelines. For New Relic technical support, please go to http://support.newrelic.com.
Apache License 2.0

Error: This socket is closed. #76

Closed: Prinzhorn closed this issue 10 years ago

Prinzhorn commented 10 years ago

I'm using New Relic on Heroku. An Express app with MongoDB and redis.

Yesterday I did a small code change to let my Express app handle dyno restarts gracefully. What I added was:

process.once('SIGTERM', function() {
    console.info('Received SIGTERM. Closing http connection.');

    httpServer.close(function() {
        console.info('Successfully closed the http connection.');
    });
});

Before that I didn't handle SIGTERM at all, and node would just exit the process no matter what.

Now when I restart a dyno, I get dozens, if not hundreds, of these log entries before the dyno exits.

Error: This socket is closed.
    at Socket._write (net.js:635:19)
    at doWrite (_stream_writable.js:221:10)
    at writeOrBuffer (_stream_writable.js:211:5)
    at Socket.Writable.write (_stream_writable.js:180:11)
    at Socket.write (net.js:613:40)
    at Logger._emit (/app/node_modules/newrelic/node_modules/bunyan/lib/bunyan.js:693:16)
    at Logger.trace (/app/node_modules/newrelic/node_modules/bunyan/lib/bunyan.js:731:10)
    at ErrorTracer.add (/app/node_modules/newrelic/lib/error.js:144:12)
    at ErrorTracer.monitor (/app/node_modules/newrelic/lib/error.js:179:10)
    at Tracer.monitor (/app/node_modules/newrelic/lib/transaction/tracer.js:237:28)
    at Collection.wrapSegmentInvocation [as insert] (/app/node_modules/newrelic/lib/transaction/tracer.js:142:29)
    at NativeCollection.(anonymous function) [as insert] (/app/node_modules/mongoose/lib/drivers/node-mongodb-native/collection.js:129:21)

Note: I'm not sure if this is a New Relic issue or not. I'm posting this here since the newrelic module appears in the stack trace and I guess it has to do with newrelic trying to log the MongoDB action.

othiym23 commented 10 years ago

Thanks for the report! Just out of curiosity, what happens if you take New Relic out of your app? (If you're using the New Relic API, you can disable the module by setting NEW_RELIC_ENABLED to false in your dyno config.) I could be wrong, but it looks like what's happening is that Mongoose is still trying to do MongoDB operations as the process is going down and that's failing for some reason.

Also, what versions of node, mongoose, and the New Relic module are you running?

Prinzhorn commented 10 years ago

Seems to be a Heisenbug...:frowning:

I've run three heroku ps:restart within some minutes and none of them triggered the above (with newrelic still included)...

Do you know if the above (using close()) is good practice? Does it matter? Basically I want clients which just made a request to be able to finish it, but to stop accepting new clients. That's why I don't close DB connections, because they could still be needed.

node 0.10.21, mongoose 3.6.20, newrelic 1.0.0

tbjers commented 10 years ago

I think that when you use httpServer.close() you close sockets for the current process, so the newrelic module is trying to use a closed socket to report back to the API.

Prinzhorn commented 10 years ago

> I think that when you use httpServer.close() you close sockets for the current process, so the newrelic module is trying to use a closed socket to report back to the API.

Why would newrelic use my http server for reporting to the API? It's a http.Server, newrelic needs a http.ClientRequest. At least I guess so, just trying to understand what you mean.

tbjers commented 10 years ago

They share a socket from the socket pool, I think.

Prinzhorn commented 10 years ago

> They share a socket from the socket pool, I think.

I see. Maybe someone can shed some light on it.

othiym23 commented 10 years ago

Outbound requests use Agents, inbound requests are IncomingMessages, and neither of them uses a socket pool. Using server.close will have no effect on the New Relic module's ability to phone home (FYI, this only happens once per harvest cycle, which is a minute long by default, and the New Relic module doesn't use persistent connections, only keep-alive HTTP POSTs).

As far as using SIGTERM / SIGINT for clean shutdown is concerned, it's a sensible idea and should achieve your goal (the default signal handler just shuts down the process without waiting for asynchronous tasks to complete), but it may also be causing these errors as an unintentional side effect, because the process will now be waiting for all async stuff to finish before exiting. I think the Bunyan lines in the stack traces might be a red herring – it's mongoose that's complaining, and it looks like something has stale references to a MongoDB connection that's been closed. Is your app using Bunyan itself?

I'm not ruling out our code from being in the mix here, but it's seeming somewhat unlikely right now. Let us know what you discover.

Prinzhorn commented 10 years ago

> but it may also be causing these errors as an unintentional side effect, because the process will now be waiting for all async stuff to finish before exiting

Heroku will send a SIGINT (which I don't listen for, so node will exit by default) 10 seconds after SIGTERM if the process hasn't exited by then. So I don't actually need to make the process exit; I only give the clients 10 seconds to finish their requests.

I've never heard of Bunyan before. Using npm ls bunyan I can see that newrelic is the only module depending on it.

Edit: Given that it didn't happen again so far, let's assume that my code change regarding SIGTERM handling was just coincidence. So maybe leave that aside and reset your mind ;-)

othiym23 commented 10 years ago

> I've never heard of Bunyan before. Using npm ls bunyan I can see that newrelic is the only module depending on it.

Cool. Just double-checking.

I think it's possible that occasionally you have some connections to MongoDB that are hanging around, and the SIGINT is interacting with them in a weird way. It also doesn't sound like this is a New Relic-specific problem, so I'm closing this for now. Reopen it if you get any new info that implicates the module, and thanks for the info and trying out our Node support!

Prinzhorn commented 10 years ago

One more thing: I was trying to wrap my head around this stack trace (maybe it's weird due to the newrelic instrumentation):

There are two options: Either it was an error inside the mongo driver (connection issues etc.), which was then logged using bunyan or it was an error while logging something.

Assuming it was an error inside the mongo driver (as you do), why does bunyan appear inside the stack trace? At the very moment an error happens, an Error instance is created and then logged (passed along different functions). The stack property of the error object won't change, no matter what you do with it. This leads to the conclusion that the error happened while trying to log something (e.g. another error). Maybe it was then logged by my uncaughtException handler.

I'm kind of confused... looking at the bunyan src, it's using s.stream.write(s.raw ? obj : str); at the line where the error occurs, but... OH, I SEE THE LIGHT.

The socket that is closed is STDOUT

At least that's what I think... I don't know. What else would it be? I'm running with NEW_RELIC_LOG=stdout.

Prinzhorn commented 10 years ago

I'm just guessing here: Heroku pipes the output of my app to something else in order to process the logs. This means process.stdout (https://github.com/joyent/node/blob/613d76ef6a00ec091367de1e63bc51e3ab672cbd/src/node.js#L448) is of type PIPE (https://github.com/joyent/node/blob/613d76ef6a00ec091367de1e63bc51e3ab672cbd/src/node.js#L408).

Now when the dyno gets restarted, heroku closes the pipe. And BANG, the socket is closed.

Edit: I almost created an upstream issue, but https://github.com/trentm/node-bunyan#stream-errors . So it's up to newrelic to handle the stream error.

Edit2: I created an upstream issue, since I just got even more confused https://github.com/trentm/node-bunyan/issues/108

othiym23 commented 10 years ago

This sounds plausible, but that raises some questions (like, how is the stacktrace getting back to us?). @zeke, is this something that the Node buildpacks do?

One thing to keep in mind is that sometimes stacktraces get confused in the V8 side of things, especially when there's as much monkeypatching involved as there is with New Relic. So it's possible (just on the possible side of likely) that one of bunyan or mongoose is showing up on those stacktraces without actually being involved.
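The stack-trace confusion from monkeypatching can be illustrated with a tiny sketch; the name wrapSegmentInvocation is borrowed from the trace above purely for illustration, and the rest is hypothetical: once a function is wrapped, the wrapper becomes a real frame in every stack trace that passes through it.

```javascript
// Minimal illustration of why monkeypatched wrappers show up in stack
// traces: the wrapper is a genuine frame on the call stack.
var target = {
  insert: function () {
    return new Error('boom').stack; // capture the current stack
  }
};

// A wrapper in the style instrumentation libraries use (name borrowed
// from the stack trace above; this is not New Relic's actual code).
function wrapSegmentInvocation(original) {
  return function wrapped() {
    return original.apply(this, arguments);
  };
}

target.insert = wrapSegmentInvocation(target.insert);
var stack = target.insert();
// The stack now contains the `wrapped` frame even though application
// code only ever called target.insert().
console.log(/wrapped/.test(stack)); // true
```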

zeke commented 10 years ago

It's not a Heroku node buildpack thing, but maybe @tsloughter can shed some light on the Heroku request lifecycle stuff.

Prinzhorn commented 10 years ago

> like, how is the stacktrace getting back to us?

That's a very good point indeed. I was watching the stack trace with heroku logs --tail while doing the restart. The errors also ended up in logentries.com, but they did not show up in New Relic. Maybe someone familiar with Heroku can help.

Prinzhorn commented 10 years ago

It just happened again as I pushed a new update. The very first exception was this:

Error: write EPIPE
    at errnoException (net.js:901:11)
    at Object.afterWrite (net.js:718:19)

followed by a ton of "This socket is closed." errors, as in my first post.

This aligns perfectly with what I observed in https://github.com/trentm/node-bunyan/issues/108

tsloughter commented 10 years ago

I'm trying to digest all of these posts. A log is going to tail and logentries but not New Relic?

Prinzhorn commented 10 years ago

> A log is going to tail and logentries but not New Relic?

Yes, sort of, but that's not what this issue is about (it could be that tail and logentries are reading from stderr while the particular log line causing the issue was supposed to go to stdout, which is why I see them). I don't want to have this log at all (it's an error caused by newrelic/bunyan).

Basically what this issue is about: the newrelic module does not handle the error event of the process.stdout/stderr streams which it told Bunyan to write to. I would expect it to close/destroy itself and/or the Bunyan instance when those streams close.

othiym23 commented 10 years ago

Heya Tristan, thanks for taking a look at this.

As Alexander mentions, New Relic's node instrumentation is using Bunyan, which is causing a bunch of error messages to be printed out on shutdown because it's trying to write to a closed socket. I'm trying to figure out whether there's some kind of weird monkeypatching issue causing these messages (i.e. this is a Bunyan bug that New Relic needs to be working around, or maybe even a bug in the New Relic module code itself), or if this is due to stdout having been redirected to a named pipe or socket that's been closed by Heroku or foreman because the process is going down. According to @Prinzhorn this only happens intermittently, and this is something we haven't seen ourselves at New Relic. Can you shed any light on this?

Prinzhorn commented 10 years ago

@othiym23 Did you take a look at https://github.com/trentm/node-bunyan/issues/108 ? You'll find everything you need to reproduce it. Basically you need to attach an error listener to the stream, or you'll end up with a global uncaughtException when closing the pipe. And I think it's newrelic's responsibility to do so.

othiym23 commented 10 years ago

I disagree; the New Relic module doesn't open the underlying handle, except passively by using Bunyan's exported interface, and likewise New Relic never closes the handle – we assume that bunyan will continue to log output throughout the life of the process. It is a genuine error if Bunyan can't write to its log, and the correct thing to do at that point is to let it throw to the top-level. We make every effort to make the module safe and unobtrusive, but this is a case in which the module's environment is not in a sane state, and crashing is the best thing to do. Remember that you're only encountering this during shutdown.

Prinzhorn commented 10 years ago

As you said, it only happens during shutdown. I'm probably gonna leave everything as it is for now. I don't feel 100% comfortable, though. Basically this keeps the dyno from shutting down gracefully. If I didn't keep the process running after catching the uncaughtException (it's recommended to exit the process), then shutting down would force the process to exit immediately, making the SIGTERM listener useless and preventing the httpServer.close() call from succeeding. But that's all very theoretical, since I currently do let the process finish after an uncaughtException.

othiym23 commented 10 years ago

I agree that it's not the best solution, which is why I want to understand what's happening on the Heroku end. This is a thorny edge case:

Given those constraints, is there something you think New Relic could be doing better?

Prinzhorn commented 10 years ago

> Given those constraints, is there something you think New Relic could be doing better?

Off the top of my head, I really don't know. I was hoping someone would come up with an elegant solution. The only thing that bugs me is that I may end up with hundreds of log entries of type "Exception" every day, and I'd need to figure out whether any of them are valid. As you said, listening for error on the stream and not logging the errors isn't the best idea.

Maybe I'll try the following: listen for the close event on both stdout and stderr, and inside of that bind the error event and ignore it. Well, I just tried: I'm not sure if it's a bug, expected behavior, or a race condition, but with the following code you will still get an uncaughtException...

process.stdout.once('close', function() {
    process.stdout.on('error', function() {
        //Ignore the error event, since the stream is closed.
    });
});

process.stderr.once('close', function() {
    process.stderr.on('error', function() {
        //Ignore the error event, since the stream is closed.
    });
});

Listening for the error event outside of the close event works as expected.

Does this look like a node core issue? Here's a test case:

node index.js > mypipe

setInterval(function() {
    console.info('hi');
}, 500);

process.stdout.once('close', function() {
    console.error('close');
});

process.stdout.on('error', function(err) {
    console.error(err);
});

Now when closing mypipe this is the output:

{ [Error: write EPIPE] code: 'EPIPE', errno: 'EPIPE', syscall: 'write' }
{ [Error: write EPIPE] code: 'EPIPE', errno: 'EPIPE', syscall: 'write' }
close
[Error: This socket is closed.]
[Error: This socket is closed.]
[Error: This socket is closed.]
[Error: This socket is closed.]
[Error: This socket is closed.]

I would expect the close event to be the first one to fire.

Prinzhorn commented 10 years ago

I've opened a core issue https://github.com/joyent/node/issues/6452

Sorry for spamming so much in here :-)

Prinzhorn commented 10 years ago

I've pushed this into production as a workaround for now (Bunyan should check whether the stream is destroyed and stop writing to it):

['stdout', 'stderr'].forEach(function(stream) {
    process[stream].on('error', function(err) {
        //Ignore errors which are raised on a closed stream.
        if(process[stream].destroyed) {
            return;
        }

        logger.error(err);
    });
});

This results in a single "EPIPE" being written to the logs; the second "EPIPE" and all following "This socket is closed" errors are ignored.

othiym23 commented 10 years ago

I'll keep thinking about better ways to handle errors when writing to stdout or stderr, and I'm fine with adding special-case logic onto bunyan for those use cases. But one problem I foresee with the snippet you're using right now: if the logger is writing to stdout and something else goes wrong with stdout, you might end up with an error cascade of doom, as writing to the log causes errors, which get handed off to the logger to write to the log, and so on into infinity. You might want to slap a process.nextTick or setImmediate around the call to logger.error to make sure the stream gets a chance to be closed/destroyed on EPIPE or other errors.
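The deferred-logging suggestion can be sketched as follows; this is a sketch under the assumption that logger is the application's own logger, stubbed here as a recording object so the snippet is self-contained:

```javascript
// Sketch of the suggested fix: defer logger.error with process.nextTick
// so the stream has a chance to be marked destroyed after a failed
// write, letting the destroyed check break the write -> error -> write
// cascade. `logger` is a stand-in for the app's real logger.
var logged = [];
var logger = {
  error: function (err) { logged.push(err.message); }
};

['stdout', 'stderr'].forEach(function (name) {
  process[name].on('error', function (err) {
    process.nextTick(function () {
      if (process[name].destroyed) {
        return; // stream is gone; logging would only raise more errors
      }
      logger.error(err);
    });
  });
});
```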

Prinzhorn commented 10 years ago

You're right. The only thing worse than endless recursion is endless async recursion, because it can run forever. I wrapped it in nextTick.