peerigon / phridge

A bridge between node and PhantomJS
The Unlicense
519 stars 50 forks source link

Ignore EPIPE and ECONNRESET errors from stdin #35

Closed SystemParadox closed 8 years ago

SystemParadox commented 9 years ago

Fixes #34.

Thanks.

coveralls commented 9 years ago

Coverage Status

Coverage decreased (-0.43%) to 99.57% when pulling 991600b881adf63a34bf3b7e5b3c52f5f084cac9 on SystemParadox:34-fix-epipe-error into 60545fb4197315ad74b9e03303f2a2c03bcf1042 on peerigon:master.

jhnns commented 9 years ago

Thx for reporting this.

It seems like this is a Linux problem because it does not occur on OSX. My assumption is that Phantomjs receives the SIGINT signal and handles it by itself. Thus it is not necessary that phridge also tries to shutdown the process.

However, I don't like to just silently ignore errors, because that is usually just a quickfix and will haunt you when you try to debug something without getting an error :)

I think I'll rewrite your PR so that phridge only ignores these errors when a SIGINT is received.

Why do you also ignore ECONNRESET? This error does not occur in my test setup...

SystemParadox commented 9 years ago

As discussed in #34, Linux shells tend to group child processes into 'process groups'. When the shell gets a ctrl-c, it doesn't just send it to the current foreground process, it sends it to the entire process group, which means the children receive it as well. If you send the signal with kill -SIGINT this doesn't happen, it's just a shell thing. However, this is an aside- it's just a symptom of a wider problem - which is that phridge throws uncaught EPIPE/ECONNRESET exceptions if the phantomJS process dies.

I agree entirely about not silently ignoring errors :). HOWEVER, in this case I believe that this may quite likely be the correct behaviour. EPIPE/ECONNRESET are fired because the process has died - it's just not been cleaned up because process.on('exit') hasn't fired yet.

This can only be solved in one of the following ways:

  1. Find a way to guarantee that exit fires before EPIPE/ECONNRESET and either clean up so that they never fire or set a flag so that they will be ignored from that point.
  2. Ignore EPIPE/ECONNRESET with the expectation that exit will fire to clean everything up
  3. Treat EPIPE/ECONNRESET/exit all the same. Cleanup, then ignore further events.

Setting to ignore only once a SIGINT is received doesn't handle the case when just phantomJS is killed.

I didn't notice ECONNRESET until I started writing the tests - where I ended up randomly getting EPIPE or ECONNRESET about 50:50. I am not sure whether I have also seen this outside of the mocha test cases or not. That one probably does need more investigation.

Thanks.

SystemParadox commented 9 years ago

I should say that if you wish to reproduce the wider test case, start a node process which runs phridge to create a phantomJS, then kill (SIGTERM, SIGINT, SIGKILL, whatever) the phantomJS, and then allow node to run phantom.dispose(). It causes node to die horribly with EPIPE errors from nowhere.

jhnns commented 9 years ago

Yeah, I guess you're right.

Setting to ignore only once a SIGINT is received doesn't handle the case when just phantomJS is killed.

When is this case? I think, this should never be the case under normal circumstances...

SystemParadox commented 9 years ago

Probably not under normal circumstances. However, if I do killall phantomjs I would expect the node process to spawn another one, not die with an uncaught exception from nowhere.

If a phantomJS process dies, it makes sense for any tasks it was running to return errors. But the errors should be specific to those requests - they shouldn't blow up the whole server.

Possibly phridge or the phantomjs object should also emit some kind of error, but it needs to be a clear, documented, catchable error so that a server-type process can catch it, print a warning, spawn a new instance and carry on.

domasx2 commented 9 years ago

:+1: I get ECONNRESET if phantom process is killed while evaluating a page, this would fix it

jhnns commented 8 years ago

Could you guys check out the current master? I've improved the error handling. Does especially this commit fix your problems?

I don't like to discard errors. That's why I've written an error handler which waits 300ms for the process to shut down. If that's not the case, the error is emitted on the Phantom instance where the developer is able to catch it. If the error is not handled, node throws it.

SystemParadox commented 8 years ago

Glad to see stdin/stdout/etc are now using _onUnexpectedError and rejecting deferreds on the Phantom object instead of throwing errors from undocumented internals. That's a big improvement :)

However, I think the timeout is fragile and confusing and just introduces race conditions.

I really think you should just remove this. To my mind, there are only 3 places when we care about errors:

  1. I am waiting for a promise to resolve and an error occurs
  2. I try to initiate a call and an error occurs
  3. I try to initiate a call but that process has already errored/closed/disposed. If the specific error has already been thrown to a previous listener, this should probably just be the instance is already disposed error. But if the error occured when no-one was listening, it should be stored and thrown to the first person who cares about it.
    • Errors should only be thrown synchronously when calls are made or by rejecting promises. There should never be an uncaught error. We should not be using EventEmitter and expecting the user to listen for 'out-of-band' errors.
    • If an error occurs when no-one is listening, it should be stored and thrown to the first person who cares about it
    • If an error occurs when no-one is listening, and no valid call ever gets made for it to be thrown to, then we are clearly shutting down and don't care
    • dispose() should completely ignore any errors that may occur if the process has already been terminated or is in the process of doing so

Thanks.

jhnns commented 8 years ago

Thx for reading through the source code and for providing feedback. That's why I didn't just publish a new version :grinning:

I have to admit, using a timeout seems a bit hacky, but I don't think that it's too fragile. It basically just says: "Ok, I'll wait 300ms for the process to exit (e.g. because a SIGTERM was received and the process is busy shutting down). In that case we don't want to report anything because the error was kind of expected. After that delay we emit the error and pass responsibility back to the user, because we ahve to assume that it was not expected".

To answer your questions:

What if my process takes longer than 300ms to exit?

Than you have to handle the error event in order to prevent the process to crash. It's just an assumption that most applications are able to exit within 300ms.

What if my process caught the SIGINT and decided not to exit at all?

Again, than you have to handle the error event.

Errors should only be thrown synchronously when calls are made or by rejecting promises. There should never be an uncaught error. We should not be using EventEmitter and expecting the user to listen for 'out-of-band' errors.

I don't think that this is true. There are error conditions without explicit calls, like receiving SIGTERM from "the outside". Using an error event for these situations is kind of nodejs standard.

If an error occurs when no-one is listening, it should be stored and thrown to the first person who cares about it

I don't think that this is a good behavior. The error would be completely out of context. Imho it's best practice to report errors as soon as possible – and within the correct context.

dispose() should completely ignore any errors that may occur if the process has already been terminated or is in the process of doing so

Agreed. The current solution does that already.

SystemParadox commented 8 years ago

There are error conditions without explicit calls, like receiving SIGTERM from "the outside". Using an error event for these situations is kind of nodejs standard.

A SIGTERM/SIGINT/SIGWHATEVER is an event, not an error. If a process is told to exit by the sysadmin, operating system, etc, I don't consider that to be an error. We could be shutting down. It's only an error if you then try to use this process.

The error would be completely out of context. Imho it's best practice to report errors as soon as possible – and within the correct context.

This is my issue with the timeout approach. It feels like the error is out of context, since it's not being thrown immediately (I realise my "storing the error and throwing it to the next caller" approach has the same issue, but at least it was given to someone who cares about it, and likely within the context of a request which will almost certainly have its own error handler rather than crashing the whole process).

If you are going to throw errors EventEmitter-style, this will need to be very clearly documented as something the user has to think about. To me it seems that throwing a timeout into the mix here just makes things more complicated. The purpose of the timeout is to decrease the number of things the user has to think about, so that the user doesn't need to care about random errors occuring for the simple case of SIGINT/SIGTERM to process groups. But we haven't helped the situation for anyone. All we've done is made the error unreliable - we've introduced a race condition. Everyone has to read the documentation and handle the possible error because it could happen in production even if it works fine in development/testing. If the machine is slow or busy, you might get an unexpected crash.

What we've done is taken a possible uncaught error that everyone has to deal with and made it more complicated. Everyone still has to handle it because it still might happen. Except now it's 300ms late and unreliable.

I agree that it would be nice to handle this process group issue for SIGINT/SIGTERM so that most users don't have to think about it, but I don't think there is any way to distinguish this case. This leaves only two options:

  1. Only throw errors which affect a call (either in progress or one made against an errored process)
  2. Clearly document that random uncaught errors may occur when a phantom process is killed, particularly with reference to the process group case when pressing ctrl-c. Everyone has to handle them, and all of us will probably just ignore them or print them as warnings because we can't tell the difference either.
jhnns commented 8 years ago

I agree to all of your arguments. Adding a timeout only makes it easier for the group of users who are lucky to have a process which exits within 300ms :wink:.

What about a mixture of both options? We could emit an unexpectedExit-event which is emitted as soon as PhantomJS is unusable. This way we'll give the user the chance to handle this (rather unlikely) situation immediately. However, sending SIGINT/SIGTERM to the process group won't throw an error anymore.

Of course, this needs to be documented in the README.

SystemParadox commented 8 years ago

An opt-in error handler sounds like a good idea to me. So by default it won't crash the process. Any pending promises will be rejected, and any new calls will throw the "already exited" error. If the process is exiting when the user didn't expect for some reason, they can investigate by listening to the unexpectedExit event.

Am I right in assuming that with this approach you would be removing the timeout and that error events would never be emitted?

When rejecting promises or throwing the "already exited" error, it would be nice to either throw the error that caused it, or possibly better, attach the original error as a property on the error that gets thrown.

Thanks very much :)

jhnns commented 8 years ago

Am I right in assuming that with this approach you would be removing the timeout and that error events would never be emitted?

Error events on childProcess and its streams will still be emitted, but they won't crash the process anymore because phridge attaches a listener for it.

When rejecting promises or throwing the "already exited" error, it would be nice to either throw the error that caused it, or possibly better, attach the original error as a property on the error that gets thrown.

Yes, I'll do that, but I'm afraid they won't be very helpful either, like Error: write EPIPE :wink:

jhnns commented 8 years ago

Just published this with v1.2.0