libev engine has inconsistent handling of signals that raise an exception on the main thread

metcalf commented 7 years ago

When sending a signal that raises an exception (e.g SIGTERM) while waiting on a NIO select on the main thread, behavior is either:

Raise the correct SignalException
Fail to raise the correct SignalException but exit with a status code that indicates that SIGTERM was processed at some level.
Fail to raise the correct SignalException and exit with a normal status code.

This only reproduces with libev. My test fails for Java because I'm using fork which isn't supported in JRuby.

You can see this reproduced in: https://github.com/socketry/nio4r/pull/133

metcalf commented 7 years ago

The specific issue is not that the signal causes the selector to wake up, but that the behavior is inconsistent. It seems like something internal to Ruby is getting into an unexpected state if we can exit with a status code indicating a SIGTERM without raising the correct SignalException.

tarcieri commented 7 years ago

I agree that consistent signal handling behavior is desirable. The correct way to handle this in libev would be to wrap its own signal handling behavior:

http://search.cpan.org/~mlehmann/EV-4.21/libev/ev.pod#ev_signal_-_signal_me_when_a_signal_gets_signalled!

Handling forks is especially tricky. From the libev docs:

The special problem of inheritance over fork/execve/pthread_create

Both the signal mask (sigprocmask) and the signal disposition (sigaction) are unspecified after starting a signal watcher (and after stopping it again), that is, libev might or might not block the signal, and might or might not set or restore the installed signal handler (but see EVFLAG_NOSIGMASK).

While this does not matter for the signal disposition (libev never sets signals to SIG_IGN, so handlers will be reset to SIG_DFL on execve), this matters for the signal mask: many programs do not expect certain signals to be blocked.

This means that before calling exec (from the child) you should reset the signal mask to whatever "default" you expect (all clear is a good choice usually).

The simplest way to ensure that the signal mask is reset in the child is to install a fork handler with pthread_atfork that resets it. That will catch fork calls done by libraries (such as the libc) as well.

metcalf commented 7 years ago

I confirmed that I can reproduce this with Process.spawn instead of forking: https://github.com/metcalf/nio4r/commit/ca870e3c036abf969bd45e64b4d53fa2a90943ee

tarcieri commented 7 years ago

The tl;dr: of the libev advice is to install a fork handler (i.e. pthread_atfork()) which resets the signal mask. However this claims it isn't as simple as that:

https://lwn.net/Articles/415684/

From the post, it looks like it should hopefully provide "best effort" treatment of the signal mask, at least on Linux, and should work with posix_spawn() (which I would assume is what Process.spawn invokes, at least on Linux):

1) For the system() call, POSIX says: "It is unspecified whether the handlers registered with pthread_atfork() are called as part of the creation of the child process." In glibc, they aren't. 2) Regarding posix_spawn, POSIX says: "It is implementation-defined whether the fork handlers are run when posix_spawn() or posix_spawnp() is called." In glibc, they are. 3) The linux-specific clone() system-call does not have atfork handlers called.

andrew-stripe commented 7 years ago

I was also able to repro outside the tests where I don't even require 'nio' until after spawning the new process. That leads me to believe that this isn't specific to forking and is instead a timing issue with when the signal is received in any process that uses NIO.

I'm using this for a relatively low-concurrency application so it appears that I could use the pure Ruby backend as a short term mitigation. You mention here that the Ruby backend should only be used under "exceptional circumstances". Do you think it's a reasonable choice for event loops that handle <<100 monitors?

andrew-stripe commented 7 years ago

Ah sorry, commented from my work account andrew-stripe == metcalf

tarcieri commented 7 years ago

You can certainly try to use it, although I'd definitely like to get the libev backend fixed.

ioquatix commented 6 years ago

Just wondering where we are at with this. Do we have a test case to reproduce the bug?

socketry / nio4r

libev engine has inconsistent handling of signals that raise an exception on the main thread #134

The special problem of inheritance over fork/execve/pthread_create