Open metcalf opened 7 years ago
The specific issue is not that the signal causes the selector to wake up, but that the behavior is inconsistent. It seems like something internal to Ruby is getting into an unexpected state if we can exit with a status code indicating a SIGTERM without raising the correct SignalException.
I agree that consistent signal handling behavior is desirable. The correct way to handle this in libev would be to wrap its own signal handling behavior:
Handling forks is especially tricky. From the libev docs:
The special problem of inheritance over fork/execve/pthread_create
Both the signal mask (sigprocmask) and the signal disposition (sigaction) are unspecified after starting a signal watcher (and after stopping it again), that is, libev might or might not block the signal, and might or might not set or restore the installed signal handler (but see EVFLAG_NOSIGMASK).
While this does not matter for the signal disposition (libev never sets signals to SIG_IGN, so handlers will be reset to SIG_DFL on execve), this matters for the signal mask: many programs do not expect certain signals to be blocked.
This means that before calling exec (from the child) you should reset the signal mask to whatever "default" you expect (all clear is a good choice usually).
The simplest way to ensure that the signal mask is reset in the child is to install a fork handler with pthread_atfork that resets it. That will catch fork calls done by libraries (such as the libc) as well.
I confirmed that I can reproduce this with Process.spawn
instead of forking:
https://github.com/metcalf/nio4r/commit/ca870e3c036abf969bd45e64b4d53fa2a90943ee
The tl;dr: of the libev advice is to install a fork handler (i.e. pthread_atfork()
) which resets the signal mask. However this claims it isn't as simple as that:
https://lwn.net/Articles/415684/
From the post, it looks like it should hopefully provide "best effort" treatment of the signal mask, at least on Linux, and should work with posix_spawn()
(which I would assume is what Process.spawn
invokes, at least on Linux):
1) For the system() call, POSIX says: "It is unspecified whether the handlers registered with pthread_atfork() are called as part of the creation of the child process." In glibc, they aren't. 2) Regarding posix_spawn, POSIX says: "It is implementation-defined whether the fork handlers are run when posix_spawn() or posix_spawnp() is called." In glibc, they are. 3) The linux-specific clone() system-call does not have atfork handlers called.
I was also able to repro outside the tests where I don't even require 'nio'
until after spawning the new process. That leads me to believe that this isn't specific to forking and is instead a timing issue with when the signal is received in any process that uses NIO.
I'm using this for a relatively low-concurrency application so it appears that I could use the pure Ruby backend as a short term mitigation. You mention here that the Ruby backend should only be used under "exceptional circumstances". Do you think it's a reasonable choice for event loops that handle <<100 monitors?
Ah sorry, commented from my work account andrew-stripe == metcalf
You can certainly try to use it, although I'd definitely like to get the libev backend fixed.
Just wondering where we are at with this. Do we have a test case to reproduce the bug?
When sending a signal that raises an exception (e.g SIGTERM) while waiting on a NIO
select
on the main thread, behavior is either:This only reproduces with libev. My test fails for Java because I'm using
fork
which isn't supported in JRuby.You can see this reproduced in: https://github.com/socketry/nio4r/pull/133