tokio-rs / tokio

A runtime for writing reliable asynchronous applications with Rust. Provides I/O, networking, scheduling, timers, ...
https://tokio.rs
MIT License
27.05k stars 2.49k forks source link

Signals do not add SA_ONSTACK which may cause linked Go applications to crash. #3520

Open Firstyear opened 3 years ago

Firstyear commented 3 years ago

Version tokio-util v0.6.3 () tokio v1.2.0 ()

Platform Linux pyrite 5.10.12-1-default #1 SMP Sat Jan 30 19:15:49 UTC 2021 (a3c8888) x86_64 x86_64 x86_64 GNU/Linux

Description

When a module built with the tokio runtime is linked or used by an application with go, the related go module crashes. In this situation, the cause is that the nss_kanidm.so module which is built with tokio 1.2.0 is loaded via nsswitch, and this interfers with the signal handling of the docker daemon causing it to crash.

Feb 13 16:29:57 pyrite dockerd[1242]: fatal error: non-Go code set up signal handler without SA_ONSTACK flag
Feb 13 16:29:57 pyrite dockerd[1242]: runtime stack:
Feb 13 16:29:57 pyrite dockerd[1242]: runtime: unexpected return pc for runtime.sigtramp called from 0x7feba8889553
Feb 13 16:29:57 pyrite dockerd[1242]: stack: frame={sp:0xc000b14348, fp:0xc000b143a0} stack=[0xc000b0c298,0xc000b14698)
Feb 13 16:29:57 pyrite dockerd[1242]: 000000c000b14248:  000000c000b14250  000055df6ca73760 <runtime.throw.func1+0>
Feb 13 16:29:57 pyrite dockerd[1242]: 000000c000b14258:  000055df6e676cc7  0000000000000039
Feb 13 16:29:57 pyrite dockerd[1242]: 000000c000b14268:  000000c000b14288  000055df6ca5c8f1 <runtime.sigNotOnStack+129>
Feb 13 16:29:57 pyrite dockerd[1242]: 000000c000b14278:  000055df6e676cc7  0000000000000039
Feb 13 16:29:57 pyrite dockerd[1242]: 000000c000b14288:  000000c000b14338  000055df6ca5be81 <runtime.sigtrampgo+753>
Feb 13 16:29:57 pyrite dockerd[1242]: 000000c000b14298:  0000000000000011  000000c000b14320
Feb 13 16:29:57 pyrite dockerd[1242]: 000000c000b142a8:  000000c000b14440  0000000000000000
...
Feb 13 16:29:57 pyrite dockerd[1242]: runtime.throw(0x55df6e676cc7, 0x39)
Feb 13 16:29:57 pyrite dockerd[1242]:         /usr/lib64/go/1.13/src/runtime/panic.go:774 +0x74
Feb 13 16:29:57 pyrite dockerd[1242]: runtime.sigNotOnStack(0x11)
Feb 13 16:29:57 pyrite dockerd[1242]:         /usr/lib64/go/1.13/src/runtime/signal_unix.go:578 +0x81

The previous nss_kanidm tokio cargo.toml was:

tokio = { version = "1", features = ["rt", "macros", "sync", "time", "net", "io-util", "signal"] }

nss_kanidm module would make a call to a localhost resolver via a unix domain socket consuming a tokio util codec type. However this required block_on and a tokio run time, so the following code was used in the nss client support module.

pub async fn call_daemon(path: &str, req: ClientRequest) -> Result<ClientResponse, Box<dyn Error>> {
    ...
}

pub fn call_daemon_blocking(
    path: &str,
    req: ClientRequest,
) -> Result<ClientResponse, Box<dyn Error>> {
    let rt = Builder::new_current_thread()
        .enable_all()
        .build()
        .map_err(Box::new)?;
    rt.block_on(call_daemon(path, req))
}

As tokio was built with signal support the call to "enable_all" would cause signal handlers to be added. These signal handlers are managed by signal-hook-registry, and it appears they are not loaded with the sigaction flags for SA_ONSTACK. Go loads it's signal handlers on an alternate stack, meaning any other registered handlers which do not use this flag cause the Go runtime to panic.

Workarounds

Remove the "signal" flag from tokio in Cargo.toml to prevent these being loaded in the call for enable_all.

Possible Resolutions

I think there are a few ways this could be handled.

One is that signal-hook-registry has the capability to call sigaction with SA_ONSTACK in case that Rust + Go is linked in the same situation, but this may require work with the related projects.

Another possible approach is that Builder is very "coarse" with the enable flags. The options today are:

Today enable_io from docs.rs states "Doing this enables using net, process, signal, and some I/O types on the runtime." This is rather "coarse", when the Cargo.toml in tokio shows that signal is not a dependency to net/io (but is for process).

So perhaps tokio could improve the enable_ flags to have more fine grained settings such as enable_net, enable_process (implies signal). This would allow consumers to have finer control to avoid signal handler registration which can also work around the issue.

Thanks,

Darksonn commented 3 years ago

I didn't know that we set a signal handler if you did not directly use the tokio::signal module.

ipetkov commented 3 years ago

Signal handlers are lazily installed the first time a Signal instance is created.

(Note that on Unix platforms tokio's process management will also install a signal handler, in case some other part of the code is using processes)

Firstyear commented 3 years ago

https://github.com/tokio-rs/tokio/blob/master/tokio/src/runtime/driver.rs#L10

When enable_io is used, process, signal and io handlers are all installed. Unless I'm mistaken this is the what I'm asking, if this could be enabled in a more fine-grained manner.

ipetkov commented 3 years ago

@Firstyear

https://github.com/tokio-rs/tokio/blob/master/tokio/src/runtime/driver.rs#L10

This is where the driver is registered, but signal handlers are lazily registered here: https://github.com/tokio-rs/tokio/blob/master/tokio/src/runtime/driver.rs#L10

It seems like something in the application is registering a signal one way or another. If the signal driver is disabled, then that part of the code will panic (although it will help you find out where the registration is happening at least).

I'm not very familiar with SA_ONSTACK, but maybe there is a way we can allow the signal driver to be configured to use it

psFried commented 3 years ago

I found this issue after running into the same problem with a Rust library that's being called from Go. I wrote a little test program and was able to confirm that the signal handler for SIGCHLD is installed eagerly when the process and signal features are enabled, even when the program does nothing but sleep, and exit.

Here's the main.rs from the test program:

#[tokio::main]
async fn main() {
    tokio::time::sleep(std::time::Duration::from_secs(600)).await;
}

I ran that, and then followed these instructions while the program was sleeping. With the dependency tokio = {version = "1", features = ["process", "signal", "rt-multi-thread", "time", "macros"]}, I get the following output from cat /proc/<pid>/status:

SigBlk: 0000000000000000
SigIgn: 0000000000001000
SigCgt: 0000000180010440

The SigCgt above indicates that it is catching signal 17 (SIGCHLD). When I remove the process and signal features, then the output shows SigCgt: 0000000180000440, indicating that it's not catching SIGCHLD.

I don't know whether that should be considered a bug or not, or if it was intentional. I could see an argument either way. But I did confirm that with tokio 0.2.25 this signal handler is not installed eagerly, but with 0.3.7 it is.

Firstyear commented 3 years ago

I think that fine-grained enable_ would be a useful way to resolve this as it is possible for a crate/package to want to have signal + io enabled, but differing runtimes with or without those features.

I have noticed that if any crate in my dependency chain requests signal then it "taints" all the way back to my crate even though I do not have the signal feature enabled. So fine grained enable/access would probably help here anyway.

ipetkov commented 3 years ago

I wrote a little test program and was able to confirm that the signal handler for SIGCHLD is installed eagerly when the process and signal features are enabled, even when the program does nothing but sleep, and exit.

Ah yes you're right! I had forgotten that the process driver will eagerly install a SIGCHLD handler which is used for all spawned processes. That's likely the culprit of the original issue. I agree that a finer grained configuration of the runtime could prevent such eager registrations

Firstyear commented 3 years ago

https://github.com/tokio-rs/tokio/pull/3521 <<-- my PR does this, but it's rough, likely not ready for merge at all. The names probably aren't great either. Maybe it's a start you can build from or you can advise on how to clean it up?

carllerche commented 3 years ago

Thanks for posting your issue here.

Could you elaborate on why omitting the signal feature flag doesn't work for your case? This is the intended strategy for disabling signal handling.

Firstyear commented 3 years ago

Because another part of my application does require signal handling.

ipetkov commented 3 years ago

Because another part of my application does require signal handling.

Wouldn't this part of the application turn signal handling back on in the runtime and bring you back to the same issue? I'm not super familiar with SA_ONSTACK, but wonder if there is a way to teach the runtime how to use it and have the application opt-into such a configuration

Firstyear commented 3 years ago

@ipetkov It's actually worse than that. What it is is that this part of the application sets up one tokio runtime without signal handling in the pam/nsswitch modules. But because we link to another crate which also has tokio, and that does have the signal feature, it taints through and activates it in this part of the application. IE if you have A without signal, and it depends on B that does have signal, A has signal activated.

That's why I implemented https://github.com/tokio-rs/tokio/pull/3521 because this allows the runtime in A to select what it needs, even though the tokio features have been tainted through the dependency on B.

I'm not really invested enough to try to solve the SA_ONSTACK problem, but certainly it has value for tokio in being able to better cooperate with languages like go that do set SA_ONSTACK.

cgwalters commented 3 years ago

I'm here because I found out that tokio is installing a global SIGCHLD handler, which in my case conflicts with the one from glib.

Now this is not a new problem, and as of relatively recently pidfd exists on Linux to fix this since you get a file descriptor that one can just add to epoll etc. Looks like there's an async-pidfd crate. Though, this is also something that the standard library could benefit from...ah and now that I look, there's this tracking issue and some recent attempts to use it.

cgwalters commented 3 years ago

I think discussing SA_ONSTACK here is a bit of a red herring - the real issue is that Tokio installing a SIGCHLD handler at all may cause whatever else in the process is doing so to lose notifications.

OK wait, no apparently I'm wrong, Go (naturally now that I think about it) doesn't use SIGCHLD as far as I can tell, it just uses goroutines that block in waitpid(). Each call to cmd.Wait() ends up around here.

But that said, either way it's still process global state; the best long term solution is to get away from that. On Linux, that's pidfds as mentioned above. Unfortunately I need to care about RHEL8 for quite a while, and I'm sure I'm not the only one needing something that works on pre-pidfd Linux (or other Unix I guess).

In the case of a multithreaded Tokio runtime, we could perhaps follow Go's model and basically do a thread per child process. Maybe as some sort of opt-in tokio::runtime::Builder::threaded_waitpid()? That'd work for everyone who owns the end binary linking in tokio alongside something else.

cgwalters commented 3 years ago

In the case of a multithreaded Tokio runtime, we could perhaps follow Go's model and basically do a thread per child process. Maybe as some sort of opt-in tokio::runtime::Builder::threaded_waitpid()?

In case anyone else comes here, one can also work around this by avoiding using Tokio's child process APIs and instead just use the ones from std in a thread. See this PR: https://github.com/containers/containers-image-proxy-rs/pull/13 which was a quite seamless change in that case.

ipetkov commented 3 years ago

the real issue is that Tokio installing a SIGCHLD handler at all may cause whatever else in the process is doing so to lose notifications.

I'm curious to understand what may be happening here. Tokio uses the signal-hook crate to install a OS signal handler, and it should take care of forwarding the signal dispatch to any other previous signal handler, as expected. (At least this was the case when I looked at this last and I don't believe anything should have changed there).

cgwalters commented 3 years ago

Tokio uses the signal-hook crate to install a OS signal handler, and it should take care of forwarding the signal dispatch to any other previous signal handler, as expected.

Ah, I see. (reads) - that's some nice code.

In my case the GLib code happens to run after tokio, and the GLib code is making no attempt to preserve the previous handler. Hmm...that's obviously doable in the same way that signal-hook is. May look at that, but I'd need to wait a while before I could rely on it. And now that I think about it, I could likely work around this by spawning a dummy process before using tokio too.

OK so I take my earlier comments back, this issue is then somewhat specific to Go's use of SA_ONSTACK.

But the above around using pidfd etc. on Linux I think still stands as the right fix at least on Linux.

ipetkov commented 3 years ago

In my case the GLib code happens to run after tokio, and the GLib code is making no attempt to preserve the previous handler.

The Unix signal APIs are far from ergonomic, though chaining signal handlers is the "expected" behavior one should follow. I find it unfortunate (and frankly surprising) that the GLib will clobber this outright...

That said, I have no problems with exploring a pidfd based implementation for Tokio and I'd be happy to review any such efforts! The tricky bit will be implementing a form of feature detection and falling back to OS signals if pidfd is not available (Tokio does not have a supported-kernel-version policy, and requiring that pidfd is available could be a breaking change for some users).

cgwalters commented 3 years ago

Re pidfd, I think it makes sense to wait until it lands in std; there is extensive discussion in the tracking issue https://github.com/rust-lang/rust/issues/82971 - in particular it seems to have gotten hung up on issues around going behind libc's back for clone3, but I may try to see if we can restart things there around just using pidfd_open() if available.

NobodyXu commented 10 months ago

6152 adds pidfd support to tokio and use pidfd instead of signal in process::Child::wait, I think it might fix this issue.

Darksonn commented 10 months ago

We could still register a signal handler if the fallback is used. Furthermore, tokio::signal also registers signal handlers.

NobodyXu commented 10 months ago

You are right, this issue is more general than just tokio::process.