Open voidc opened 3 years ago
The PR adds support for obtaining PidFds, but you can't actually do anything with them. Do we want additional methods on `PidFd` in std to wait, send signals, obtain the `/proc` directory etc., or should that be left to 3rd-party crates?
I added an unresolved question about using `clone3`, given what we found in #89522. Could we use `clone` instead of `clone3`?
From https://github.com/rust-lang/rust/pull/77168#discussion_r496933698
The reason that I used `__clone3` is that glibc's `clone()` implementation requires the caller to allocate a stack for the new process (even though the kernel allows passing a null stack pointer to the raw `clone` syscall). We could allocate a stack ourselves, but I think it would be better to avoid that if at all possible.
It's not safe to call either clone syscall directly. It's possible that the `clone` function handles this correctly and makes a consistent child state, but this isn't properly documented anywhere, requires the function to be aware of the argument semantics, and does not seem to be happening in practice. It's an open question on musl what we should do with this; we'll probably end up making it work like `fork` eventually, but it's in limbo for now. In short: don't use `clone`. Prefer `posix_spawn`, and use `fork` and `execve` when you can't.
> In short: don't use `clone`. Prefer `posix_spawn`, and use `fork` and `execve` when you can't.
The issue is that neither of those approaches provides a pidfd atomically.
@cuviper mentions `pidfd_open` as an option, but I don't think we can uphold its requirements for race-freedom. Practically speaking, using it will still be an improvement over using pids, but it still undermines one of the promises of pidfds.
What atomicity requirements would you be concerned about? As the parent of the child you just created, you are the one process who absolutely can know, with no ambiguity due to identifier-reuse races, that the pid refers to the child process you created. The only way this can be undermined is by the process's own bad behavior, e.g. calling `wait` rather than `waitpid` with a pid (which is inherently a resource-lifetime-unsafe operation, so don't do that).
Further: if you want to protect even against that, the child, between `fork` and `execve`, can open its own pidfd and pass it back to the parent, or otherwise signal to the parent that it's still alive after the `pidfd_open` completes in the parent via some communication channel between them. But I don't think that's necessary. Rust should not be using (or supporting the use of) resource-lifetime-unsafe operations like `wait` without a target pid.
> The only way this can be undermined is by the process's own bad behavior, e.g. calling `wait` rather than `waitpid` with a pid (which is inherently a resource-lifetime-unsafe operation, so don't do that).
The standard library doesn't control what user code does. There's nothing forbidding custom signal handlers or calling `wait`.
Having a way to handle children that doesn't depend on global program state would be nice.
> But I don't think that's necessary. Rust should not be using (or supporting the use of) resource-lifetime-unsafe operations like `wait` without a target pid.
Perhaps, but currently there's nothing forbidding it, and the nix crate exposes exactly that as a safe method: https://docs.rs/nix/0.23.0/nix/sys/wait/fn.wait.html

But yeah, we could file this under the same category as manipulating `/proc/self/mem`; it's just one more footgun to be aware of.
Indeed. There are all sorts of system interfaces you can use to violate invariants. The user of the language should understand that if they do that, all bets are off and they get to keep both pieces. I don't see any strong justification for preventing doing it via `wait` when you can still do it via `process_vm_writev`.
(Coming here from https://github.com/tokio-rs/tokio/issues/3520 because I got bit badly by having two `SIGCHLD` handlers in the same process.)
I strongly agree with Rich here; most of the benefit is gained if std just uses `pidfd_open()` to start, on the plain pid. In particular, that would allow async runtimes like tokio to use it to monitor the child process and avoid installing a global `SIGCHLD` handler, which is global mutable state that is often touched by other languages.

Any strong objections to a PR to do that?
When `pidfd_open` is used in the parent process it's difficult to guarantee race-freedom, especially in the presence of other runtimes installing `SIGCHLD` handlers with unknown flags. The foolproof but complex way is to do it in the child process and then send the fd to the parent process over a unix socket, as mentioned in a previous comment, similar to the pipe that's used to signal exec failures.
@richfelker

> As the parent of the child you just created, you are the one process who absolutely can know, with no ambiguity due to identifier-reuse races, that the pid refers to the child process you created.
I'm looking into integrating pidfd support into an event loop library. When I look at the details of `pidfd_open()` it really seems like there'd be a race condition, but all I've found on it have been unresolved arguments like this one.

Specifically, what is preventing:

1. Parent spawns a child
2. Parent gets the child's PID
3. Child completes or is killed
4. PID is re-used for another process
5. Parent calls `pidfd_open()` with the PID it got in step 2
6. Parent now has a pidfd for the wrong process

The argument I linked to above espouses using application-specific heuristics to avoid this. But this can't apply to library writers, and still leaves the door open for coincidences (i.e. the heuristics match the process that is re-using that PID).

What can a parent do after step 2 to avoid this?
> Child completes or is killed
> PID is re-used for another process
The child process will stay a zombie (owning the PID, which will not be reused) until the parent calls `waitpid()`. This has nothing to do with `pidfd_open()` at all really; it's how Unix has worked since... well, as long as I know.
Ah thanks, I had completely overlooked that. Indeed the stdlib docs even mention that as a potential exhaustion risk. Oops.
The implementation was changed to call `pidfd_open` in the child and then transfer the file descriptor. With that, `create_pidfd` should now be usable in more environments.
For anyone following along: with #117957, pidfds are now used by a few more methods, and two bugs have been fixed.
The new pidfd_open + send over unix socket approach uses `fork` AFAIK; it would be great if it could use `vfork` for better performance and lower memory usage.
We have a `posix_spawn` path that uses a vfork-like `clone()` under the hood. But that requires glibc to implement pidfd support in `posix_spawn`, which hasn't happened yet.
Using `vfork` directly would already have issues; it's extremely tricky to use correctly and it doesn't call `pthread_atfork` handlers. Though I don't know if we guarantee that they're called on process spawning.
Using `clone3` directly would be even more troublesome since we can't use glibc after it's called. We'd have to do raw syscalls.
If @joshtriplett gets his `iouring_spawn` work into the kernel then we could use that instead and avoid all those problems.
> We have a `posix_spawn` path that uses a vfork-like `clone()` under the hood. But that requires glibc to implement pidfd support in `posix_spawn`, which hasn't happened yet.
Good to see that they plan on adding that!

Though to use it, I think Rust would have to bump the minimum glibc version? Otherwise it would have to check for availability of the new functionality and fall back to the existing implementation.
> Using `vfork` directly would already have issues; it's extremely tricky to use correctly and it doesn't call `pthread_atfork` handlers. Though I don't know if we guarantee that they're called on process spawning.
Yes, but I think it would be the most efficient implementation, given that `posix_spawn` currently does not support pidfd. Since `posix_spawn` internally uses `clone` on Linux and `vfork` on other Unix systems, I think it's possible to write it in Rust.
> Using `clone3` directly would be even more troublesome since we can't use glibc after it's called. We'd have to do raw syscalls.
Thanks, now I see why this is a problem. IIRC, the last time I tried calling the clone syscall directly it was really painful: not just the arg passing, but also having to manually set up a new stack.
> If @joshtriplett gets his `iouring_spawn` work into the kernel then we could use that instead and avoid all those problems.
But it would require an io_uring to be initialised? It would definitely be reasonable for an async runtime like tokio to use it, but for the stdlib there would be a heavy initialisation cost to pay, and I'm not sure that's reasonable for all use cases.
> Since `posix_spawn` internally uses `clone` on Linux and `vfork` on other Unix systems, I think it's possible to write it in Rust.
That's not necessarily true. glibc can use `vfork` more easily since it controls the libc state, just like glibc can use `clone3` internally but user code can't. At the very least, in Rust we'd have to use the unstable `#[ffi_returns_twice]` attribute, and there probably are other problems with it, such as not touching anything upstack or global state.
> But it would require an io_uring to be initialised? It would definitely be reasonable for an async runtime like tokio to use it, but for the stdlib there would be a heavy initialisation cost to pay, and I'm not sure that's reasonable for all use cases.
I assume that setting up a ring once and caching it isn't that expensive, especially in comparison to bringing up a whole process (consider all the stuff that happens after `exec`).
Anyway, using vfork is a separate topic. We should discuss this in a new issue or on zulip. The fork + exec code-path already exists and is used for everything that posix_spawn doesn't cover. So this isn't exactly a novel issue.
> That's not necessarily true. glibc can use `vfork` more easily since it controls the libc state, just like glibc can use `clone3` internally but user code can't. At the very least, in Rust we'd have to use the unstable `#[ffi_returns_twice]` attribute, and there probably are other problems with it, such as not touching anything upstack or global state.
Thanks, I didn't know `ffi_returns_twice` is required.
> I assume that setting up a ring once and caching it isn't that expensive, especially in comparison to bringing up a whole process (consider all the stuff that happens after `exec`).
True, but I think it would probably be better if it could be configured and made optional.
> The PR adds support for obtaining PidFds, but you can't actually do anything with them. Do we want additional methods on `PidFd` in std to wait, send signals, obtain the `/proc` directory etc., or should that be left to 3rd-party crates?
I just opened https://github.com/tokio-rs/tokio/issues/6281 and realized that `PidFd` has no methods. IMO having methods to wait and send signals is indeed useful, while other more exotic ones (e.g. `process_madvise`) are better left for third-party crates.
glibc just added support for `pidfd_spawn` in glibc 2.39. Once Rust bumps the minimum glibc to 2.39, I think we can use it instead, which will probably enable use of vfork again.
We can use it as a weak symbol even before we ever raise our minimum that far, just like we do for spawning with chdir.
Alas, the released `pidfd_spawn` API lacks functionality that `clone3` has, namely the ability to return the pid and pidfd at the same time. And `pidfd_getpid` requires procfs. This can be worked around by pre-opening `/proc/self` before trying to use the new API, but it does add some complexity and additional sources of failure.
> Alas, the released `pidfd_spawn` API lacks functionality that `clone3` has, namely the ability to return the pid and pidfd at the same time. And `pidfd_getpid` requires procfs. This can be worked around by pre-opening `/proc/self` before trying to use the new API, but it does add some complexity and additional sources of failure.
I'd be open to merging a patch that extends pidfs by adding an ioctl() that returns the pid associated with the pidfd.
Feature gate: `#![feature(linux_pidfd)]`
This is a tracking issue for Linux-specific extension methods that allow obtaining process file descriptors for processes spawned with the standard Command API.
Public API
Steps / History
Unresolved Questions

- Using `clone3` means we can't safely call libc in the child: https://github.com/rust-lang/rust/issues/89522#issuecomment-944636262
- `pidfd_open` may work, but it has conditions on avoiding pid-recycling races.
- Should `Child::pidfd(&self)` be removed? It can lead to `Child::wait` returning errors instead of a saved exit status if `PidFd::wait` obtains the exit status first, which may be surprising behavior.