rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org

Tracking Issue for linux_pidfd #82971

Open voidc opened 3 years ago

voidc commented 3 years ago

Feature gate: #![feature(linux_pidfd)]

This is a tracking issue for Linux-specific extension methods that allow obtaining process file descriptors (pidfds) for processes spawned with the standard Command API.

Public API

// std::os::linux::process

pub struct PidFd;

impl PidFd {
   pub fn kill(&self) -> Result<()> {...}
   pub fn wait(&self) -> Result<ExitStatus> {...}
   pub fn try_wait(&self) -> Result<Option<ExitStatus>> {...}
}

impl AsRawFd for PidFd {}
impl FromRawFd for PidFd {}
impl IntoRawFd for PidFd {}

// sealed trait, implemented for std::process::Child
pub trait ChildExt {
    fn pidfd(&self) -> Result<&PidFd>;
    fn into_pidfd(self) -> Result<PidFd, Self>;
}

// sealed trait, implemented for std::process::Command
pub trait CommandExt {
    fn create_pidfd(&mut self, val: bool) -> &mut process::Command;
}
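
A usage sketch of the proposed API might look as follows. This is nightly-only (it requires the `linux_pidfd` feature gate), assumes the method names proposed above, and needs Linux 5.3+ for pidfd support:

```rust
#![feature(linux_pidfd)]
use std::os::linux::process::{ChildExt, CommandExt};
use std::process::Command;

fn main() -> std::io::Result<()> {
    // Ask the runtime to create a pidfd alongside the child.
    let mut child = Command::new("sleep")
        .arg("10")
        .create_pidfd(true)
        .spawn()?;

    // Borrow the pidfd; this fails if create_pidfd(true) was not
    // set or the kernel does not support pidfds.
    let pidfd = child.pidfd()?;

    // Signal through the pidfd: race-free even if the pid has
    // been reaped and reused elsewhere in the meantime.
    pidfd.kill()?;
    child.wait()?;
    Ok(())
}
```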

Steps / History

Unresolved Questions

the8472 commented 3 years ago

The PR adds support for obtaining PidFds, but you can't actually do anything with them. Do we want additional methods on PidFd in std to wait, send signals, obtain the /proc directory etc. or should that be left to 3rd party crates?

cuviper commented 2 years ago

I added an unresolved question about using clone3, given what we found in #89522.

the8472 commented 2 years ago

Could we use clone instead of clone3?

From https://github.com/rust-lang/rust/pull/77168#discussion_r496933698

The reason that I used __clone3 is that glibc's clone() implementation requires the caller to allocate a stack for the new process (even though the kernel allows passing a null stack pointer to the raw clone syscall). We could allocate a stack ourselves, but I think it would be better to avoid that if at all possible.

richfelker commented 2 years ago

It's not safe to call either clone syscall directly. It's possible that the clone function handles this correctly and makes a consistent child state, but this isn't properly documented anywhere, requires the function to be aware of the argument semantics, and does not seem to be happening in practice. It's an open question on musl what we should do with this; we'll probably end up making it work like fork eventually, but it's in limbo for now. In short: don't use clone. Prefer posix_spawn, and use fork and execve when you can't.

the8472 commented 2 years ago

In short: don't use clone. Prefer posix_spawn, and use fork and execve when you can't.

The issue is that neither of those approaches provides a pidfd atomically.

@cuviper mentions pidfd_open as an option, but I don't think we can uphold its requirements for race-freedom. Practically speaking, using it would still be an improvement over using pids, but it still undermines one of the promises of pidfds.

richfelker commented 2 years ago

What atomicity requirements would you be concerned about? As the parent of the child you just created, you are the one process who absolutely can know, with no ambiguity due to identifier reuse races, that the pid refers to the child process you created. The only way this can be undermined is by the process's own bad behavior, e.g. calling wait rather than waitpid with a pid (which is inherently a resource-lifetime-unsafe operation, so don't do that).
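
For context, std's Child::wait already behaves like the pid-targeted waitpid described here: it reaps only its own child. A short stable-Rust illustration (assuming `sleep` and `true` exist on the system):

```rust
use std::process::Command;

fn main() {
    let mut slow = Command::new("sleep").arg("2").spawn().unwrap();
    let mut fast = Command::new("true").spawn().unwrap();

    // Reaping `fast` is targeted at its pid, so it cannot
    // accidentally consume `slow`'s exit status.
    assert!(fast.wait().unwrap().success());

    // `slow` is still running and still ours to reap.
    assert!(slow.try_wait().unwrap().is_none());
    assert!(slow.wait().unwrap().success());
    println!("targeted waits reaped each child exactly once");
}
```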

richfelker commented 2 years ago

Further: if you want to protect even against that, the child, between fork and execve, can open its own pidfd and pass it back to the parent, or otherwise signal to the parent that it's still alive after the pidfd_open completes in the parent via some communication channel between them. But I don't think that's necessary. Rust should not be using (or supporting the use of) resource-lifetime-unsafe operations like wait without a target pid.

the8472 commented 2 years ago

The only way this can be undermined is by the process's own bad behavior, e.g. calling wait rather than waitpid with a pid (which is inherently a resource-lifetime-unsafe operation, so don't do that).

The standard library doesn't control what user code does. There's nothing forbidding custom signal handlers or calling wait. Having a way to handle children that doesn't depend on global program state would be nice.

But I don't think that's necessary. Rust should not be using (or supporting the use of) resource-lifetime-unsafe operations like wait without a target pid.

Perhaps, but currently there's nothing forbidding it and the nix crate exposes exactly that as a safe method. https://docs.rs/nix/0.23.0/nix/sys/wait/fn.wait.html

But yeah, we could file this under the same category as manipulating /proc/self/mem, it's just one more footgun to be aware of.
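
The /proc/self/mem comparison is easy to demonstrate from stable Rust: the kernel will happily rewrite a local that the language considers immutable. A Linux-only, purely illustrative sketch:

```rust
use std::fs::OpenOptions;
use std::io::{Seek, SeekFrom, Write};

fn main() {
    // black_box keeps `x` in memory instead of a register.
    let x: i32 = std::hint::black_box(1);
    let addr = &x as *const i32 as u64;

    // Writing to /proc/self/mem bypasses every language-level
    // guarantee; the kernel only checks page permissions.
    let mut mem = OpenOptions::new()
        .write(true)
        .open("/proc/self/mem")
        .unwrap();
    mem.seek(SeekFrom::Start(addr)).unwrap();
    mem.write_all(&42i32.to_le_bytes()).unwrap();

    // Read back through a volatile load so the compiler cannot
    // assume `x` is still 1.
    let now = unsafe { std::ptr::read_volatile(&x) };
    assert_eq!(now, 42);
    println!("immutable local was rewritten to {now}");
}
```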

richfelker commented 2 years ago

Indeed. There are all sorts of system interfaces you can use to violate invariants. The user of the language should understand that if they do that, all bets are off and they get to keep both pieces. I don't see any strong justification for preventing doing it via wait when you can still do it via process_vm_writev.

cgwalters commented 2 years ago

(Coming here from https://github.com/tokio-rs/tokio/issues/3520 because I got bit badly by having two SIGCHLD handlers in the same process)

I strongly agree with Rich here; most of the benefit is gained if std just uses pidfd_open() on the plain pid, at least to start. In particular, that would allow async runtimes like tokio to use it to monitor the child process and avoid installing a global SIGCHLD handler, which is global mutable state that is often touched by other languages.

Any strong objections to a PR to do that?

the8472 commented 2 years ago

When pidfd_open is used in the parent process it's difficult to guarantee race-freedom, especially in the presence of other runtimes installing SIGCHLD handlers with unknown flags. The foolproof but complex way is to do it in the child process and then send the fd to the parent over a unix socket, as mentioned in a previous comment, similar to the pipe that's used to signal exec failures.

detly commented 2 years ago

@richfelker

As the parent of the child you just created, you are the one process who absolutely can know, with no ambiguity due to identifier reuse races, that the pid refers to the child process you created.

I'm looking into integrating pidfd support into an event loop library. When I look at the details of pidfd_open() it really seems like there'd be a race condition, but all I've found on it are unresolved arguments like this one.

Specifically, what is preventing:

  1. Parent spawns a subprocess
  2. Parent gets child's PID
  3. Parent gets suspended or blocked indefinitely
  4. Child completes or is killed
  5. PID is re-used for another process
  6. Parent resumes and calls pidfd_open() with the PID it got in step 2
  7. Parent now has pidfd for the wrong process

The argument I linked to above espouses using application-specific heuristics to avoid this. But that can't apply to library writers, and it still leaves the door open for coincidences (i.e. the heuristics matching the process that is re-using that PID).

What can a parent do after step 2 to avoid this?

cgwalters commented 2 years ago

Child completes or is killed
PID is re-used for another process

The child process will stay zombie (owning the PID which will not be reused) until the parent calls waitpid(). This has nothing to do with pidfd_open() at all really, it's how Unix has worked since...well, as long as I know.
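
This is straightforward to observe from stable Rust: between the child's exit and the parent's wait(), /proc/<pid>/stat reports state 'Z', and the pid cannot be recycled. A Linux-only sketch (assuming `true` exists on the system):

```rust
use std::process::Command;
use std::thread::sleep;
use std::time::Duration;

// Return the state field from /proc/<pid>/stat ('Z' = zombie).
fn proc_state(pid: u32) -> char {
    let stat = std::fs::read_to_string(format!("/proc/{pid}/stat")).unwrap();
    // The state is the first field after the parenthesised comm.
    let after = stat.rsplit(')').next().unwrap();
    after.trim_start().chars().next().unwrap()
}

// Poll until the child shows up as a zombie (bounded retries).
fn wait_for_zombie(pid: u32) -> bool {
    for _ in 0..50 {
        if proc_state(pid) == 'Z' {
            return true;
        }
        sleep(Duration::from_millis(50));
    }
    false
}

fn main() {
    let mut child = Command::new("true").spawn().unwrap();
    // The child exits quickly, but we have not reaped it yet,
    // so its pid entry stays around as a zombie.
    assert!(wait_for_zombie(child.id()));

    // Only after wait() is the pid released for reuse.
    child.wait().unwrap();
    println!("pid {} stayed reserved until wait()", child.id());
}
```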

detly commented 2 years ago

Ah thanks, I had completely overlooked that. Indeed the stdlib docs even mention that as a potential exhaustion risk. Oops.

the8472 commented 1 year ago

#113939 solves the clone3 issue by replacing it with pidfd_open in the child and then transferring the file descriptor.

With that, create_pidfd should now be usable in more environments.

the8472 commented 10 months ago

For anyone following along: with #117957, pidfds are now used by a few more methods, and two bugs have been fixed.

NobodyXu commented 10 months ago

The new pidfd_open + send-over-unix-socket approach uses fork AFAIK; it would be great if it could use vfork for better performance and lower memory usage.

the8472 commented 10 months ago

We have a posix_spawn path that uses a vfork-like clone() under the hood, but that requires glibc to implement pidfd support in posix_spawn, which hasn't happened yet. Using vfork directly would have issues: it's extremely tricky to use correctly and it doesn't call pthread_atfork handlers. Though I don't know if we guarantee that they're called on process spawning. Using clone3 directly would be even more troublesome since we can't use glibc after it's called; we'd have to do raw syscalls.

If @joshtriplett gets his iouring_spawn work into the kernel then we could use that instead and avoid all those problems.

NobodyXu commented 10 months ago

We have a posix_spawn path that uses a vfork-like clone() under the hood, but that requires glibc to implement pidfd support in posix_spawn, which hasn't happened yet.

Good to see that they have plans to add that!

Though to use it, I think Rust would have to bump the minimum glibc version? Otherwise it would have to check for the availability of the new functionality and fall back to the existing implementation.

Using vfork directly would have issues: it's extremely tricky to use correctly and it doesn't call pthread_atfork handlers. Though I don't know if we guarantee that they're called on process spawning.

Yes, but I think it would be the most efficient implementation, given that posix_spawn currently does not support pidfds.

Since posix_spawn internally uses clone on Linux and vfork on other Unix systems, I think it's possible to write it in Rust.

Using clone3 directly would be even more troublesome since we can't use glibc after it's called. We'd have to do raw syscalls.

Thanks, now I see why this is a problem.

IIRC, last time I tried calling the clone syscall directly it was really painful: not just the argument passing, but also having to manually set up a new stack.

If @joshtriplett gets his iouring_spawn work into the kernel then we could use that instead and avoid all those problems.

But it would require an io_uring to be initialised?

It would definitely be reasonable for an async runtime like tokio to use it, but for the stdlib it would mean a heavy initialisation cost, and I'm not sure that's reasonable for all use cases?

the8472 commented 10 months ago

Since posix_spawn internally uses clone on Linux and vfork on other Unix systems, I think it's possible to write it in Rust.

That's not necessarily true. glibc can use vfork more easily since it controls the libc state, just like glibc can use clone3 internally but user code can't. At the very least, in Rust we'd have to use the unstable #[ffi_returns_twice] attribute, and there are probably other problems with it, such as not touching anything up-stack or any global state.

But it would require an io_uring to be initialised? It would definitely be reasonable for an async runtime like tokio to use it, but for the stdlib it would mean a heavy initialisation cost, and I'm not sure that's reasonable for all use cases?

I assume that setting up a ring once and caching it isn't that expensive, especially in comparison to bringing up a whole process (consider all the stuff that happens after exec).

Anyway, using vfork is a separate topic; we should discuss it in a new issue or on zulip. The fork + exec code path already exists and is used for everything that posix_spawn doesn't cover, so this isn't exactly a novel issue.

NobodyXu commented 10 months ago

That's not necessarily true. glibc can use vfork more easily since it controls the libc state, just like glibc can use clone3 internally but user code can't. At the very least, in Rust we'd have to use the unstable #[ffi_returns_twice] attribute, and there are probably other problems with it, such as not touching anything up-stack or any global state.

Thanks, I didn't know #[ffi_returns_twice] is required.

I assume that setting up a ring once and caching it isn't that expensive, especially in comparison to bringing up a whole process (consider all the stuff that happens after exec).

True, but I think it would probably be better if it could be configured and made optional.

NobodyXu commented 9 months ago

The PR adds support for obtaining PidFds, but you can't actually do anything with them. Do we want additional methods on PidFd in std to wait, send signals, obtain the /proc directory etc. or should that be left to 3rd party crates?

I just opened https://github.com/tokio-rs/tokio/issues/6281 and realized that PidFd has no methods.

IMO having methods to wait and send signals is indeed useful, while other more exotic ones (e.g. process_madvise) are better left to third-party crates.

NobodyXu commented 8 months ago

glibc just added support for pidfd_spawn in glibc 2.39.

Once Rust bumps the minimum glibc to 2.39, I think we can use it instead, which will probably enable the use of vfork again.

cuviper commented 8 months ago

We can use it as a weak symbol even before we ever raise our minimum that far, just like we do for spawning with chdir.

the8472 commented 8 months ago

Alas, the released pidfd_spawn API lacks functionality that clone3 has, namely the ability to return pid and pidfd at the same time. And pidfd_getpid requires procfs. This can be worked around by pre-opening /proc/self before trying to use the new API but it does add some complexity and additional sources of failure.

brauner commented 3 months ago

Alas, the released pidfd_spawn API lacks functionality that clone3 has, namely the ability to return pid and pidfd at the same time. And pidfd_getpid requires procfs. This can be worked around by pre-opening /proc/self before trying to use the new API but it does add some complexity and additional sources of failure.

I'd be open to merging a patch that extends pidfs by adding an ioctl() that returns the pid associated with the pidfd.