rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
98.58k stars 12.74k forks source link

I/O safety forbids the "pass FD via env var" pattern (e.g., jobserver) #116059

Open RalfJung opened 1 year ago

RalfJung commented 1 year ago

In #114780 we have clarified what is meany by I/O safety. In particular:

//! To uphold I/O safety, it is crucial that no code acts on file descriptors it does not own or
//! borrow, and no code closes file descriptors it does not own. In other words, a safe function
//! that takes a regular integer, treats it as a file descriptor, and acts on it, is *unsound*.

This is unfortunately in conflict with a reasonably common pattern on Unix systems: execing a process with a file descriptor initialized somewhere, and setting an env var to tell it where to find the FD. This pattern is used, for instance, by the jobserver protocol. The jobserver crate hence just takes an integer from an env var, turns it into a file descriptor, and reads/writes to that -- a violation of I/O safety. Crates like cc hence are technically unsound when they expose a safe function that internally uses the jobserver crate. The potential concern here is that the env var might be wrong, the jobserver crate now acts on a random FD by someone else, and that someone might have relied on I/O safety to protect their FD from foreign influence.

Assuming that we don't want to tell the cc maintainer that cc::Build::new() must be unsafe, what shall we do about this? There's been a lot of prior discussion:

I'm aware of the following proposals to resolve this situation:

  1. "export" the safety burden to the environment: starting a process that uses cc and setting the env var the wrong way is violating a precondition of this process, and Rust's soundness guarantees do not apply. This is undermined by set_var though; even if we make that unsafe we'd have to phrase its safety contract very carefully if it want to go this route.
  2. Weaken the entire concept of I/O safety to no longer provide any protection against random FDs being read/written, and only protect against random FDs being closed.
  3. Develop some elaborate system of "initially existing global FDs" that cc/jobserver could use to be sure that the FD they work on already existed when the program started, and is not some other crate's private property. See here for a bit of elaboration on that idea.

We should pick one, or develop some alternative that provides a satisfying answer to how cc::Build::new() can be a sound function.

the8472 commented 1 year ago

Assuming that we don't want to tell the cc maintainer that cc::Build::new() must be unsafe, what shall we do about this?

Well, we could promote more sane protocols instead such as named fifos, file-descriptor passing via unix sockets or also putting the dev/ino in the environment variable to enable sanity checks in the child.

Then the sane ways are safe and the legacy approaches are unsafe.

RalfJung commented 1 year ago

I'm kind of assuming that this would take way too long to really prevent us from having to come up with a justification for what happens currently. Like, realistically, when could cc start refusing to use the old protocol (required to justify its soundness) -- no less than five years from now, I would assume.

(And that's assuming that someone will actually spend all the effort of doing that, and they will be able to convince the rest of the ecosystem to adopt the new approach. In particular the second point I am doubtful about.)

the8472 commented 1 year ago

The new jobserver protocol already exists, we didn't even have to invent it. We just need to works towards making it the default (yes, that'll take time) But it could still serve as precedent for future projects.

I'm kind of assuming that this would take way too long to really prevent us from having to come up with a justification for what happens currently.

Well, we can be pragmatic there and treat it like before_exec which should be unsafe but for backwards-compatibility reasons it isn't. Ideally we'd have #[deprecated_safe] for that but it looks like that's in limbo.

ChrisDenton commented 1 year ago

If anyone has a concrete proposal for cc supporting the new jobserver protocol, it would be useful to open an issue on the cc issue tracker.

ple1n commented 1 year ago

Well, we could promote more sane protocols instead such as named fifos, file-descriptor passing via unix sockets or also putting the dev/ino in the environment variable to enable sanity checks in the child.

There is nothing "insane" about passing FDs. They are perfectly fine and efficient. If your security means a loss of features, it's not good security. You can have zero-cost (relatively) security with type system and formal methods.

carbotaniuman commented 1 year ago

A good starting point would be taking a look at this PR: https://github.com/rust-lang/jobserver-rs/pull/57

RalfJung commented 1 year ago

You can have zero-cost (relatively) security with type system and formal methods.

There's nothing we can do inside Rust to make these protocols well-behaved. This is a cross-process communication problem; types within a programming language are not able to help (much) here. We'd need the OS itself to enforce interface types in processes.

Some interfaces are just inherently cursed and cannot be contained with language-level techniques, they need a more security-aware OS/platform design: /proc/self/mem, bare FD passing, mmap -- I'm sure the list goes on. Sometimes all formal methods can do is point out that an API is just not designed well enough for modular/compositional reasoning principles to apply.

If you want to discuss this further, please do so on Zulip; this issue is about what concretely we can do about the mess that is passing bare FDs around, not the philosophy of whether that interface is "sane" or not. This pattern is causing problems, as documented in the issue description. If you want to help, make concrete suggestions for what could be done. Asking us to use "type system and formal methods" is not helpful.

the8472 commented 1 year ago

Note that I was talking about the specific incarnation of an FD-passing protocol used by the make jobserver. Not about FD-passing as a general category. In the very same post I proposed more reliable alternatives.

RalfJung commented 1 year ago

I'm still pretty fond of the idea of having std collect and manage ownership of all initially existing FDs. I'm just not sure if it can be practically realized for cases such as cdylib... however, for things like cc::Build, something that works in Rust bin crates is sufficient, isn't it?

the8472 commented 1 year ago

cc uses the jobserver lib for this, which is also used in other crates (including rustc and cargo), some of those could be dynamic libs. The only difference is that jobserver does handle this correctly and marks the from_env method as unsafe, its caller in cc doesn't. Classic problem of wrapping an unsafe method in a safe one without upholding its contract. So I don't think a primordial file descriptor registry in std would help cc directly. It could help jobserver to provide a safer API in non-dynlib contexts, which then could then be used by cc.

But that's a lot of steps to make that work and file-descriptor-enumeration is kinda shaky as a cross-unix feature - e.g. procfs isn't part of POSIX and manually enumerating FDs at startup would be costly - and probably unsupportable for windows handles. Declaring the old version of the make protocol as permanently unsafe (which jobserver has acknowledged from day 1!) and steering people towards the newer version would eventually let cc have a safe API that only supports the new version and have the same Safety requirements as jobserver on another method to support the old one.

RalfJung commented 1 year ago

Declaring the old version of the make protocol as permanently unsafe (which jobserver has acknowledged from day 1!) and steering people towards the newer version would eventually let cc have a safe API that only supports the new version and have the same Safety requirements as jobserver on another method to support the old one.

I'd be more optimistic about this plan if it was just jobserver, but systemd socket activation uses a similar protocol AFAIK, and that's generally considered a "modern" platform. Is there even an alternative to FD passing there?

the8472 commented 1 year ago

systemd's C function for this (sd_listen_fds) at least checks the PID of the current process against those environment variables so this doesn't get foisted on entire process hierarchies and they won't go out of sync , which makes it a bit better. In Rust terms it would still have to be unsafe because either something needs to take ownership of the FD numbers exactly once at program startup or they must remain open for the program's lifetime.

We could ask the systemd devs whether they'd be willing to add another environment variable listing the inode numbers to increase verifiability.

Though I suppose there's still some conflict between the static-borrow vs. the owned usage model of those descriptors. A registry could arbitrate that...

carbotaniuman commented 1 year ago

I think in general we actually have to solve the problem, either by changing the model in some way, defining it as outside of Rust's scope, or just accepting willful unsoundness. There's a lot of crates that don't think about this, for example https://lib.rs/crates/sd-listen-fds, lots of old (and even new) APIs and protocols that just pass FDs to a number in an environment variable (or even just choose something arbitrary like 3).

We can try and solve the ecosystem issues that we're facing, but we do realistically have to accept that we're not going to change decades of Unix tradition and see how we're going to fit in with this.

I'm still pretty fond of the idea of having std collect and manage ownership of all initially existing FDs.

This is not really portable in a practical way - I suppose you can have pre-main code check F_GETFD on the entire possible range, but that does not seem like something we should be doing.

the8472 commented 1 year ago

The std-listen-fds crate does several questionable things. Afaict it allows you to call the method any number of times and yet each time it returns an OwnedFd. It transmutes (unsafe call and bypassing the FromRawFd contract!) to OwnedFd without any safety contract of its own. It also doesn't clear the environment and it doesn't set CLOEXEC. So it's worse than the C implementation.

I don't think poorly written unsafe code is a good justification for just throwing up our arms and giving up.

ChrisDenton commented 1 year ago

I'm still pretty fond of the idea of having std collect and manage ownership of all initially existing FDs. I'm just not sure if it can be practically realized for cases such as cdylib... however, for things like cc::Build, something that works in Rust bin crates is sufficient, isn't it?

How are you imaging this would work?

RalfJung commented 1 year ago

How are you imaging this would work?

jobserver could safely call borrow, and systemd socket activation looks like it would want to use take. std's own handing of stdin/stdout/stderr would also use borrow.

Fulgen301 commented 1 year ago

(and hope no larger FDs are used for this kind of FD passing)

Assuming a maximum FD number has historically turned out badly for fd_set.

iterate over all FDs to determine which ones have already existed at program startup

There is no way to take a snapshot of all FDs that have existed at program startup, only a way to enumerate the currently open FDs - which may sound like nitpicking, but enumeration would also pick up FDs that the libc implementation might be temporarily using in another thread it created, or other internal FDs that happened to be open at the time, and std assuming ownership of FDs it does not own would be a violation of IO safety itself.

The environment variable itself is just external untrusted input like any other input - whether I read the FD number from an environment variable or a file doesn't change its value. And a table wouldn't help against all forms of invalid input anyway - say I tell a program to use the FD 3 without it being a valid FD in the program. The program then opens a file, and as 3 is unused, it gets assigned to that file. A table wouldn't have helped because the only thing File could do is check the table for whether the return value of open already was in that table and panic if that was the case, which would mean every single function calling a system function that returns a FD would have to check the table, which isn't exactly zero-cost either.

RalfJung commented 1 year ago

Assuming a maximum FD number has historically turned out badly for fd_set.

That's not entirely the same situation, since we are only bounding the maximum FD number present at program startup.

Also, I'm not saying the solution is perfect. There are no great solutions here that I can see. We're merely looking for the least-bad solution. Is bounding the number to 1024 (or 64k or whatever we will pick) -- and only for cases where /proc or other enumeration methods are not available -- better or worse than risking soundness issues when a badly configured env var causes one piece of Rust code to clobber a file descriptor privately used by another piece of Rust code? We could also work with the Linux kernel towards having FD enumeration available as a systcall rather than just via /proc.

enumeration would also pick up FDs that the libc implementation might be temporarily using in another thread it created, or other internal FDs that happened to be open at the time

Yes, libc having other FDs open is a potential problem. We'd still get a hard guarantee on libc's that don't do that, and we could work with libc's that do do that to be able to distinguish their internal FDs from "primordial" FDs present at program startup. Do you know any libc that does things like that? (The libc opening and closing files before jumping to the Rust entry point would be fine. Only having any files still open, including in other threads, would be an issue.)

And a table wouldn't help against all forms of invalid input anyway - say I tell a program to use the FD 3 without it being a valid FD in the program. The program then opens a file, and as 3 is unused, it gets assigned to that file. A table wouldn't have helped

The table will help here, without File doing anything. The table will simply not contain FD 3 since the table was populated before the File got created, so when jobserver asks the table to borrow FD 3, it will get None.

which would mean every single function calling a system function that returns a FD would have to check the table, which isn't exactly zero-cost either.

You have misunderstood my proposal, this is not part of it.

sunfishcode commented 1 year ago

@RalfJung A problem with that registry design is that if Rust code be linked with C libraries, the C libraries could claim ownership of some of the incoming fds and close them, without expecting that it needs to notify Rust's std.

I'd like to probe approach (1) more, exporting the safety burden to the environment. The informal reasoning people use in practice is something like "the name SOMEPREFIX_FD is sufficiently qualified that anyone who gets or sets an environment variable with that name it can be assumed to be aware of the protocol", and that covers the set_var concern too. That may not seem satisfying, but the underlying problem seems to be a Unix idiom. With the examples that have come up, it's clear that this is a widespread expectation of how programs on Unix are expected to work.

To call from_raw_fd/borrow_raw on a fd number, one should have a proof that their safety conditions are upheld. If the number comes from an environment variable string, that implies one should have a proof that the string contains the number of an open fd. How could a proof of that be constructed? That's where we could say to developers "it's your responsibility to pick a sufficiently distinct environment variable name, and to monitor its usage in all adjacent code, and use PID checks if appropriate, to ensure that everyone using it is aware of the protocol". If they do that, then they can build their proof on that, and satisfy the existing from_raw_fd/borrow_raw safety constraint.

RalfJung commented 1 year ago

I'd like to probe approach (1) more, exporting the safety burden to the environment.

Just to be clear, what this means is we are deciding that we are okay with UB and soundness bugs when the environment (or other parts of the program) sets these env vars the wrong way. We are not expecting Rust code to be resilient against such environment issues.

This does leave me quite uneasy.

bjorn3 commented 1 year ago

The parent process could have made a copy of the executable it is going to spawn, modify it in a way that will cause UB and then execute this copy. Or it could ptrace it even on systems that restrict arbitrary ptracing through the yama LSM ptrace_scope option (default on Ubuntu I believe). In other words you already have to trust the parent process.

RalfJung commented 1 year ago

You are comparing scenarios that are not even remotely comparable. "It's UB if you modify the binary" and "it's UB if some environment variable has a wrong value" are not even in the same league, those scenarios are solar systems apart. Setting an environment variable is a perfectly normal every-day thing for everyone using the shell or spawning processes. Modifying the executable image is orders of magnitude less common. Same for attaching a tracer that actually alters program behavior.

Nobody will be surprised if you can cause UB by putting a program into gdb and just changing some variables. Many people (myself included) will be surprised if you can cause UB by running MAKEFLAGS="--jobserver-auth=5,6" ./my-program. I would generally expect programs to be resilient to user errors like that.

carbotaniuman commented 1 year ago

Yeah, it's not great that an environment variable can cause UB in a program, but the environment itself is a hostile places for Rust programs, so I'm not sure where the line is (hence the opsem proposal just opened).

I will note we have a classic trilemma here: safe abstractions to mmap (or similar) are allowed, jobserver and similar (Unix mail, some web hosting stuff, etc) are allowed, and having Rust code be resilient to environment variable shenanigans[^1][^2].

There are potentially other solutions that get us out of the trilemma, but I am deeply skeptical of any of them actually working for Rust users, but more importantly not becoming a landmine for FFI ala set_var. I do think that all options of the trilemma will ultimately be quite wide-reaching, so there are not really any easy solutions here alas.

[^1]: I am defining "shenanigans" here as common use in the way that Ralf probably means it, and am handwaving the /dev/fd discussion I brought up earlier. [^2]: The options for the trilemma are (in order) approach 2 in the top comment, status quo, approach 1 in the top comment.

the8472 commented 1 year ago

This isn't about a hostile environment though. It's possible to accidentally let the file descriptors and the environment variables get out of sync within a process hierarchy. All it takes is a single process in the process hierarchy that closes or renumbers the descriptors at some point before spawning a child but doesn't change the environment for the child to be faced with potentially-UB-causing inputs. I think I somewhere saw a jobserver-related issue reporting that this actually happened in the wild.

I'd put mmap and fd-passing in different categories. mmap is a standardized OS-provided primitive which has some fundamental prerequisites to be used safely and while those prerequisites are hard to check or enforce strictly they're quite reasonable and also required for correctness and robustness of many other things. Shaping rules around that makes sense.

Fd-passing on the other hand is a landscape of possible protocols. Which provides the flexibility of drawing a line around some areas in that landscape and saying "these are unsafe" and leaving others for safe use.

I don't see how saying that something is unsafe is onerous. This is basically FFI, just through a weird interface.

sunfishcode commented 1 year ago

Yes, exporting the safety burden to the environment is not great. But, it's How Unix Has Always Worked and it doesn't add any costs, limitations, or opinions to every program, whether it wants them or not.

I'd also like to bring up the idea of a structured environment-variable API for consideration. Perhaps such an API could look something like this:

#[derive(IncomingValues)]
struct Config {
    #[value(parse_var = from_raw_decimal_fd, var = "SOME_FD", pid_var = "SOME_PID")]
    file: File,
}

fn main(config: Config) {
    ...
}

or in a lib.rs:

static CONFIG: Config = incoming_values!();

I don't have a complete design to propose, but the idea is: a design in this direction has the potential to completely encapsulate the safety burden. Any costs, limitations, or opinions it has could be opt-in. It could be part of a broader plan to make set_var unsafe if we wanted. It could be a way to nudge developers to add PID-check variables like systemd's LISTEN_PID which can help catch problems. And it could coexist with C libraries or pre-existing Rust code that doesn't know it needs to coordinate with a fd registry.

If we think of that as the path forward, then perhaps it's easier to see how "export the safety burden to the environment" makes sense to do here now, because that's no longer the last word on the subject.

RalfJung commented 1 year ago

Having every program written in C and having Undefined Behavior left and right is How Unix Has Always Worked, so I am not persuaded by that argument. The entire raison d`être of Rust is to raise the bar for the level of safety in systems programming. Allowing programs to be UB when an env var has the wrong integer in it seems like giving up on a notable chunk of that goal.

sunfishcode commented 1 year ago

@RalfJung Could I ask you to share your thoughts on the rest of my post as well?

RalfJung commented 1 year ago

I didn't entirely understand the API you were proposing. What are the underlying semantics, which operations are safe, what are the soundness promises and assumptions? Is the idea that instead if a table of "primordial" FDs, we have some declaration for which primordial FDs we are interested in and then std will hand them to us?

Potential problems with this:

sunfishcode commented 1 year ago

Yes, the idea is that all operations would be safe. And there may be other ways to arrange the lib.rs use case.

But also, yes, a more complete design for this feature would imply arbitrary parsing code running before main. That's a problem I hadn't anticipated. I don't yet have any ideas for how to avoid that.

On the other hand, if we go with the fd registry, I don't know how we could avoid breaking C libraries and existing Rust code that claims incoming fds without coordinating with the registry.

RalfJung commented 1 year ago

On the other hand, if we go with the fd registry, I don't know how we could avoid breaking C libraries and existing Rust code that claims incoming fds without coordinating with the registry.

Your approach doesn't really solve that either, does it? Existing Rust and external C code can still just claim an FD that was also passed in via Config. There is no solving this, code that accesses raw FDs without coordination is inherently incompatible with the idea of I/O safety.

The major problems of the registry of primordial FDs that I can see are:

The second point could possibly be avoided by taking a hint from your proposal: instead of having access to the FD registry via global functions in std, we could change main to receive an &FdRegistry argument. Compared to your proposal that would avoid running arbitrary code before main, at the cost of not always supporting arbitrary FD numbers.

jmillikin commented 7 months ago

Assuming that we don't want to tell the cc maintainer that cc::Build::new() must be unsafe,

I might have missed this part of the discussion, but I don't understand how the behavior of that function can be considered safe or sound if it behaves as described in this thread. Isn't the only possible solution that satisfies Rust safety goals to change the cc crate such that the unsafe FD construction from env variables is moved out of the safe call graph?

In other words, a bare cc::Build::new() can't support --jobserver-auth=5,6. Support for that option must be gated by an unsafe fn Build::posix_jobserver_auth(&mut self) or some such method.

This isn't a matter of being "technically" unsound, it's a quite significant soundness bug in cc and should be fixed there rather than trying to encode that particular bug into the Rust semantics.

NobodyXu commented 7 months ago

Assuming that we don't want to tell the cc maintainer that cc::Build::new() must be unsafe,

FYI, cc::Build::new() does not call jobserver::Client::from_env.

It's cc::Build::{compile, try_compile, compile_intermediates, try_compile_intermediates} which could call it, if feature parallel is enabled.

We could, introduce a new function to cc::Build, to disable use of jobserver::Client::from_env, and another function to set the preferred parallelism.

I'm a bit reluctant to adding a function to provide the jobserver::Client to cc::Build, since it hasn't reached 1.0 yet, though neither cc has, and both jobserver and cc seems to avoid breaking change.