rust-lang / libs-team

std::os::unix::env::{argc, argv} #348

Open rtfeldman opened 4 months ago

rtfeldman commented 4 months ago

Proposal

Problem statement

When making FFI calls from Rust on UNIX targets, it's common to need NUL-terminated UTF-8 strings. The same is true of NUL-terminated Widechar strings on Windows FFI calls. If these strings are obtained from environment variables or process arguments, on both UNIX and Windows targets, they already exist in the required format in the process's memory. Unfortunately, in today's Rust, there is no way in std to access these in their original formats without paying for heap allocations, traversals, and/or syscalls.

Today in std the only ways to access these values are via VarsOs and ArgsOs, both of which are iterators over OsString values. These strings are not in the original format; they have been reallocated and had their NUL terminators dropped, meaning that further allocations and conversions are necessary to get them back into their original form.
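To make the round trip concrete, here is a minimal sketch (UNIX-only, and not taken from the proposal itself) of what it takes today to get from `std::env::args_os()` back to NUL-terminated strings. The helper names `to_c_string` and `args_as_c_strings` are hypothetical, chosen only for illustration.

```rust
use std::ffi::{CString, OsString};
use std::os::unix::ffi::OsStringExt;

/// Hypothetical helper: turn one owned argument (as yielded by
/// `std::env::args_os()`) back into a NUL-terminated string for FFI.
/// Returns `None` if the argument contains an interior NUL byte.
fn to_c_string(arg: OsString) -> Option<CString> {
    // `into_vec` hands over the existing buffer, but `CString::new` still has
    // to scan every byte for interior NULs and append the terminator, which
    // may reallocate.
    CString::new(arg.into_vec()).ok()
}

/// Hypothetical helper: collect every process argument as a `CString`.
/// Each element was already allocated once by std when it built the
/// `OsString`; the original NUL-terminated bytes are never reused.
fn args_as_c_strings() -> Vec<CString> {
    std::env::args_os().filter_map(to_c_string).collect()
}
```

Calling `as_ptr()` on each resulting `CString` gives the `*const c_char` an FFI call would want, at the cost of the allocation and traversal described above.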

On Windows, these allocations and conversions can be avoided through an unsafe direct FFI call to GetCommandLineW. There is an equivalent for this on some UNIX systems (e.g. macOS) but on others, there is no direct FFI call which exposes these. The only way to access them is through syscalls like reading /proc/self/cmdline on Linux or sysctlbyname on FreeBSD.

Motivating examples or use cases

I have a command-line application which:

Today, there is no zero-cost way to access these in Rust; the lowest-cost way that's available on each of these OSes is:

Solution sketch

Introduce these OS-specific functions to a new module, std::os::unix::env:

fn argc() -> usize;
fn argv() -> *const *const c_char;

These functions would read from these atomics, which is why they do not need to take &self.

Today, these atomics are not exposed, and there is no direct FFI-based workaround to access the values they hold. That's in part because they rely on non-standard link_section extensions. So there's no way to write a crate in userspace for these today.

For symmetry, it would seem reasonable to introduce this function to a new OS-specific module, std::os::windows::env:

fn args_widechar() -> *const *const u16;

This would be implemented as a call to GetCommandLineW, and would only be there for symmetry with the proposed std::os::unix::env, so that Windows programs wouldn't need FFI to do something that UNIX programs could do using std.

Alternatives

These functions could use CStr over *const c_char, but then they would have to be unsafe because CStr requires that the pointers be non-null, which is not a guarantee in this case. Additionally, since the motivation for this is FFI, the CStrs would likely need to be converted into *const c_chars anyway, so overall CStr seems both unsafe and unhelpful here.

It might sound reasonable to have a function which returns a slice instead of separate functions for argc and argv. However, as a comment in the current UNIX args implementation notes, argc is not necessarily an accurate length for argv, meaning that building a safe slice would require traversing the argv until a null pointer is encountered—which would be undesirable given that the motivation for this use case is to avoid overhead.
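As a rough illustration of the traversal that a safe slice would require, the sketch below walks a NULL-terminated pointer array and counts its entries without consulting argc. The function name `count_until_null` is hypothetical, and the safety contract in its doc comment is an assumption about what callers would have to guarantee.

```rust
use std::ffi::c_char;

/// Hypothetical helper: count the entries of a NULL-terminated pointer array
/// such as `argv`, ignoring `argc` entirely.
///
/// # Safety
/// `argv` must be non-null and point to an array of pointers that ends with a
/// null pointer, and nothing may mutate the array while it is being walked.
unsafe fn count_until_null(argv: *const *const c_char) -> usize {
    let mut len = 0;
    // SAFETY: the caller guarantees a terminating null entry, so every read
    // up to and including that entry stays inside the array.
    while unsafe { !(*argv.add(len)).is_null() } {
        len += 1;
    }
    len
}
```

The cost is one pointer read per argument, which is small but is exactly the kind of up-front work the proposal is trying to avoid.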

As an alternative, it could make sense to have an Iterator which iterates over argv until it encounters a null, and uses argc for a size_hint only. Another alternative would be to use Option<NonNull<...>> instead of *const, to emphasize that all the pointers could be null. However, in FFI use cases, the FFI APIs will be asking for raw pointers, so having access to the raw pointers is more helpful than having an Option<NonNull<...>>, and especially more helpful than an iterator.
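The iterator alternative could look roughly like the sketch below. The type `ArgvIter` and its constructor are hypothetical and not part of the proposal; how to report argc through `size_hint` is also a judgment call here, since argc may disagree with the real length of argv.

```rust
use std::ffi::c_char;

/// Hypothetical iterator over a NULL-terminated `argv` array. It stops at the
/// first null pointer and uses `argc` only as an estimate.
struct ArgvIter {
    next: *const *const c_char,
    hint: usize,
}

impl ArgvIter {
    /// # Safety
    /// `argv` must be non-null and point to a NULL-terminated array of
    /// pointers that stays valid and unmodified while the iterator is used.
    unsafe fn new(argc: usize, argv: *const *const c_char) -> Self {
        ArgvIter { next: argv, hint: argc }
    }
}

impl Iterator for ArgvIter {
    type Item = *const c_char;

    fn next(&mut self) -> Option<Self::Item> {
        // SAFETY: by the contract of `new`, `self.next` points at an entry at
        // or before the terminating null.
        let ptr = unsafe { *self.next };
        if ptr.is_null() {
            // Stay on the terminator so repeated calls keep returning None.
            return None;
        }
        // SAFETY: the entry just read was not the terminator, so the next
        // entry is still inside the array.
        self.next = unsafe { self.next.add(1) };
        self.hint = self.hint.saturating_sub(1);
        Some(ptr)
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        // Best-effort estimate only: this is wrong whenever argc and argv
        // disagree, so it must never be relied on for memory safety.
        (self.hint, None)
    }
}
```

Termination depends only on the null entry, so a stale or incorrect argc affects at most how much capacity a consumer such as `collect` reserves.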

So it seems like the minimal proposal here would be to expose the raw pointers, and then optionally an iterator convenience method could be discussed on top of that.

Links and related work

There are various OS-specific functions in std::os already, like std::os::unix::fs::chown.

Related threads:

programmerjake commented 4 months ago

Related: on Linux at least, common programs (e.g. sshd) are known to write over their argv strings, since that's how they change the name they show up as in the list of processes.

ChrisDenton commented 4 months ago

Note that if argv is mutated while Rust is collecting the arguments into a Vec, then bad things can happen. On platforms that don't allow getting argv/argc except via main this is currently mitigated by std keeping them to itself (i.e. it takes ownership).

If exposing these publicly, we would at a minimum want to strongly warn against mutating globally shared resources.

dead-claudia commented 4 months ago

If exposing these publicly, we would at a minimum want to strongly warn against mutating globally shared resources.

Typing argv as *const *const c_char IMHO already suggests these aren't mutable references. A docs note is still useful though, but only so people know argc != (argv..).take_while(|p| !p.is_null()).count() as per that above docs comment.

m-ou-se commented 4 months ago

We discussed this in last week's libs-api meeting, but we didn't reach a consensus.

The main argument against adding these is the unclear ownership of the data these pointers point at.

Should argv() return *mut *mut (rather than *const *const) to match the type in C, since one of the possible use cases is overwriting the data (as mentioned by @programmerjake)? In that case, how could we document the safety requirements? Would we guarantee it's fine unless std::env::args[_os] is used? Is that future proof?

It might seem like this can all be avoided by making argv() return *const *const (as proposed in this ACP), to make it clear these are not mutable (as also suggested by @dead-claudia). However, that would prevent us from ever adding something like std::env::set_process_name() (or std::os::linux::set_process_name() or whatever), since that could race with any use of those *const argv pointers.

One way of looking at it: std has basically "taken ownership" of argc+argv. Perhaps it'd be cleaner to have a way to release ownership or to intercept them before it takes ownership. At least then the ownership story is clearer.

In the meeting we were wondering if your problem could be solved using a (future) language feature that allows writing your own (C-style) entry point that takes the original argc and argv from libc (the entry point that is normally provided by std that then calls your main). Then you'd be able to do with those argc+argv whatever you want and pass it on to something like std::initialize_runtime after you're done with them, passing on ownership to std.

m-ou-se commented 4 months ago

I personally think that having std::os::unix::env::{argc, argv} as proposed is fine, as long as we find a way to clearly document when these can be used safely. I guess they can never be used safely within a (safe) library, since it cannot know what other threads are doing. I'd be curious to see what the safety documentation on argv() would look like.

rtfeldman commented 4 months ago

In the meeting we were wondering if your problem could be solved using a (future) language feature that allows writing your own (C-style) entry point that takes the original argc and argv from libc (the entry point that is normally provided by std that then calls your main). Then you'd be able to do with those argc+argv whatever you want and pass it on to something like std::initialize_runtime after you're done with them, passing on ownership to std.

For my use case, that would work great! I kind of assumed something like that would be an unreasonably large change to propose. 😄

If I understand correctly, that design would also work with no_std, yeah? In that you'd just write main that way and then decline to run std::initialize_runtime (since it wouldn't be available).

Amanieu commented 4 months ago

You can already write your own C main function on stable with the #![no_main] attribute:

#![no_main]

use std::ffi::{c_char, c_int};

#[no_mangle]
extern "C" fn main(argc: c_int, argv: *mut *mut c_char) -> c_int {
    // argc and argv arrive here exactly as the C runtime passed them.
    0
}

The only downside is that you skip some of the initialization code normally run by the standard library, but this initialization code is optional (Rust shared libraries work fine without this code).

rtfeldman commented 4 months ago

The only downside is that you skip some of the initialization code normally run by the standard library, but this initialization code is optional (Rust shared libraries work fine without this code).

Yeah, unfortunately executables do need it (at least as far as I know!)

ChrisDenton commented 4 months ago

Executables do not need to use the standard library entry point. See https://github.com/rust-lang/rust/blob/d31b6fb8c06b43536ac5be38462d2a55784e2199/library/std/src/sys/pal/unix/mod.rs#L43 if you're interested in what it does on *nix platforms.

jmillikin commented 2 months ago

Minor Windows note: I think the type would be fn args_widechar() -> *const u16, because Windows command-line args are a single string. Splitting them into C-style argc/argv is performed by CommandLineToArgvW() if desired.