rust-lang / libs-team


ACP: Method to split OsStr into `(str, OsStr)` #114

Closed jmillikin closed 4 months ago

jmillikin commented 2 years ago

Proposal

Add a method to std::ffi::OsStr that splits it into (&str /* prefix */, &OsStr /* suffix */), where the prefix contains valid Unicode and the suffix is the portion of the original input that is not valid Unicode.

Problem statement

The OsStr type is designed to represent a platform-specific string, which might contain non-Unicode content. It has a to_str(&self) -> Option<&str> method to check the string is valid Unicode, but this method operates only on the entire string. It's not currently possible to check that a portion of the string is valid Unicode in an OS-independent way.

This proposal would add a method that lets an OsStr be split into a prefix of valid Unicode, and a suffix of remaining non-Unicode content in the platform encoding.
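A minimal sketch of the limitation described above, assuming a Unix target so that `std::os::unix::ffi::OsStrExt::from_bytes` can build a non-Unicode value. The helper names `sample_arg` and `demo_whole_string_check` are made up for illustration.

```rust
// Unix-only sketch: OsStrExt::from_bytes lets us build a non-Unicode OsStr.
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt;

/// Hypothetical input: `--input=` followed by the byte 0xFF, which can
/// never appear in valid UTF-8.
fn sample_arg() -> &'static OsStr {
    OsStr::from_bytes(b"--input=\xFF")
}

/// Shows that `OsStr::to_str` checks the string as a whole.
fn demo_whole_string_check() {
    // A fully valid string converts without trouble.
    assert_eq!(OsStr::new("--input=ok").to_str(), Some("--input=ok"));
    // A single invalid trailing byte makes the whole conversion fail,
    // even though the first eight bytes ("--input=") are plain ASCII.
    assert_eq!(sample_arg().to_str(), None);
}
```

Nothing in the result of `to_str()` says how much of the input was valid, which is the gap this proposal targets.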

Motivation, use-cases

Command-line options (long flags)

One of the common use cases for OsStr is parsing command-line options, a possible format of which has a prefix (the option name) and a suffix (the option value):

Unix syntax:

# prefix: "--input="
# suffix: contents of $PATH_TO_INPUT
my_rust_program --input="${PATH_TO_INPUT}"

Windows syntax:

REM prefix: "/input:"
REM suffix: contents of %PATH_TO_INPUT%
my_rust_program /input:%PATH_TO_INPUT%

When parsing CLI options, the user wants to match option names against values provided by the program, but preserve option values as they are for use with OS APIs.

// returns Some(option_value) if `opt` starts with `--{flag_name}=`
fn opt_starts_with<'a>(opt: &'a OsStr, flag_name: &str) -> Option<&'a OsStr>;

This function is easy to implement on Unix because std::os::unix::ffi::OsStrExt provides free conversion to and from &[u8], which can be compared with the UTF-8 bytes of the flag name.
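For illustration, one way the Unix-only version could look. This is a sketch rather than code from the proposal; it relies on `OsStrExt::as_bytes` and `OsStrExt::from_bytes` and compares raw bytes.

```rust
// Unix-only sketch: OsStr and &[u8] convert freely in both directions.
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt;

/// Returns `Some(option_value)` if `opt` starts with `--{flag_name}=`.
fn opt_starts_with<'a>(opt: &'a OsStr, flag_name: &str) -> Option<&'a OsStr> {
    // Strip "--", then the flag name, then "=", failing fast on a mismatch.
    let rest = opt.as_bytes().strip_prefix(b"--")?;
    let rest = rest.strip_prefix(flag_name.as_bytes())?;
    let value = rest.strip_prefix(b"=")?;
    // The remaining bytes are the value, preserved exactly as given.
    Some(OsStr::from_bytes(value))
}
```

The value is returned as a borrowed slice of the original argument, so no allocation is needed and non-Unicode bytes pass through untouched.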

However, on Windows, it's basically impossible to implement in safe Rust: the Windows variant of OsStrExt only provides encode_wide(), an Iterator<Item = u16>, and has no mechanism for constructing a borrowed non-Unicode OsStr at all.

Command-line options (short flags)

A less ubiquitous but still commonly used format for "short options" on Unix systems allows multiple flags to be put into one option:

# the following two calls *may* be equivalent
$ my_rust_program -xyz"${PATH_TO_INPUT}"
$ my_rust_program -x -y -z "${PATH_TO_INPUT}"

Supporting this requires being able to obtain a str with prefix "-xyz". The args library then uses flag definitions to tokenize it into one of ["-x", "-y", "-z"], ["-x", "-y", "z"], or ["-x", "yz"].
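The three tokenizations above can be sketched as follows. The `ShortToken` type, the `tokenize_short` function, and the `takes_value` callback (standing in for the flag definitions) are all hypothetical names, not part of any existing args library.

```rust
/// One token produced from a clustered short-flag argument.
#[derive(Debug, PartialEq)]
enum ShortToken<'a> {
    /// A flag that takes no value, such as `-x`.
    Unary(char),
    /// A flag with the rest of the cluster attached as its value.
    WithValue(char, &'a str),
}

/// Tokenizes the Unicode part of an argument such as "-xyz".
/// `takes_value` reports whether a flag consumes the remainder of the
/// cluster as its value.
fn tokenize_short<'a>(arg: &'a str, takes_value: impl Fn(char) -> bool) -> Vec<ShortToken<'a>> {
    let mut tokens = Vec::new();
    // Arguments that do not start with '-' produce no tokens here.
    let Some(cluster) = arg.strip_prefix('-') else {
        return tokens;
    };
    for (idx, flag) in cluster.char_indices() {
        if takes_value(flag) {
            // Everything after this flag is its value, not more flags.
            let value = &cluster[idx + flag.len_utf8()..];
            tokens.push(ShortToken::WithValue(flag, value));
            return tokens;
        }
        tokens.push(ShortToken::Unary(flag));
    }
    tokens
}
```

With no value-taking flags, "-xyz" yields three unary tokens; if `y` takes a value it yields `-x` plus `y` with value "z"; if `x` takes a value it yields `x` with value "yz". A real parser would also append any non-Unicode suffix to the value and fall back to the next argument when the attached value is empty.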

Solution sketches

Add a method OsStr::to_str_split() (or whatever name folks prefer) that returns the valid Unicode prefix and the non-Unicode remainder.

// std::ffi
impl OsStr {
    pub fn to_str_split(&self) -> (&str, &OsStr);
}

Rules for the new function:

let orig_osstr: &OsStr = /* any input */;
let (prefix, suffix) = orig_osstr.to_str_split();

// If the original OsStr is entirely valid Unicode, then the prefix
// contains the full string content and the suffix is empty.
if let Some(valid_unicode) = orig_osstr.to_str() {
    assert_eq!(prefix, valid_unicode);
    assert!(suffix.is_empty());
}

// The original OsStr can be losslessly reconstructed from the
// (prefix, suffix) pair.
let mut owned = OsString::from(prefix);
owned.push(suffix);
assert_eq!(owned, orig_osstr);

// If the original OsStr does not start with valid Unicode, the prefix
// will be empty and the suffix is the unchanged original OsStr.
if prefix.is_empty() {
    assert_eq!(suffix, orig_osstr);
}

Note that the calling code would be responsible for handling inputs where the flag value itself is partially Unicode. For example, on Unix all absolute paths start with the ASCII character '/', so the start of a path value would end up in the prefix.
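To make the rules concrete, here is a Unix-only stand-in for the proposed method, written as a free function on top of `OsStrExt::as_bytes` and `Utf8Error::valid_up_to`. It is a sketch under those assumptions, not the proposed std implementation, and `check_rules` is a hypothetical helper that replays the assertions listed above.

```rust
use std::ffi::{OsStr, OsString};
use std::os::unix::ffi::OsStrExt;

/// Unix-only stand-in for the proposed `OsStr::to_str_split`.
fn to_str_split(s: &OsStr) -> (&str, &OsStr) {
    let bytes = s.as_bytes();
    // `valid_up_to` is the length of the longest valid UTF-8 prefix.
    let split_at = match std::str::from_utf8(bytes) {
        Ok(_) => bytes.len(),
        Err(err) => err.valid_up_to(),
    };
    let (valid, rest) = bytes.split_at(split_at);
    let prefix = std::str::from_utf8(valid).expect("prefix was just validated");
    (prefix, OsStr::from_bytes(rest))
}

/// Replays the three rules for one input.
fn check_rules(orig: &OsStr) {
    let (prefix, suffix) = to_str_split(orig);
    // Rule 1: fully valid input ends up entirely in the prefix.
    if let Some(valid_unicode) = orig.to_str() {
        assert_eq!(prefix, valid_unicode);
        assert!(suffix.is_empty());
    }
    // Rule 2: prefix + suffix reconstructs the input losslessly.
    let mut owned = OsString::from(prefix);
    owned.push(suffix);
    assert_eq!(owned.as_os_str(), orig);
    // Rule 3: an empty prefix means the suffix is the whole input.
    if prefix.is_empty() {
        assert_eq!(suffix, orig);
    }
}
```

In this sketch the suffix starts at the first invalid byte and may itself contain valid Unicode later on; only the leading valid run is returned as `&str`.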

Links and related work

The general topic of examining OsStr for prefixes comes up often. A selection of related issues/PRs:

There is an open issue for supporting the Pattern API on OsStr, but (1) it's a significantly larger amount of implementation work, and (2) it doesn't appear to allow extracting the prefix without unwrap.

What happens now?

This issue is part of the libs-api team API change proposal process. Once this issue is filed, the libs-api team will review open proposals in its weekly meeting. You should receive feedback within a week or two.

jmillikin commented 2 years ago

Proposed implementation: https://github.com/jmillikin/upstream__rust/commit/c0ea2a4c4fb5c60908d580bafcc61466fff1b99f

Docs screenshots (images not included): to_str_split, into_string_split

glandium commented 1 year ago

The same for &[u8] would be useful too.

the8472 commented 1 year ago

https://github.com/rust-lang/rust/pull/95290 should be relevant. Once the underlying bytes are exposed this could be implemented by user code.

jmillikin commented 1 year ago

PR rust-lang/rust#95290 (exposing raw WTF-8 bytes on Windows) isn't appealing to me, and even if such an API would be created I would want to have to_str_split() so that I didn't have to implement WTF-8 decoding in my own binaries.

jmillikin commented 1 year ago

Gentle ping -- I'm still interested in this, is there any interest from the libs-api team regarding OsStr unicode prefix splitting?

the8472 commented 1 year ago

(exposing raw WTF-8 bytes on Windows) isn't appealing to me, and even if such an API would be created I would want to have to_str_split() so that I didn't have to implement WTF-8 decoding in my own binaries.

Once you get the raw bytes you can use regular string APIs such as str::from_utf8, which gives you either the full &str or an error that carries the necessary information to construct the rest. We have discussed adding some convenience methods to the error to make this kind of stuff simpler. That would be more general because it isn't limited to OsStr specifically and can be used for processing other [u8] sources too.
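As an illustration of the information that error carries, the sketch below walks an arbitrary byte slice using `Utf8Error::valid_up_to` and `Utf8Error::error_len`. The `Piece` type and `utf8_pieces` function are hypothetical names, and this only covers plain `[u8]` input; it does not address how to get back to an `&OsStr`.

```rust
use std::str;

/// A piece of a byte slice: either valid UTF-8 or an invalid byte run.
#[derive(Debug, PartialEq)]
enum Piece<'a> {
    Valid(&'a str),
    Invalid(&'a [u8]),
}

/// Splits `bytes` into alternating valid and invalid pieces.
fn utf8_pieces(mut bytes: &[u8]) -> Vec<Piece<'_>> {
    let mut pieces = Vec::new();
    while !bytes.is_empty() {
        match str::from_utf8(bytes) {
            Ok(valid) => {
                pieces.push(Piece::Valid(valid));
                break;
            }
            Err(err) => {
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                if !valid.is_empty() {
                    // Re-validating a known-good prefix cannot fail.
                    pieces.push(Piece::Valid(str::from_utf8(valid).unwrap()));
                }
                // `error_len` is None when the input ends mid-character;
                // treat the whole remainder as invalid in that case.
                let bad_len = err.error_len().unwrap_or(rest.len());
                let (bad, tail) = rest.split_at(bad_len);
                pieces.push(Piece::Invalid(bad));
                bytes = tail;
            }
        }
    }
    pieces
}
```

For example, the bytes `ab`, `0xFF`, `cd` come back as a valid piece "ab", an invalid piece holding the single byte `0xFF`, and a valid piece "cd".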

jmillikin commented 1 year ago

I might be misunderstanding, but I don't think that would solve the problem because (to the best of my knowledge) there is no way to convert a &[u8] to &OsStr on Windows. It would also require me to have #[cfg(os)] lines in my code to handle the different internal representations (UTF-8 vs WTF-8), which is a portability hazard.

In other words, how would you implement fn to_str_split(s: &OsStr) -> (&str, &OsStr) on top of the API you propose? Again, I might not be understanding you properly, but I can't think of any way to implement such an API on Windows.

mina86 commented 1 year ago

Not only that. Even if I could convert to and from &OsStr easily, if a user passes --file=/path/to/file<bad-utf>name the method will give me ("--file=/path/to/file", "<bad-utf>name"). I can check that the argument starts with --file but that still leaves me with ("/path/to/file", "<bad-utf>name") which I cannot concatenate without unsafe shenanigans or memory allocation.

jmillikin commented 1 year ago

@mina86 For trimming prefixes and suffixes from an &OsStr, I think you'd want something more like the Pattern methods being tracked in https://github.com/rust-lang/rust/issues/49802. I'm not aware of anyone currently working on that functionality.

This ACP is tracking a different request, which is the ability to obtain the valid Unicode portion of an &OsStr.

mina86 commented 1 year ago

This ACP is tracking a different request, which is the ability to obtain the valid Unicode portion of an &OsStr.

The motivation for this ACP is argument parsing. I don’t see how to_str_split is a good solution for that. For example, assuming to_str_split existed in the form described here, could you demonstrate an implementation of opt_starts_with?

jmillikin commented 1 year ago

opt_starts_with is just an example of an API that's not possible to implement without being able to obtain the unicode prefix. The actual code I'm writing is a bit more complex (it's a port of https://github.com/jmillikin/haskell-options/blob/master/lib/Options/Tokenize.hs).

The output type of the tokenizer step function looks approximately like this:

pub struct Arg<'a>(
    #[cfg(all(target_family = "unix", not(feature = "std")))]
    &'a [u8],
    #[cfg(all(target_family = "windows", not(feature = "std")))]
    &'a str,
    #[cfg(feature = "std")]
    Cow<'a, OsStr>,
);

pub enum Token<'a, FlagId> {
    Arg(Arg<'a>),
    Flag(FlagName<'a>, FlagId, Arg<'a>),
    FlagUnary(FlagName<'a>, FlagId),
    FlagHelp(Option<&'a str>),
}

To separate a single OsStr into unary and non-unary tokens requires consuming it as a sequence of Unicode characters until encountering either a = (long flag case) or a non-unary short flag. The set of possible flag names is maintained in a separate data structure.

When parsing an OsStr arg, if it contains only valid Unicode then it's returned in a Cow::Borrowed. If there's a non-Unicode suffix then I allocate an OsString, push the two fragments into it, and return it in a Cow::Owned.
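A rough sketch of that borrow-or-allocate step for the long flag case, assuming the `(prefix, suffix)` pair has already been obtained from a `to_str_split`-style method. The names `join_value` and `split_long_flag` are hypothetical and this is not the tokenizer's actual code. Borrowing the suffix when the Unicode part is empty is an extra shortcut not mentioned in the description above.

```rust
use std::borrow::Cow;
use std::ffi::{OsStr, OsString};

/// Joins the Unicode part of a value with its non-Unicode suffix,
/// allocating only when both parts are non-empty.
fn join_value<'a>(unicode_part: &'a str, suffix: &'a OsStr) -> Cow<'a, OsStr> {
    if suffix.is_empty() {
        // Entirely valid Unicode: borrow the str as an OsStr.
        Cow::Borrowed(OsStr::new(unicode_part))
    } else if unicode_part.is_empty() {
        // Nothing to prepend: the suffix alone is the value.
        Cow::Borrowed(suffix)
    } else {
        // Both fragments present: allocate and push them together.
        let mut owned = OsString::from(unicode_part);
        owned.push(suffix);
        Cow::Owned(owned)
    }
}

/// Splits `--name=value` into the flag name and its value, given the
/// already-split Unicode prefix and non-Unicode suffix of the argument.
fn split_long_flag<'a>(prefix: &'a str, suffix: &'a OsStr) -> Option<(&'a str, Cow<'a, OsStr>)> {
    let (name, value) = prefix.strip_prefix("--")?.split_once('=')?;
    Some((name, join_value(value, suffix)))
}
```

For an input like `--file=/path/to/file<bad-utf>name`, this matches the flag name on the `&str` side and pays for one allocation to rebuild the value, which is the trade-off raised earlier in the thread.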

jmillikin commented 1 year ago

FYI: I created a PR for the proposed implementation branch: https://github.com/rust-lang/rust/pull/111059

I'm not sure what the exact ordering is of ACP <-> unstable PR, but maybe having the implementation code available for review will help when reading the ACP.