jmillikin commented 2 years ago

Proposal

Add a method to std::ffi::OsStr that splits it into (&str /* prefix */, &OsStr /* suffix */), where the prefix contains valid Unicode and the suffix is the portion of the original input that is not valid Unicode.

Problem statement

The OsStr type is designed to represent a platform-specific string, which might contain non-Unicode content. Its to_str(&self) -> Option<&str> method checks whether the string is valid Unicode, but it operates only on the entire string. There is currently no OS-independent way to check that a portion of the string is valid Unicode.

This proposal would add a method that lets an OsStr be split into a prefix of valid Unicode, and a suffix of remaining non-Unicode content in the platform encoding.

Motivation, use-cases

Command-line options (long flags)

One of the common use cases for OsStr is parsing command-line options, a possible format of which has a prefix (the option name) and a suffix (the option value):

Unix syntax:

# prefix: "--input="
# suffix: contents of $PATH_TO_INPUT
my_rust_program --input="${PATH_TO_INPUT}"

Windows syntax:

REM prefix: "/input:"
REM suffix: contents of %PATH_TO_INPUT%
my_rust_program /input:%PATH_TO_INPUT%

When parsing CLI options, the user wants to match option names against values provided by the program, but preserve option values as they are for use with OS APIs.

// returns Some(option_value) if `opt` starts with `--{flag_name}=`
fn opt_starts_with<'a>(opt: &'a OsStr, flag_name: &str) -> Option<&'a OsStr>;

This function is easy to implement on Unix because std::os::unix::ffi::OsStrExt provides free conversion to and from &[u8], which can be compared with the UTF-8 bytes of the flag name.
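For reference, a minimal sketch of the Unix-only version, assuming the `--{flag_name}=` format described above. It relies on OsStrExt::as_bytes and OsStrExt::from_bytes, so it will not compile on Windows.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only: as_bytes() and from_bytes()

/// Returns Some(option_value) if `opt` starts with `--{flag_name}=`.
fn opt_starts_with<'a>(opt: &'a OsStr, flag_name: &str) -> Option<&'a OsStr> {
    let prefix = format!("--{flag_name}=");
    // Compare raw bytes: the value after the prefix may be arbitrary
    // non-UTF-8 data, so it is never converted to `str`.
    let value = opt.as_bytes().strip_prefix(prefix.as_bytes())?;
    // Borrow the remaining bytes as an OsStr without copying.
    Some(OsStr::from_bytes(value))
}
```

The returned value borrows from the original argument, so no allocation is needed beyond the temporary prefix string.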

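However, on Windows, it's basically impossible to implement in safe Rust: the Windows variant of OsStrExt only provides encode_wide(), an iterator of u16 code units, and there is no mechanism for borrowing part of a non-Unicode OsStr as an &OsStr. (OsStringExt::from_wide can build a new owned OsString, but that requires allocating a copy.)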
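To illustrate the gap, here is a rough sketch of the closest workaround I'm aware of on Windows today. The portable helper `strip_wide_prefix` and the `opt_starts_with` wrapper are hypothetical names, not std APIs. The wrapper has to return an owned OsString rather than a borrowed &OsStr, and it allocates twice.

```rust
/// Portable core: if `wide` begins with the UTF-16 encoding of `prefix`,
/// returns the remaining code units (which may be ill-formed UTF-16).
fn strip_wide_prefix<'a>(wide: &'a [u16], prefix: &str) -> Option<&'a [u16]> {
    let prefix_units: Vec<u16> = prefix.encode_utf16().collect();
    wide.strip_prefix(prefix_units.as_slice())
}

/// Windows-only wrapper (hypothetical helper). One allocation collects the
/// code units, a second one builds the returned OsString.
#[cfg(windows)]
fn opt_starts_with(opt: &std::ffi::OsStr, flag_name: &str) -> Option<std::ffi::OsString> {
    use std::os::windows::ffi::{OsStrExt, OsStringExt};
    let wide: Vec<u16> = opt.encode_wide().collect();
    let value = strip_wide_prefix(&wide, &format!("--{flag_name}="))?;
    Some(std::ffi::OsString::from_wide(value))
}
```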

Command-line options (short flags)

A less ubiquitous but still commonly used format for "short options" on Unix systems allows multiple flags to be put into one option:

# the following two calls *may* be equivalent
$ my_rust_program -xyz"${PATH_TO_INPUT}"
$ my_rust_program -x -y -z "${PATH_TO_INPUT}"

Supporting this requires being able to obtain a str with prefix "-xyz". The args library then uses flag definitions to tokenize it into one of ["-x", "-y", "-z"], ["-x", "-y", "z"], or ["-x", "yz"].
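A rough sketch of that tokenization step, once the "-xyz" prefix is available as a str. The function name `tokenize_short` and the `takes_value` parameter are made up for illustration; a real args library would consult its own flag definitions.

```rust
/// Splits a short-option cluster such as "-xyz" into tokens.
/// `takes_value` lists the flags that consume the rest of the cluster
/// as their value (a hypothetical stand-in for real flag definitions).
fn tokenize_short(cluster: &str, takes_value: &[char]) -> Vec<String> {
    let body = match cluster.strip_prefix('-') {
        Some(body) => body,
        None => return Vec::new(), // not a short-option cluster
    };
    let mut tokens = Vec::new();
    for (idx, flag) in body.char_indices() {
        tokens.push(format!("-{flag}"));
        if takes_value.contains(&flag) {
            // Everything after a value-taking flag is its value.
            let rest = &body[idx + flag.len_utf8()..];
            if !rest.is_empty() {
                tokens.push(rest.to_string());
            }
            break;
        }
    }
    tokens
}
```

With no value-taking flags this yields ["-x", "-y", "-z"]; if `y` takes a value it yields ["-x", "-y", "z"]; if `x` takes a value it yields ["-x", "yz"].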

Solution sketches

Add a method OsStr::to_str_split() (or whatever name folks prefer) that returns the valid Unicode prefix and the non-Unicode remainder.

// std::ffi
impl OsStr {
    pub fn to_str_split(&self) -> (&str, &OsStr);
}

Rules for the new function:

let orig_osstr: &OsStr = /* any input */;
let (prefix, suffix) = orig_osstr.to_str_split();

// If the original OsStr is entirely valid Unicode, then the prefix
// contains the full string content and the suffix is empty.
if let Some(valid_unicode) = orig_osstr.to_str() {
    assert_eq!(prefix, valid_unicode);
    assert!(suffix.is_empty());
}

// The original OsStr can be losslessly reconstructed from the
// (prefix, suffix) pair.
let mut owned = OsString::from(prefix);
owned.push(suffix);
assert_eq!(owned, orig_osstr);

// If the original OsStr does not begin with valid Unicode, the prefix
// will be empty and the suffix is the unchanged original OsStr.
if prefix.is_empty() {
    assert_eq!(suffix, orig_osstr);
}
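
To make the rules concrete, here is a Unix-only approximation written as a free function. It is a sketch under the assumption that the platform encoding is raw bytes (as on Unix), and it is not the proposed implementation; the real method would need to work on every platform.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only stand-in for the proposed method

/// Free-function approximation of the proposed `OsStr::to_str_split()`.
fn to_str_split(s: &OsStr) -> (&str, &OsStr) {
    let bytes = s.as_bytes();
    // `valid_up_to()` is the length of the longest valid UTF-8 prefix.
    let split_at = match std::str::from_utf8(bytes) {
        Ok(all) => return (all, OsStr::new("")),
        Err(err) => err.valid_up_to(),
    };
    let (valid, rest) = bytes.split_at(split_at);
    let prefix = std::str::from_utf8(valid).expect("prefix validated by valid_up_to");
    (prefix, OsStr::from_bytes(rest))
}
```

Both halves borrow from the input, so the split itself does not allocate.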

Note that the calling code would be responsible for handling inputs where the flag value itself begins with valid Unicode. For example, on Unix all absolute paths start with the ASCII character '/', so the leading part of such a value would land in the prefix rather than the suffix.
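
As an illustration of what that handling might look like, the sketch below takes the (prefix, suffix) pair that the proposed method would return and reassembles the flag value. The helper name `flag_value` is hypothetical. Since the value can straddle both halves, this version returns an owned OsString.

```rust
use std::ffi::{OsStr, OsString};

/// Given the (prefix, suffix) pair that the proposed `to_str_split()`
/// would return, extracts the value of `--{flag_name}=...`.
fn flag_value(prefix: &str, suffix: &OsStr, flag_name: &str) -> Option<OsString> {
    let flag = format!("--{flag_name}=");
    // Any valid-Unicode start of the value (e.g. "/tmp/") is still in `prefix`.
    let value_start = prefix.strip_prefix(flag.as_str())?;
    // Rejoin it with the non-Unicode remainder to recover the full value.
    let mut value = OsString::from(value_start);
    value.push(suffix);
    Some(value)
}
```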

Links and related work

The general topic of examining OsStr for prefixes comes up often. A selection of related issues/PRs:

There is an open issue for supporting the Pattern API on OsStr, but (1) it's a significantly larger amount of implementation work, and (2) it doesn't appear to allow extracting the prefix without unwrap.

https://github.com/rust-lang/rust/issues/49802

What happens now?

This issue is part of the libs-api team API change proposal process. Once this issue is filed, the libs-api team will review open proposals in its weekly meeting. You should receive feedback within a week or two.

jmillikin commented 2 years ago

Proposed implementation: https://github.com/jmillikin/upstream__rust/commit/c0ea2a4c4fb5c60908d580bafcc61466fff1b99f

Docs screenshots: to_str_split, into_string_split

glandium commented 1 year ago

The same for &[u8] would be useful too.