rust-lang / libs-team

The home of the library team
Apache License 2.0
115 stars 18 forks source link

ACP: A substring API for `OsStr` #306

Closed blyxxyz closed 9 months ago

blyxxyz commented 9 months ago

Proposal

Problem statement

OsStr and OsString provide access to their (unspecified) internal byte encoding using {as,into}_encoded_bytes() and from_encoded_bytes_unchecked() methods. It's possible to convert an OS string to bytes, and to convert bytes to an OS string.

However, from_encoded_bytes_unchecked() is unsafe, and there is no universal way to validate the safety invariants. Some common string operations (splitting, trimming, replacing) are impossible to implement without unsafe code.

New APIs should ideally discourage relying on any details of the internal encoding (which may be unstable and not meant for interchange).

Motivating examples or use cases

Argument parsers need to extract substrings from command line arguments. For example, --option=somefilename needs to be split into option and somefilename, and the original filename must be preserved without sanitizing it.

clap currently implements strip_prefix and split_once using transmute (equivalent to the stable encoded_bytes APIs).

lexopt (my own crate) currently uses the platform-specific APIs, but I'd like to move to the encoded_bytes API eventually. unsafe is holding me back since I have working code already and I think some of my users would consider it a regression.

The os_str_bytes and osstrtools crates provides high-level string operations for OS strings. os_str_bytes is in the wild mainly used to convert between raw bytes and OS strings (e.g. 1, 2, 3). osstrtools enables reasonable uses of split() to parse $PATH and replace() to fill in command line templates.

Solution sketch

I propose a method to take a substring of an OsStr, based on offsets into the result of as_encoded_bytes(). On Unix any slice (within bounds) would be valid. On Windows this method would panic if the string is not cut on UTF-8 boundaries, exactly like the requirements of from_encoded_bytes_unchecked:

Due to the encoding being self-synchronizing, the bytes from OsStr::as_encoded_bytes can be split either immediately before or immediately after any valid non-empty UTF-8 substring.

Note that this is stricter than the actual internal encoding of OS strings on Windows.

Proposed signature:

impl OsStr {
    /// Takes a substring based on a range that corresponds to the return value of
    /// [`OsStr::as_encoded_bytes`].
    ///
    /// The range's start and end must lie on valid `OsStr` boundaries.
    ///
    /// On Unix any boundaries are valid, as OS strings may contain arbitrary bytes.
    ///
    /// On other platforms such as Windows the internal encoding is currently
    /// unspecified, and a valid `OsStr` boundary is one of:
    /// - The start of the string
    /// - The end of the string
    /// - Immediately before a valid non-empty UTF-8 substring
    /// - Immediately after a valid non-empty UTF-8 substring
    ///
    /// # Panics
    ///
    /// Panics if `range` does not lie on valid `OsStr` boundaries or if it
    /// exceeds the end of the string.
    ///
    /// # Example
    ///
    /// ```
    /// use std::ffi::OsStr;
    ///
    /// let os_str = OsStr::new("foo=bar");
    /// let bytes = os_str.as_encoded_bytes();
    /// if let Some(index) = bytes.iter().position(|b| *b == b'=') {
    ///     let key = os_str.slice_encoded_bytes(..index);
    ///     let value = os_str.slice_encoded_bytes(index + 1..);
    ///     assert_eq!(key, "foo");
    ///     assert_eq!(value, "bar");
    /// }
    /// ```
    fn slice_encoded_bytes<R: RangeBounds<usize>>(&self, range: R) -> &Self;
}

Examples

A proof of concept is implemented here: os_str_slice.rs

With an example port of lexopt to this API: https://github.com/blyxxyz/lexopt/commit/80778517f0054d391d99a787c4ee55f535ca273f

It should be trivial to port clap's transmuting operations to this API, since they all take substrings of an OsStr.

A string replace function can be implemented using OsStr::slice_encoded_bytes() and OsString::push():

fn replace(source: &OsStr, from: &str, to: &OsStr) -> OsString {
    assert!(!from.is_empty());

    let from = from.as_bytes();
    let mut result = OsString::new();
    let bytes = source.as_encoded_bytes();
    let mut prev = 0;
    let mut index = 0;
    while index < bytes.len() {
        if bytes[index..].starts_with(from) {
            result.push(source.slice_encoded_bytes(prev..index));
            result.push(to);
            index += from.len();
            prev = index;
        } else {
            index += 1;
        }
    }
    result.push(source.slice_encoded_bytes(prev..));
    result
}

Notice that it does require the needle to be non-empty UTF-8.

Guarding the internal encoding

This solution has two attractive properties:

Any API that converts from bytes to an OS string would not have these advantages.

Behavior on niche platforms

The current behavior of OS strings is only documented prominently for Unix and for Windows.

All other platforms reuse the OS string internals of either of these platforms. Almost all of the Unix-alikes expose OsStrExt/OsStringExt, which means they specify the internal encoding to be arbitrary bytes.

Only wasm uses Unix OS strings without the extension traits. I couldn't find the history of this, but it's probably intentional. JavaScript uses potentially ill-formed UTF-16, like Windows, but WebAssembly can be used in other environments as well, so there is no obvious single set of semantics. (There is some loosely related discussion at https://github.com/rustwasm/wasm-bindgen/issues/1348.)

In order to keep the slicing invariants encapsulated it might be necessary to create a third OS string implementation to be used by wasm. Since this platform can currently only legally construct OS strings from UTF-8 strings the implementation could be backed by str/String, with the side benefit of free conversion back into UTF-8 strings (currently UTF-8 validation is performed). The implementation of Unix OS strings is simple, and this implementation would be even simpler.

Alternatives

The method can be implemented using existing APIs, see the proof of concept above. But the fact that it's a minimal safe API that can be used to implement higher-level opinionated operations makes it a natural fit for the standard library.

Links and related work

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

Second, if there's a concrete solution:

BurntSushi commented 9 months ago

Accepted.

cc @epage

epage commented 9 months ago

@blyxxyz (since there isn't a tracking issue yet)

The method could enforce the same requirements on all platforms, including Unix (where the internal encoding is specified to be arbitrary bytes).

Personally, I would prefer this alternative so that this API focused on the documented user-facing invariants rather than the per-platform-specific undocumented invariants.

blyxxyz commented 9 months ago

@epage That was my initial preference as well, but I changed my mind while writing the proposal. The internal encoding for Unix is documented and exposed and locked-in through the extension traits, so it feels overly pedantic to go out of our way and perform these checks anyway.

I did really like fuzz testing on Unix and knowing that it would cover any future encoding on any platform. But getting the logic right didn't turn out to be very hard anyway.

epage commented 9 months ago

This came up when I was working on the encoded_bytes API (iirc in a very side conversation so none of what I say represents the libs api team endorsing this by merging that feature). We could document the per-platform behavior for the other encoded_bytes APIs but I felt it better to keep things simple from the user perspective and say that those users need to use OsStrExt for that.

This did inform the approach to the conversions docs takes which is that encoded_bytes can only offer cross-platform guarantees and that OsStrExt is still relevant for per-platform guarantees.

blyxxyz commented 9 months ago

It's definitely easier to explain, yeah. I wrote "on Unix" in the example doc comment but that's already not what I'm proposing.

It's also less work to implement, so I'll start with that and we can get more input later.

Would it be acceptable to relax the requirements after stabilization?

epage commented 9 months ago

From my understanding, it can be relaxed in the future as going from a panicking state to non-panicking shouldn't be breaking.