blyxxyz commented 9 months ago

Proposal

Problem statement

OsStr and OsString provide access to their (unspecified) internal byte encoding using {as,into}_encoded_bytes() and from_encoded_bytes_unchecked() methods. It's possible to convert an OS string to bytes, and to convert bytes to an OS string.

However, from_encoded_bytes_unchecked() is unsafe, and there is no universal way to validate the safety invariants. Some common string operations (splitting, trimming, replacing) are impossible to implement without unsafe code.

New APIs should ideally discourage relying on any details of the internal encoding (which may be unstable and not meant for interchange).

Motivating examples or use cases

Argument parsers need to extract substrings from command line arguments. For example, --option=somefilename needs to be split into option and somefilename, and the original filename must be preserved without sanitizing it.

clap currently implements strip_prefix and split_once using transmute (equivalent to the stable encoded_bytes APIs).

lexopt (my own crate) currently uses the platform-specific APIs, but I'd like to move to the encoded_bytes API eventually. unsafe is holding me back since I have working code already and I think some of my users would consider it a regression.

The os_str_bytes and osstrtools crates provides high-level string operations for OS strings. os_str_bytes is in the wild mainly used to convert between raw bytes and OS strings (e.g. 1, 2, 3). osstrtools enables reasonable uses of split() to parse $PATH and replace() to fill in command line templates.

Solution sketch

I propose a method to take a substring of an OsStr, based on offsets into the result of as_encoded_bytes(). On Unix any slice (within bounds) would be valid. On Windows this method would panic if the string is not cut on UTF-8 boundaries, exactly like the requirements of from_encoded_bytes_unchecked:

Due to the encoding being self-synchronizing, the bytes from OsStr::as_encoded_bytes can be split either immediately before or immediately after any valid non-empty UTF-8 substring.

Note that this is stricter than the actual internal encoding of OS strings on Windows.

Proposed signature:

impl OsStr {
    /// Takes a substring based on a range that corresponds to the return value of
    /// [`OsStr::as_encoded_bytes`].
    ///
    /// The range's start and end must lie on valid `OsStr` boundaries.
    ///
    /// On Unix any boundaries are valid, as OS strings may contain arbitrary bytes.
    ///
    /// On other platforms such as Windows the internal encoding is currently
    /// unspecified, and a valid `OsStr` boundary is one of:
    /// - The start of the string
    /// - The end of the string
    /// - Immediately before a valid non-empty UTF-8 substring
    /// - Immediately after a valid non-empty UTF-8 substring
    ///
    /// # Panics
    ///
    /// Panics if `range` does not lie on valid `OsStr` boundaries or if it
    /// exceeds the end of the string.
    ///
    /// # Example
    ///
    /// ```
    /// use std::ffi::OsStr;
    ///
    /// let os_str = OsStr::new("foo=bar");
    /// let bytes = os_str.as_encoded_bytes();
    /// if let Some(index) = bytes.iter().position(|b| *b == b'=') {
    ///     let key = os_str.slice_encoded_bytes(..index);
    ///     let value = os_str.slice_encoded_bytes(index + 1..);
    ///     assert_eq!(key, "foo");
    ///     assert_eq!(value, "bar");
    /// }
    /// ```
    fn slice_encoded_bytes<R: RangeBounds<usize>>(&self, range: R) -> &Self;
}

Examples

A proof of concept is implemented here: os_str_slice.rs

With an example port of lexopt to this API: https://github.com/blyxxyz/lexopt/commit/80778517f0054d391d99a787c4ee55f535ca273f

It should be trivial to port clap's transmuting operations to this API, since they all take substrings of an OsStr.

A string replace function can be implemented using OsStr::slice_encoded_bytes() and OsString::push():

fn replace(source: &OsStr, from: &str, to: &OsStr) -> OsString {
    assert!(!from.is_empty());

    let from = from.as_bytes();
    let mut result = OsString::new();
    let bytes = source.as_encoded_bytes();
    let mut prev = 0;
    let mut index = 0;
    while index < bytes.len() {
        if bytes[index..].starts_with(from) {
            result.push(source.slice_encoded_bytes(prev..index));
            result.push(to);
            index += from.len();
            prev = index;
        } else {
            index += 1;
        }
    }
    result.push(source.slice_encoded_bytes(prev..));
    result
}

Notice that it does require the needle to be non-empty UTF-8.

Guarding the internal encoding

This solution has two attractive properties:

It does not make it any easier to use the internal encoding for interchange (which WTF-8 prohibits).
It can enforce stricter requirements than the actual internal encoding, which makes it hard for user code to rely on unspecified details.

Any API that converts from bytes to an OS string would not have these advantages.

Behavior on niche platforms

The current behavior of OS strings is only documented prominently for Unix and for Windows.

All other platforms reuse the OS string internals of either of these platforms. Almost all of the Unix-alikes expose OsStrExt/OsStringExt, which means they specify the internal encoding to be arbitrary bytes.

Only wasm uses Unix OS strings without the extension traits. I couldn't find the history of this, but it's probably intentional. JavaScript uses potentially ill-formed UTF-16, like Windows, but WebAssembly can be used in other environments as well, so there is no obvious single set of semantics. (There is some loosely related discussion at https://github.com/rustwasm/wasm-bindgen/issues/1348.)

In order to keep the slicing invariants encapsulated it might be necessary to create a third OS string implementation to be used by wasm. Since this platform can currently only legally construct OS strings from UTF-8 strings the implementation could be backed by str/String, with the side benefit of free conversion back into UTF-8 strings (currently UTF-8 validation is performed). The implementation of Unix OS strings is simple, and this implementation would be even simpler.

Alternatives

A checked, safe from_encoded_bytes(&[u8]) method. This is a natural counterpart to from_encoded_bytes_unchecked(&[u8]). But:
- It would encourage relying on the unspecified internal encoding and using it for interchange.
- It would require validating the entire string, which may take O(n) time, while a slicing method can run in O(1) time.
- It would require implementing a WTF-8 validator, which to my knowledge doesn't exist yet. (The original wtf8 crate deliberately does not implement this.)
A split_at(&self, pos: usize) -> (&OsStr, &OsStr) style method. This is ostensibly simpler than a slicing API. But:
- The OMG-WTF-8 encoding does not support this. It requires some splits to produce overlapping slices, so that left.len() + right.len() > orig.len(). The encoded bytes on one side of pos might form a valid OMG-WTF-8 string while those on the other side do not.
  - Even if we fudge it by shifting the boundary this prohibits a mutable variant of the method, since overlapping mutable references are unsound. A mutable version of slice_encoded_bytes() would be unproblematic.
    - (The only currently existing mutating methods on OsStr are make_ascii_lowercase() and make_ascii_uppercase().)
  - This would only become a concern if we switch to this (or a similar) encoding and start specifying it.
  - RFC #2295 that proposes this encoding was merged five years ago (but has not been implemented).
The method could return None rather than panicking.
- Incorrect offsets would typically imply a bug or an assumption about the internal encoding. I can't off-hand think of any uses for this.
The method could enforce the same requirements on all platforms, including Unix (where the internal encoding is specified to be arbitrary bytes).
- This would make it easier to test that user logic is encoding-independent.
  - Writing encoding-independent logic is not very hard to begin with, though.
The method could panic only if even the unspecified invariants of the internal encoding are violated.
- This would encourage depending on the details of the internal encoding.
The method could instead be an Index impl for OsStr.
- This would perhaps be too prominent.
- It would be harder to find the documentation with the exact requirements.
- It may make sense to add this if/when OsStr has a richer API that allows determining offsets without calling as_encoded_bytes().

The method can be implemented using existing APIs, see the proof of concept above. But the fact that it's a minimal safe API that can be used to implement higher-level opinionated operations makes it a natural fit for the standard library.

Links and related work

os_str_bytes
osstrtools
RFC #2295
rust-lang/rust#111544
114
rust-lang/rust#109350

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

We think this problem seems worth solving, and the standard library might be the right place to solve it.
We think that this probably doesn't belong in the standard library.

Second, if there's a concrete solution:

We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.

BurntSushi commented 9 months ago

Accepted.

cc @epage

epage commented 9 months ago

@blyxxyz (since there isn't a tracking issue yet)

The method could enforce the same requirements on all platforms, including Unix (where the internal encoding is specified to be arbitrary bytes).

Personally, I would prefer this alternative so that this API focused on the documented user-facing invariants rather than the per-platform-specific undocumented invariants.

blyxxyz commented 9 months ago

@epage That was my initial preference as well, but I changed my mind while writing the proposal. The internal encoding for Unix is documented and exposed and locked-in through the extension traits, so it feels overly pedantic to go out of our way and perform these checks anyway.

I did really like fuzz testing on Unix and knowing that it would cover any future encoding on any platform. But getting the logic right didn't turn out to be very hard anyway.

epage commented 9 months ago

This came up when I was working on the encoded_bytes API (iirc in a very side conversation so none of what I say represents the libs api team endorsing this by merging that feature). We could document the per-platform behavior for the other encoded_bytes APIs but I felt it better to keep things simple from the user perspective and say that those users need to use OsStrExt for that.

This did inform the approach to the conversions docs takes which is that encoded_bytes can only offer cross-platform guarantees and that OsStrExt is still relevant for per-platform guarantees.

blyxxyz commented 9 months ago

It's definitely easier to explain, yeah. I wrote "on Unix" in the example doc comment but that's already not what I'm proposing.

It's also less work to implement, so I'll start with that and we can get more input later.

Would it be acceptable to relax the requirements after stabilization?

epage commented 9 months ago

From my understanding, it can be relaxed in the future as going from a panicking state to non-panicking shouldn't be breaking.

rust-lang / libs-team

ACP: A substring API for `OsStr` #306

Proposal

Problem statement

Motivating examples or use cases

Solution sketch

Examples

Guarding the internal encoding

Behavior on niche platforms

Alternatives

Links and related work

114

What happens now?

Possible responses