rust-lang / libs-team

The home of the library team
Apache License 2.0
110 stars 18 forks source link

ACP: Pattern methods for `OsStr` without `OsStr` patterns #311

Open epage opened 7 months ago

epage commented 7 months ago

Proposal

Problem statement

With rust-lang/rust#115443, developers, like those writing CLI parsers, can now perform (limited) operations on OsStr but it requires unsafe to get an OsStr back, requiring the developer to understand and follow some very specific safety notes that cannot be checked by miri.

RFC #2295 exists for improving this but its been stalled out. The assumption here is that part of the problem with that RFC is how wide its scope is and that by shrinking the scope, we can get some benefits now.

Motivating examples or use cases

Mostly copied from #306

Argument parsers need to extract substrings from command line arguments. For example, --option=somefilename needs to be split into option and somefilename, and the original filename must be preserved without sanitizing it.

clap currently implements strip_prefix and split_once using transmute (equivalent to the stable encoded_bytes APIs).

The os_str_bytes and osstrtools crates provides high-level string operations for OS strings. os_str_bytes is in the wild mainly used to convert between raw bytes and OS strings (e.g. 1, 2, 3). osstrtools enables reasonable uses of split() to parse $PATH and replace() to fill in command line templates.

Solution sketch

Provide strs Pattern-accepting methods on &OsStr.

Defer out OsStr being used as a Pattern and OsStr indexing support which are specified in RFC #2295.

Example of methods to be added:

impl OsStr {
    pub fn contains<'a, P>(&'a self, pat: P) -> bool
    where
        P: Pattern<&'a Self>;

    pub fn starts_with<'a, P>(&'a self, pat: P) -> bool
    where
        P: Pattern<&'a Self>;

    pub fn ends_with<'a, P>(&'a self, pat: P) -> bool
    where
        P: Pattern<&'a Self>,
        P::Searcher: ReverseSearcher<&'a Self>;

    pub fn find<'a, P>(&'a self, pat: P) -> Option<usize>
    where
        P: Pattern<&'a Self>;

    pub fn rfind<'a, P>(&'a self, pat: P) -> Option<usize>
    where
        P: Pattern<&'a Self>,
        P::Searcher: ReverseSearcher<&'a Self>;

    // (Note: these should return a concrete iterator type instead of `impl Trait`.
    //  For ease of explanation the concrete type is not listed here.)
    pub fn split<'a, P>(&'a self, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>;

    pub fn split_inclusive<'a, P>(&'a self, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>;

    pub fn rsplit<'a, P>(&'a self, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>,
        P::Searcher: ReverseSearcher<&'a Self>;

    pub fn split_terminator<'a, P>(&'a self, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>;

    pub fn rsplit_terminator<'a, P>(&'a self, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>,
        P::Searcher: ReverseSearcher<&'a Self>;

    pub fn splitn<'a, P>(&'a self, n: usize, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>;

    pub fn rsplitn<'a, P>(&'a self, n: usize, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>,
        P::Searcher: ReverseSearcher<&'a Self>;

    pub fn split_once<'a, P>(&'a self, delimiter: P) -> Option<(&'a Self, &'a Self)>where
        P: Pattern<&'a Self>;

    pub fn rsplit_once<'a, P>(&'a self, delimiter: P) -> Option<(&'a Self, &'a Self)>where
        P: Pattern<&'a Self>;

    pub fn matches<'a, P>(&'a self, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>;

    pub fn rmatches<'a, P>(&self, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>,
        P::Searcher: ReverseSearcher<&'a Self>;

    pub fn match_indices<'a, P>(&self, pat: P) -> impl Iterator<Item = (usize, &'a Self)>
    where
        P: Pattern<&'a Self>;

    pub fn rmatch_indices<'a, P>(&self, pat: P) -> impl Iterator<Item = (usize, &'a Self)>
    where
        P: Pattern<&'a Self>,
        P::Searcher: ReverseSearcher<&'a Self>;

    pub fn trim_matches<'a, P>(&'a self, pat: P) -> &'a Self
    where
        P: Pattern<&'a Self>,
        P::Searcher: DoubleEndedSearcher<&'a Self>;

    pub fn trim_start_matches<'a, P>(&'a self, pat: P) -> &'a Self
    where
        P: Pattern<&'a Self>;

    pub fn strip_prefix<'a, P>(&'a self, prefix: P) -> Option<&'a Self> where
    P: Pattern<&'a Self>;

    pub fn strip_suffix<'a, P>(&'a self, prefix: P) -> Option<&'a Self> where
    P: Pattern<&'a Self>;

    pub fn trim_end_matches<'a, P>(&'a self, pat: P) -> &'a Self
    where
        P: Pattern<&'a Self>,
        P::Searcher: ReverseSearcher<&'a Self>;

    pub fn replace<'a, P>(&'a self, from: P, to: &'a Self) -> Self::Owned
    where
        P: Pattern<&'a Self>;

    pub fn replacen<'a, P>(&'a self, from: P, to: &'a Self, count: usize) -> Self::Owned
    where
        P: Pattern<&'a Self>;
}

impl Pattern<&OsStr> for char {}
impl Pattern<&OsStr> for &str {}
impl Pattern<&OsStr> for &String {}
impl Pattern<&OsStr> for &[char] {}
impl Pattern<&OsStr> for &&str {}
impl<const N: usize> Pattern<&OsStr> for &[char; N] {}
impl<F: FnMut(char) -> bool> Pattern<&OsStr> for F {}
impl<const N: usize> Pattern<&OsStr> for [char; N] {}

This should work because

From an API design perspective, there is strong precedence for it

Alternatives

306 proposes a OsStr::slice_encoded_bytes

Links and related work

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

Second, if there's a concrete solution:

joshtriplett commented 7 months ago

This seems like a good idea to me. Putting these methods on OsStr will allow code to do simple parsing/splitting/etc in safe code. And because this does not add OsStr to Pattern, it remains a simple addition to API surface without any representation changes.

pitaj commented 7 months ago

What implements Pattern<&OsStr>?

epage commented 7 months ago

I've updated the proposal to call that out

impl Pattern<&OsStr> for &str {}
impl Pattern<&OsStr> for char {}
impl Pattern<&OsStr> for &[char] {}
impl<F: FnMut(char) -> bool> Pattern<&OsStr> for F {}
impl Pattern<&OsStr> for &&str {}
impl Pattern<&OsStr> for &String {}

Basically, this is a direct mirror of &strs functionality. I don't think I've seen this called out anywhere in the previous RFCs but my assumption is non-UTF8 byte sequences in OsStr are just considered non-matches.

pitaj commented 7 months ago
impl<F: FnMut(char) -> bool> Pattern<&OsStr> for F {}

IIRC, this is currently forbidden due to coherence rules.

joshtriplett commented 7 months ago

@pitaj Presumably we'd do this the same way we already did for Pattern.

pitaj commented 7 months ago

AFAIK all of today's Patterns are defined in core, but OsStr lives in std. That's where the coherence issues come in.

joshtriplett commented 7 months ago

@pitaj Ah, I see. I think we have a mechanism that allows us to have impls of core traits in std without regard for the orphan rule, which would address that.

pitaj commented 7 months ago

Last I checked, there's a mechanism for inherent impls, but not for traits. That said, it's probably something that could be added.

programmerjake commented 7 months ago

we could put just the OsStr struct in core as an implementation detail, along with necessary trait impls, and put the rest in std with a re-export of OsStr...this would allow implementing:

impl<F: FnMut(char) -> bool> Pattern<&OsStr> for F {}
m-ou-se commented 7 months ago

We briefly discussed this in last week's libs-api meeting. While we agree it'd be good for OsStr to have a more complete API, we're worried about the amount of string types: should CStr and [ascii::Char] have the same api, for example?

It'd be good to first explore solutions that could benefit all string types, before continuing with this proposal to extend OsStr itself. For example, could a trait or another mechanism be used to make this api availabel to all string types?

epage commented 7 months ago

So if I understand correctly, the desire is to explore the design of a Pattern Extension trait with methods like contains, starts_with, etc that can apply to all string-like types?

Including CStr in this list is interesting because some methods, like strip_prefix, will require allocating into a CString. To be clear, that is expected and part of the design requirements for this?

programmerjake commented 7 months ago

Including CStr in this list is interesting because some methods, like strip_prefix, will require allocating into a CString.

i think you meant strip_suffix? strip_prefix should be able to return &CStr

epage commented 7 months ago

Good point, and figuring out how we should handle strip_prefix vs all other slicing operations adds another area for design exploration and discussions. Do we have all slicing operations treated the same, having CStr always return an owned type? Or do we hard code into the trait that only leading/middle/arbitrary content might be owned and trailing content is always borrowed.

For example, CStr::split_once could return either (Owned, &Borrowed) or (Owned, Owned) while split would always return impl Iterator<Item=Owned> (or maybe Cow?).

programmerjake commented 7 months ago

maybe CStr::split_once could return something more like (&[u8], &CStr)?

blyxxyz commented 7 months ago

CStr operations could return &[u8] and let the caller decide whether to turn it into CString (at the cost of a null byte check). Or they could mutate the original string to insert null bytes, strtok-style (probably a bad idea). Or there could be a CStrSlice type that has no terminator but is guaranteed to contain no null bytes...

Should function arguments be full-blown CStrs? It's important that they not have internal null bytes but for e.g. a search pattern there's no technical reason for the terminator, and in some cases that could require an allocation. (The recently stabilized C string literals would help with this at least.)

But mainly: Should CStr have this API at all? My experience is very limited but it does include a case where raw C string pointers were passed around arbitrarily when they should have been converted ASAP after the function call. How big of a use case is there for manipulating these instead of converting them immediately at the FFI boundary? Unlike OS strings they're just bytes with a validity requirement, so they're easier to convert.

Should [u8] get this API? Cstr would then get lossless access through to_bytes().

programmerjake commented 7 months ago

wouldn't CStrSlice just be [NonZeroU8] (except that's kinda hard to work with currently due to lacking lots of byte string methods)

tgross35 commented 6 months ago

I like the trait idea, especially since it will allow writing combinators that work for any string type and being able to reuse them on both CStrs from FFI and &strs from users.

Maybe a trait could look something like this (ignoring lifetimes):

trait SliceLikePattern: ToOwned {
    // Yes, we don't have associated type defaults...

    /// Result of splitting items
    type Partial = Self;

    /// Rightmost result of split items if different than `Partial`, e.g. for `CStr`
    type PartialRight = Self::Partial;

    /// Pattern type used when a single element will be extracted. `u8` for `&[u8]`,
    /// `str::pattern::Pattern` for str, maybe `u8` or `u16` for `OsStr`
    /// Or maybe  `FnMut(&u8) -> bool` for slices, as in `split_once`
    type ItemPattern;

    /// Pattern type used when a partial (slice) is expected, `&[u8]` for `&[u8]`
    /// still `str::pattern::Pattern` for `str`
    type PartialPattern;

    /// PartialPattern but if there is a specific right-first search
    /// e.g. str's `<P as Pattern<'a>>::Searcher: ReverseSearcher<'a>`
    type PartialPatternReverse = Self::PartialPattern;

    fn split_at(&self, mid: usize) -> (&Self::Partial, &Self::PartialRight);
    fn split_at_mut(&self, mid: usize) -> (&mut Self::Partial, &mut Self::PartialRight);
    fn contains<P: Self::ItemPattern>(&self, pat: P) -> bool;
    fn starts_with<P: Self::PartialPattern>(&self, pat: P) -> bool;
    fn ends_with<P: Self::PartialPatternReverse>(&self, pat: P) -> bool;
    fn find<P: Self::PartialPattern>(&self, pat: P) -> Option<usize>;
    fn rfind<P: Self::PartialPatternReverse>(&self, pat: P) -> Option<usize>;
    fn split<P: Self::PartialPattern>(&self, pat: P) -> Split<P>;
    // ... similar variants of iterating splits and matches
    fn split_once<P: Self::ItemPattern>(&self, pat: P) -> Option<(&Self::Partial, &Self::PartialRight)>;
    fn rsplit_once<P: Self::ItemPatternReverse>(&self, pat: P) -> Option<(&Self::Partial, &Self::PartialRight)>;
    // I don't think we can do simple `trim_{start, end}` here or anything else that
    // relies on whitespace knowledge
    fn trim_start_matches<P: Self::PartialPattern>(&self, pat: P) -> &Self::PartialRight;
    fn trim_end_matches<P: Self::PartialPatternReverse>(&self, pat: P) -> &Self::Partial;
    fn strip_prefix<P: Self::PartialPattern>(&self, pat: P) -> Option<&Self::PartialRight>;
    fn strip_suffix<P: Self::PartialPatternReverse>(&self, pat: P) -> Option<&Self::Partial>;
    fn replace<P: Self::PartialPattern>(&'a self, from: P, to: &Self::PartialRight) -> <Self as ToOwned>::ToOwned;
    fn repeat<P: Self::PartialPattern>(&'a self, from: P, repeat: usize) -> <Self as ToOwned>::ToOwned;
}

There probably isn't anything that restricts this to string-like types, I could see a lot of this being beneficial to let this apply to anything.