Open epage opened 7 months ago
This seems like a good idea to me. Putting these methods on OsStr will allow code to do simple parsing/splitting/etc in safe code. And because this does not add OsStr to Pattern, it remains a simple addition to API surface without any representation changes.
What implements Pattern<&OsStr>?
I've updated the proposal to call that out
impl Pattern<&OsStr> for &str {}
impl Pattern<&OsStr> for char {}
impl Pattern<&OsStr> for &[char] {}
impl<F: FnMut(char) -> bool> Pattern<&OsStr> for F {}
impl Pattern<&OsStr> for &&str {}
impl Pattern<&OsStr> for &String {}
Basically, this is a direct mirror of &str's functionality. I don't think I've seen this called out anywhere in the previous RFCs, but my assumption is that non-UTF-8 byte sequences in OsStr are just considered non-matches.
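To make the "non-matches" assumption concrete, here is a minimal sketch on stable Rust. The helper name `os_contains` is hypothetical and this is not the proposed implementation; it only illustrates that a `&str` needle is always valid UTF-8, so bytes in the `OsStr` that are not valid UTF-8 can never be part of a match.

```rust
use std::ffi::OsStr;

/// Hypothetical stand-in for an `OsStr::contains(&str)` method.
/// It searches the encoded bytes for the UTF-8 bytes of `needle`.
fn os_contains(haystack: &OsStr, needle: &str) -> bool {
    let hay = haystack.as_encoded_bytes();
    let pat = needle.as_bytes();
    // `windows(0)` would panic, so an empty needle is handled first
    // (and, like `str::contains("")`, it always matches).
    pat.is_empty() || hay.windows(pat.len()).any(|w| w == pat)
}
```

For example, `os_contains(OsStr::new("--option=foo"), "option")` is true, while any needle that is absent from the encoded bytes simply yields false.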
impl<F: FnMut(char) -> bool> Pattern<&OsStr> for F {}
IIRC, this is currently forbidden due to coherence rules.
@pitaj Presumably we'd do this the same way we already did for Pattern.
AFAIK all of today's Patterns are defined in core, but OsStr lives in std. That's where the coherence issues come in.
@pitaj Ah, I see. I think we have a mechanism that allows us to have impls of core traits in std without regard for the orphan rule, which would address that.
Last I checked, there's a mechanism for inherent impls, but not for traits. That said, it's probably something that could be added.
We could put just the OsStr struct in core as an implementation detail, along with the necessary trait impls, and put the rest in std with a re-export of OsStr. This would allow implementing:
impl<F: FnMut(char) -> bool> Pattern<&OsStr> for F {}
We briefly discussed this in last week's libs-api meeting. While we agree it'd be good for OsStr to have a more complete API, we're worried about the number of string types: should CStr and [ascii::Char] have the same API, for example?
It'd be good to first explore solutions that could benefit all string types before continuing with this proposal to extend OsStr itself. For example, could a trait or another mechanism be used to make this API available to all string types?
So if I understand correctly, the desire is to explore the design of a Pattern extension trait with methods like contains, starts_with, etc. that can apply to all string-like types?
Including CStr in this list is interesting because some methods, like strip_prefix, will require allocating into a CString. To be clear, is that expected and part of the design requirements for this?
> Including CStr in this list is interesting because some methods, like strip_prefix, will require allocating into a CString.
I think you meant strip_suffix? strip_prefix should be able to return &CStr.
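A small sketch of why strip_prefix can stay borrowed: the remainder still ends with the original terminator, so it is already a valid C string. The free function `cstr_strip_prefix` is a hypothetical name used only for illustration, and the prefix is taken as plain bytes (an assumption, not something the thread settled).

```rust
use std::ffi::CStr;

/// Hypothetical helper: remove `prefix` from the front of `s` without allocating.
fn cstr_strip_prefix<'a>(s: &'a CStr, prefix: &[u8]) -> Option<&'a CStr> {
    // Work on the bytes including the trailing nul so the remainder keeps it.
    let rest = s.to_bytes_with_nul().strip_prefix(prefix)?;
    // `rest` has no interior nul (it is a tail of a valid C string); this only
    // fails if `prefix` itself swallowed the terminator, which yields `None`.
    CStr::from_bytes_with_nul(rest).ok()
}
```

A strip_suffix counterpart cannot work this way, because removing bytes from the end also removes the position where the terminator has to sit.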
Good point, and figuring out how we should handle strip_prefix vs all other slicing operations adds another area for design exploration and discussion. Do we have all slicing operations treated the same, having CStr always return an owned type? Or do we hard-code into the trait that only leading/middle/arbitrary content might be owned and trailing content is always borrowed?
For example, CStr::split_once could return either (Owned, &Borrowed) or (Owned, Owned), while split would always return impl Iterator<Item=Owned> (or maybe Cow?).
Maybe CStr::split_once could return something more like (&[u8], &CStr)?
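As a rough sketch of that (&[u8], &CStr) shape, assuming a single-byte delimiter: the left half loses its terminator and so can only be plain bytes, while the right half keeps the original terminator and can remain a &CStr. `cstr_split_once` is a hypothetical free function, not an existing or proposed std API.

```rust
use std::ffi::CStr;

/// Hypothetical helper: split at the first `delim` byte without allocating.
fn cstr_split_once(s: &CStr, delim: u8) -> Option<(&[u8], &CStr)> {
    // Search only the contents; the terminator is never a valid delimiter.
    let idx = s.to_bytes().iter().position(|&b| b == delim)?;
    let bytes = s.to_bytes_with_nul();
    // The tail still ends with the original nul and has no interior nul.
    let tail = CStr::from_bytes_with_nul(&bytes[idx + 1..]).ok()?;
    Some((&bytes[..idx], tail))
}
```

The caller can then decide whether the left half is worth turning into a CString.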
CStr operations could return &[u8] and let the caller decide whether to turn it into CString (at the cost of a null byte check). Or they could mutate the original string to insert null bytes, strtok-style (probably a bad idea). Or there could be a CStrSlice type that has no terminator but is guaranteed to contain no null bytes...
Should function arguments be full-blown CStrs? It's important that they not have internal null bytes, but for e.g. a search pattern there's no technical reason for the terminator, and in some cases that could require an allocation. (The recently stabilized C string literals would help with this at least.)
But mainly: Should CStr have this API at all? My experience is very limited but it does include a case where raw C string pointers were passed around arbitrarily when they should have been converted ASAP after the function call. How big of a use case is there for manipulating these instead of converting them immediately at the FFI boundary? Unlike OS strings they're just bytes with a validity requirement, so they're easier to convert.
Should [u8] get this API? CStr would then get lossless access through to_bytes().
Wouldn't CStrSlice just be [NonZeroU8] (except that's kinda hard to work with currently due to lacking lots of byte string methods)?
I like the trait idea, especially since it will allow writing combinators that work for any string type and reusing them on both CStrs from FFI and &strs from users.
Maybe a trait could look something like this (ignoring lifetimes):
trait SliceLikePattern: ToOwned {
    // Yes, we don't have associated type defaults...
    // Also, `P: Self::SomePattern` is shorthand: associated types can't be used as bounds in real Rust.
    /// Result of splitting items
    type Partial = Self;
    /// Rightmost result of split items if different than `Partial`, e.g. for `CStr`
    type PartialRight = Self::Partial;
    /// Pattern type used when a single element will be extracted. `u8` for `&[u8]`,
    /// `str::pattern::Pattern` for str, maybe `u8` or `u16` for `OsStr`
    /// Or maybe `FnMut(&u8) -> bool` for slices, as in `split_once`
    type ItemPattern;
    /// ItemPattern for a right-first search (assumed here, since `rsplit_once` refers to it)
    type ItemPatternReverse = Self::ItemPattern;
    /// Pattern type used when a partial (slice) is expected, `&[u8]` for `&[u8]`
    /// still `str::pattern::Pattern` for `str`
    type PartialPattern;
    /// PartialPattern but if there is a specific right-first search
    /// e.g. str's `<P as Pattern<'a>>::Searcher: ReverseSearcher<'a>`
    type PartialPatternReverse = Self::PartialPattern;
    fn split_at(&self, mid: usize) -> (&Self::Partial, &Self::PartialRight);
    fn split_at_mut(&mut self, mid: usize) -> (&mut Self::Partial, &mut Self::PartialRight);
    fn contains<P: Self::ItemPattern>(&self, pat: P) -> bool;
    fn starts_with<P: Self::PartialPattern>(&self, pat: P) -> bool;
    fn ends_with<P: Self::PartialPatternReverse>(&self, pat: P) -> bool;
    fn find<P: Self::PartialPattern>(&self, pat: P) -> Option<usize>;
    fn rfind<P: Self::PartialPatternReverse>(&self, pat: P) -> Option<usize>;
    fn split<P: Self::PartialPattern>(&self, pat: P) -> Split<P>;
    // ... similar variants of iterating splits and matches
    fn split_once<P: Self::ItemPattern>(&self, pat: P) -> Option<(&Self::Partial, &Self::PartialRight)>;
    fn rsplit_once<P: Self::ItemPatternReverse>(&self, pat: P) -> Option<(&Self::Partial, &Self::PartialRight)>;
    // I don't think we can do simple `trim_{start, end}` here or anything else that
    // relies on whitespace knowledge
    fn trim_start_matches<P: Self::PartialPattern>(&self, pat: P) -> &Self::PartialRight;
    fn trim_end_matches<P: Self::PartialPatternReverse>(&self, pat: P) -> &Self::Partial;
    fn strip_prefix<P: Self::PartialPattern>(&self, pat: P) -> Option<&Self::PartialRight>;
    fn strip_suffix<P: Self::PartialPatternReverse>(&self, pat: P) -> Option<&Self::Partial>;
    fn replace<P: Self::PartialPattern>(&self, from: P, to: &Self::PartialRight) -> <Self as ToOwned>::Owned;
    fn repeat<P: Self::PartialPattern>(&self, from: P, repeat: usize) -> <Self as ToOwned>::Owned;
}
There probably isn't anything that restricts this to string-like types; I could see it being beneficial to let this apply to anything slice-like.
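For a feel of how a generic combinator over such a trait might look on stable Rust, here is a heavily reduced sketch. `SliceLike`, its two methods, and `key_value` are hypothetical names invented for illustration; the sketch collapses the pattern associated types above into a single `Item` type and only covers `str` and `[u8]`.

```rust
/// Hypothetical, much smaller cousin of the trait sketched above.
trait SliceLike {
    /// Single element used as a separator (`char` for `str`, `u8` for `[u8]`).
    type Item;
    fn split_once_item(&self, item: Self::Item) -> Option<(&Self, &Self)>;
    fn starts_with_part(&self, prefix: &Self) -> bool;
}

impl SliceLike for str {
    type Item = char;
    fn split_once_item(&self, item: char) -> Option<(&str, &str)> {
        self.split_once(item)
    }
    fn starts_with_part(&self, prefix: &str) -> bool {
        self.starts_with(prefix)
    }
}

impl SliceLike for [u8] {
    type Item = u8;
    fn split_once_item(&self, item: u8) -> Option<(&[u8], &[u8])> {
        let idx = self.iter().position(|&b| b == item)?;
        Some((&self[..idx], &self[idx + 1..]))
    }
    fn starts_with_part(&self, prefix: &[u8]) -> bool {
        self.starts_with(prefix)
    }
}

/// A combinator written once and reused for every implementor.
fn key_value<S: SliceLike + ?Sized>(input: &S, sep: S::Item) -> Option<(&S, &S)> {
    input.split_once_item(sep)
}
```

With this, `key_value("a=b", '=')` and `key_value(&b"a=b"[..], b'=')` go through the same code path, which is the kind of reuse the trait idea is after.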
Proposal
Problem statement
With rust-lang/rust#115443, developers, like those writing CLI parsers, can now perform (limited) operations on OsStr, but it requires unsafe to get an OsStr back, requiring the developer to understand and follow some very specific safety notes that cannot be checked by miri.
RFC #2295 exists for improving this but it has stalled out. The assumption here is that part of the problem with that RFC is how wide its scope is and that by shrinking the scope, we can get some benefits now.
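As an illustration of the unsafe round trip described above, this is roughly what a parser has to write today with the stable `as_encoded_bytes` / `from_encoded_bytes_unchecked` pair. The helper `split_once_eq` is a hypothetical name and the code is a sketch of the pattern, not clap's actual implementation.

```rust
use std::ffi::OsStr;

/// Hypothetical helper: split `--option=value` at the first `=`.
fn split_once_eq(arg: &OsStr) -> Option<(&OsStr, &OsStr)> {
    let bytes = arg.as_encoded_bytes();
    let idx = bytes.iter().position(|&b| b == b'=')?;
    let (left, right) = (&bytes[..idx], &bytes[idx + 1..]);
    // SAFETY: both halves come from `as_encoded_bytes` on the same `OsStr`,
    // and the split points sit immediately before and immediately after the
    // ASCII byte `=`, which is a valid non-empty UTF-8 substring.
    unsafe {
        Some((
            OsStr::from_encoded_bytes_unchecked(left),
            OsStr::from_encoded_bytes_unchecked(right),
        ))
    }
}
```

The safety comment is the part that cannot be checked mechanically; a safe `OsStr::split_once('=')` would make both it and the unsafe block unnecessary.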
Motivating examples or use cases
Mostly copied from #306
Argument parsers need to extract substrings from command line arguments. For example, --option=somefilename needs to be split into option and somefilename, and the original filename must be preserved without sanitizing it.
clap currently implements strip_prefix and split_once using transmute (equivalent to the stable encoded_bytes APIs).
The os_str_bytes and osstrtools crates provide high-level string operations for OS strings. os_str_bytes is in the wild mainly used to convert between raw bytes and OS strings (e.g. 1, 2, 3). osstrtools enables reasonable uses of split() to parse $PATH and replace() to fill in command line templates.
Solution sketch
Provide str's Pattern-accepting methods on &OsStr.
Defer out OsStr being used as a Pattern and OsStr indexing support, which are specified in RFC #2295.
Example of methods to be added:
str and if there are any changes between the writing of this ACP and implementation, the focus should be on what str has at the time of implementation (e.g. not adding a deprecated variant but the new one)
trim, trim_start, and trim_end to be consistent with trim_start_matches / trim_end_matches
This should work because
Pattern and, for now, Pattern is nightly only, allowing a lot of flexibility for how we implement OsStr support in the future (e.g. we could go as far as creating an OsPattern trait and switching to it without breaking anyone)
From an API design perspective, there is strong precedent for it
str
OsStr as a pattern, we bypass the main dividing point between proposals (split APIs, panic on unpaired surrogates, switching away from WTF-8)
Alternatives
#306 proposes an OsStr::slice_encoded_bytes
unsafe
Links and related work
306
114
Pattern private
What happens now?
This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.
Possible responses
The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):
Second, if there's a concrete solution: