Closed folkertdev closed 13 hours ago
iirc slice::pattern::Pattern
is planned to be eventually merged with str::Pattern
, however that could be a breaking change with this change:
let array = [Default::default(); 5];
// if this acts like `str::Pattern` then it matches a
// single character that is either 'a' or 'b'.
// if this acts like is proposed for `slice::pattern::Pattern`, then
// it matches the sequence 'a' followed by 'b' and forces `array: [char; 5]`.
array.contains(&['a', 'b']);
I don't think std should be in the business of providing arbitrary subsequence search.
I would be in favor of adding byte-substring search though, which appears to satisfy your use case. It is unclear why you took the step to generalize.
Could you say why you wrote naive substring search instead of using the memchr
(or some other) crate?
iirc
slice::pattern::Pattern
is planned to be eventually merged withstr::Pattern
Hmm yeah if that's still the plan then e.g. contains_slice
or contains_subslice
might be better.
I don't think std should be in the business of providing arbitrary subsequence search.
What makes it different from string subsequence search to you?
I would be in favor of adding byte-substring search though, which appears to satisfy your use case. It is unclear why you took the step to generalize.
Yes I think [u8]
is by far the most common case. Limiting to that seemed weird to me (I don't think slice methods generally do that?) but hey if that works I'm fine with it.
Could you say why you wrote naive substring search instead of using the memchr (or some other) crate?
Well using a dependency just really didn't seem worth it here. I'm looking for something that is clear and quick and not terrible performance-wise. Something like memchr
is clearly better if string search is a core part of your program, and then it makes sense to include it.
What makes it different from string subsequence search to you?
Everything. It's an entirely different problem. Most substring search algorithms are tuned to small alphabets. As you point out, you would need specialization for the bytes case anyway. And on top of that, arbitrary subsequence search is an extremely niche problem. That should absolutely be in a crate, not std.
In today's @rust-lang/libs-api meeting, we discussed at length the possible paths forward for this.
We had consensus that we do want to add most methods from ByteSlice
and make them available on at least [u8]
/Vec<u8>
/etc.
We also observed that this was closely related to the similar desire on OsStr
/OsString
, and talked about that a bit, but we tried to narrow things down to see if we could get a consensus on byte slices specifically.
We spent a lot of time bikeshedding a few possible interfaces, and looking at the methods provided by bstr::ByteSlice
.
Attempting to separate and summarize the different axes of the discussion (and using phrasing like "some folks felt that" here because in some cases not all of the separated-out aspects had clear positions from everyone present, so I'll leave it to individual members to comment on which positions they prefer):
1) We could add these methods on just byte slices/arrays/vecs, and not handle the general T
case.
a. We could name those methods `find`/`contains`/etc, to the extent they don't conflict with existing methods. Some folks felt that we should then not use method names like `find`/`contains`/etc because those would conflict with names we might want to use (or are already using) for more general methods on `[T]`.
b. We could add methods from `bstr::ByteSlice` with a `_bytes` suffix (e.g. `find_bytes`, `contains_bytes`). Some folks felt that the suffixes should be unnecessary and seemed unfortunate.
2) We could add these methods on an extension trait, and implement that extension trait for the slice types we want. We observed that bstr
uses the ByteSlice
extension trait for this. Some folks felt that doing this would still cause naming conflicts and semantic questions if we used names like find
, though, and pushed back on using extension traits for "namespacing". Some folks also felt that bstr
's use of extension traits seemed like a workaround for not being able to add new inherent methods from outside the standard library, and felt that we shouldn't have to do the same inside the standard library when we have the option of adding inherent methods instead.
3) We could add these methods on [T]
/Vec<T>
, have very basic implementations for the T
case, and specialize to optimize the u8
case. We all felt that the T
methods would occasionally be useful, but some folks felt that they might not be worth writing or having in the standard library, and there was support in the meeting for @BurntSushi's comments against covering the general [T]
case.
4) We could add dedicated wrapper types like BStr
/BString
, and only add the methods there. This would have the advantage of semantically identifying which types are "kinda UTF-8". Some folks felt that this would be less convenient, however, and observed that bstr
itself has moved away from this approach.
5) We could do a combination of 1b and 4: put the methods on slices and suffix them with _bytes
, and if we add wrapper types in the future, add the methods there without the suffix. Some folks felt this added more complexity and surface area, though.
1) The methods could accept AsRef<[u8]>
. This would allow most types people would want to pass. However, it wouldn't allow passing a predicate function (e.g. b == b'/' || b == b'\\'
). It also wouldn't allow passing a set of bytes to efficiently search for. For cases that make sense to support bytesets, we could add _byteset
methods (e.g. find_byteset
).
2) The methods could accept a sealed trait BytePattern
, and we could implement BytePattern
for specific types, including everything that you can get a &[u8]
from, as well as predicates from u8
to bool
, and a future ByteSet
type or similar. This would have the advantage of being extensible and convenient, and not requiring e.g. a separate find_byteset
method. This would have the disadvantage that every method accepting a BytePattern
would have to support every case.
Taking off my team hat, and speaking personally:
For where to put the methods, I'd advocate approach 1b (add as inherent with _bytes
suffixes), and possibly approach 5 in the future (add unsuffixed methods to future BStr
/BString
types). However, I wouldn't object to approach 3 (add with unsuffixed names and support T
), with the caveat that I agree with @BurntSushi's comments and I don't think the T
case is worth writing or supporting in the standard library. I'd push back on approaches 2 and 4.
For how to define the methods, I'm fine with either approach. I think for simplicity it's easy enough to add find_bytes
and find_byteset
as separate methods. I think a sealed BytePattern
trait would be more elegant, as long as going that route doesn't entail a massive amount of additional bikeshedding.
That broadly sounds good to me. Approach 1b covers the vast majority of use cases. I do think that when the signature is limited to just u8
reflecting that in the function name is the right call . I still don't really see the problem with approach 3, but I guess that comes down to a personal judgement of when something feels right for the standard library. Given the list of possible additions in the meeting notes it is a substantial amount of code.
However, it looks like no consensus was reached on a particular way forward. So, what is stopping this from sitting around for another 6 years like the original rustc issue?
4. We could add dedicated wrapper types like
BStr
/BString
, and only add the methods there. This would have the advantage of semantically identifying which types are "kinda UTF-8". Some folks felt that this would be less convenient, however, and observed thatbstr
itself has moved away from this approach.
To elaborate a bit more on this option, bstr 0.1
had this design. It had no extension traits and instead just defined new string types. I really liked this, and honestly, still do. But when I went to actually use it, it had really horrendous ergonomics because so many things want to use &[u8]
. So you end up needing to convert quite a lot. And it was just overall annoying. Enough so that I scrapped that entire approach and moved to extension traits. I kept the BString
and BStr
types (that deref to the obvious types) around because they are useful as targets of trait impls (like for Serde and, my favorite, Debug
). But they aren't the centerpiece of the crate like they were in 0.1
. I elaborated more on this in the bstr
issue tracker.
I'm open to the idea of adding BString
and BStr
to std
, but probably in the same sense that bstr
uses them (as a trait target that deref to Vec<u8>
and [u8]
, respectively) rather than as a separate type where we implement a bunch of byte string methods.
I like 1b, and the shenanigans with BytePattern
seem worth at least experimenting with. I'd like to accept the ACP on those grounds. I wasn't at the meeting so I don't have a good sense of how strongly folks felt about that. To be clear, I agree that using a _bytes
suffix is annoying. If there's any objection to that, please chime in, otherwise I'll circle back around to formally accept this in a week or so.
I do sympathize with the notion that if we provided arbitrary subsequence search, then we could probably use better (i.e., more general) names. But I am very skeptical about adding those to std. While I might have a lot of experience with the &[u8]
/&str
case, I don't have a lot of experience with the general case. I'd in particular be worried about providing a sub-optimal API. I think the thing that concerns me the most is the alphabet size. Depending on how we design our API (e.g., std tends to not care about amortizing searcher construction and this can really matter), we might lock ourselves into an API that needs to build the searcher for every call, which can be pretty constraining in terms of which implementations you're allowed to use. Anyway, I don't mean this to be an exhaustive treatment, but there's enough questions here in my mind, and the use case is niche enough, that I think the general case is really better left to a domain expert building a crate for it.
To be clear, I agree that using a _bytes suffix is annoying.
My position on this is that if the methods only take a plain needle, i.e. [T].contains([T])
then we can keep them generic (methods-3). If we take a impl BytePattern
(accepts-2) then having suffixed methods (methods-1b) does make at least some sense.
I don't have a lot of experience with the general case. I'd in particular be worried about providing a sub-optimal API. I think the thing that concerns me the most is the alphabet size. Depending on how we design our API (e.g., std tends to not care about amortizing searcher construction and this can really matter), we might lock ourselves into an API that needs to build the searcher for every call, which can be pretty constraining in terms of which implementations you're allowed to use.
Agreed that we likely won't achieve optimal performance in the general case if we support just searching for any [T] where T: PartialEq
. You can't get an alphabet size from that anyway.
But std does not always have to provide the optimized-special-case-performance. Iterators also do leave perf on the table due to being too conservative (short-circuiting, side-effects, ...) and people write crates like iterator_ilp to squeeze out some more.
So providing a better-than-naive-but-not-perfect implementation can still be worthwhile if it provides enough convenience.
So providing a better-than-naive-but-not-perfect implementation can still be worthwhile if it provides enough convenience.
I don't think enough convenience is provided by such an API. General subsequence search is incredibly niche, unlike iterators. I don't think I've reached for it once in my decades of programming at least. A cursory search of crates.io seems to support this. The only crates I could find providing such functionality are galil-seiferas
and subslice
(with the latter requiring T: Ord
but the former only needing T: Eq
). Neither of which appear to be commonly used.
It's worth noting that if we do go with methods-1b
, then the naming still leaves room for the general case. It wouldn't be ideal, but it doesn't totally cut out that future expansion should we ever want to explore it. (Although I of course remain deeply skeptical.)
I'm kind of meh on the Pattern
trait stuff we have in std
(which is why I didn't go that route for bstr
), but I get that it avoids a explosion of methods. So for that reason, and for reasons of consistency, I'd be in favor of doing something similar (that is, accepts-2
) for byte string search too. So I think that at least aligns me and you on methods-1b
and accepts-2
.
We discussed this in today's @rust-lang/libs-api meeting, and we agreed that we should go forward with methods-1b (add byte-specific methods with _bytes
suffixes) and try accepts-2 (try BytePattern
for now, evaluate at stabilization time). We can consider methods-5 (unsuffixed methods on wrapper types) if and when we add BStr/BString wrapper types.
Note that BytePattern
should be sealed.
Marking as accepted in that form.
@Amanieu wrote up a quick list of potential candidate methods:
contains_bytes
find_bytes
rfind_bytes
split_bytes
rsplit_bytes
split_once_bytes
rsplit_once_bytes
splitn_bytes
rsplitn_bytes
replace_bytes
replacen_bytes
starts_with_bytes
ends_with_bytes
matches_bytes
rmatches_bytes
match_indices_bytes
rmatch_indices_bytes
trim_start_matches_bytes
trim_end_matches_bytes
strip_prefix_bytes
strip_suffix_bytes
In general, we should consider any reasonable method from bstr::ByteSlice
.
* `split_once_bytes` * `rsplit_once_bytes` * `trim_start_matches_bytes` * `trim_end_matches_bytes` * `strip_prefix_bytes` * `strip_suffix_bytes`
Minor comment: I think these sound better with the wording flipped, split_bytes_once
trim_bytes_start_matches
strip_byte_prefix
etc.
I made the tracking issue https://github.com/rust-lang/rust/issues/134149, and I'll start the implementation work soon. Likely first getting the trait right with only a couple methods, then fleshing out the rest of the api separately.
Proposal
Problem statement
Determining whether a slice is contained within another slice, analogous to
"foobar".contains("foo")
, but for slices.Motivating examples or use cases
I recently had cause to write
Partially the problem here is the limited API on
OsStr
(and similarlyPath
andCStr
), but I've wanted a "contains slice" operation on just standard slices too. It is especially odd that the contains operation is defined on&str
, but not when you drop down to raw byte values.I see two problems
Solution sketch
I think the nicest solution is to mirror the
core::str::pattern
design, so that we could haveThis appears to work out
And it looks backwards-compatible to me, but I'm not 100% sure that it is.
Some final notes:
u8
andT: Copy
)Alternatives
This rustc issue https://github.com/rust-lang/rust/issues/54961 proposed instead
This idea works fine too if the pattern idea above has backwards compatibility issues. (though I'd suggest
find_subslice
for consistency).Links and related work
What happens now?
This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.
Possible responses
The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):
Second, if there's a concrete solution: