rust-lang / libs-team

The home of the library team
Apache License 2.0
132 stars 19 forks source link

slice contains subslice #499

Closed folkertdev closed 13 hours ago

folkertdev commented 1 week ago

Proposal

Problem statement

Determining whether a slice is contained within another slice, analogous to "foobar".contains("foo"), but for slices.

Motivating examples or use cases

I recently had cause to write

fn contains_osstr(haystack: impl AsRef<OsStr>, needle: impl AsRef<OsStr>) -> bool {
    let needle = needle.as_ref().as_encoded_bytes();
    let haystack = haystack.as_ref().as_encoded_bytes();

    haystack.windows(needle.len()).any(|h| h == needle)
}

Partially the problem here is the limited API on OsStr (and similarly Path and CStr), but I've wanted a "contains slice" operation on just standard slices too. It is especially odd that the contains operation is defined on &str, but not when you drop down to raw byte values.

I see two problems

Solution sketch

I think the nicest solution is to mirror the core::str::pattern design, so that we could have

impl<T> [T] {
    pub fn contains(&self, pat: P) -> bool
    where 
        T: PartialEq
        P: core::slice::pattern::Pattern<T>;

    pub fn find(&self, pat: P) -> Option<usize>
    where 
        T: PartialEq
        P: core::slice::pattern::Pattern<T>;
}

This appears to work out

trait Pattern<T:PartialEq>: Sized { /* ... */ }

impl<T: PartialEq> Pattern<T> for T { /* ... */ }

impl<'b, T: PartialEq> Pattern<T> for &'b [T] { /* ... */ }

// potentially arrays too, maybe even str, OsStr, CStr, and so on

And it looks backwards-compatible to me, but I'm not 100% sure that it is.

Some final notes:

Alternatives

This rustc issue https://github.com/rust-lang/rust/issues/54961 proposed instead

fn contains_subslice<T: PartialEq>(data: &[T], needle: &[T]) -> bool {
    data
    .windows(needle.len())
    .any(|w| w == needle)
}

fn position_subslice<T: PartialEq>(data: &[T], needle: &[T]) -> Option<usize> {
    data
    .windows(needle.len())
    .enumerate()
    .find(|&(_, w)| w == needle)
    .map(|(i, _)| i)
}

This idea works fine too if the pattern idea above has backwards compatibility issues. (though I'd suggest find_subslice for consistency).

Links and related work

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

Second, if there's a concrete solution:

programmerjake commented 1 week ago

iirc slice::pattern::Pattern is planned to be eventually merged with str::Pattern, however that could be a breaking change with this change:

let array = [Default::default(); 5];
// if this acts like `str::Pattern` then it matches a
// single character that is either 'a' or 'b'.
// if this acts like is proposed for `slice::pattern::Pattern`, then
// it matches the sequence 'a' followed by 'b' and forces `array: [char; 5]`.
array.contains(&['a', 'b']); 
BurntSushi commented 1 week ago

I don't think std should be in the business of providing arbitrary subsequence search.

I would be in favor of adding byte-substring search though, which appears to satisfy your use case. It is unclear why you took the step to generalize.

Could you say why you wrote naive substring search instead of using the memchr (or some other) crate?

folkertdev commented 1 week ago

iirc slice::pattern::Pattern is planned to be eventually merged with str::Pattern

Hmm yeah if that's still the plan then e.g. contains_slice or contains_subslice might be better.


I don't think std should be in the business of providing arbitrary subsequence search.

What makes it different from string subsequence search to you?

I would be in favor of adding byte-substring search though, which appears to satisfy your use case. It is unclear why you took the step to generalize.

Yes I think [u8] is by far the most common case. Limiting to that seemed weird to me (I don't think slice methods generally do that?) but hey if that works I'm fine with it.

Could you say why you wrote naive substring search instead of using the memchr (or some other) crate?

Well using a dependency just really didn't seem worth it here. I'm looking for something that is clear and quick and not terrible performance-wise. Something like memchr is clearly better if string search is a core part of your program, and then it makes sense to include it.

BurntSushi commented 1 week ago

What makes it different from string subsequence search to you?

Everything. It's an entirely different problem. Most substring search algorithms are tuned to small alphabets. As you point out, you would need specialization for the bytes case anyway. And on top of that, arbitrary subsequence search is an extremely niche problem. That should absolutely be in a crate, not std.

joshtriplett commented 1 week ago

In today's @rust-lang/libs-api meeting, we discussed at length the possible paths forward for this.

We had consensus that we do want to add most methods from ByteSlice and make them available on at least [u8]/Vec<u8>/etc.

We also observed that this was closely related to the similar desire on OsStr/OsString, and talked about that a bit, but we tried to narrow things down to see if we could get a consensus on byte slices specifically.

We spent a lot of time bikeshedding a few possible interfaces, and looking at the methods provided by bstr::ByteSlice.

Attempting to separate and summarize the different axes of the discussion (and using phrasing like "some folks felt that" here because in some cases not all of the separated-out aspects had clear positions from everyone present, so I'll leave it to individual members to comment on which positions they prefer):

Where to put the methods (which interacts with what to name them)

1) We could add these methods on just byte slices/arrays/vecs, and not handle the general T case.

a. We could name those methods `find`/`contains`/etc, to the extent they don't conflict with existing methods. Some folks felt that we should then not use method names like `find`/`contains`/etc because those would conflict with names we might want to use (or are already using) for more general methods on `[T]`.

b. We could add methods from `bstr::ByteSlice` with a `_bytes` suffix (e.g. `find_bytes`, `contains_bytes`). Some folks felt that the suffixes should be unnecessary and seemed unfortunate.

2) We could add these methods on an extension trait, and implement that extension trait for the slice types we want. We observed that bstr uses the ByteSlice extension trait for this. Some folks felt that doing this would still cause naming conflicts and semantic questions if we used names like find, though, and pushed back on using extension traits for "namespacing". Some folks also felt that bstr's use of extension traits seemed like a workaround for not being able to add new inherent methods from outside the standard library, and felt that we shouldn't have to do the same inside the standard library when we have the option of adding inherent methods instead.

3) We could add these methods on [T]/Vec<T>, have very basic implementations for the T case, and specialize to optimize the u8 case. We all felt that the T methods would occasionally be useful, but some folks felt that they might not be worth writing or having in the standard library, and there was support in the meeting for @BurntSushi's comments against covering the general [T] case.

4) We could add dedicated wrapper types like BStr/BString, and only add the methods there. This would have the advantage of semantically identifying which types are "kinda UTF-8". Some folks felt that this would be less convenient, however, and observed that bstr itself has moved away from this approach.

5) We could do a combination of 1b and 4: put the methods on slices and suffix them with _bytes, and if we add wrapper types in the future, add the methods there without the suffix. Some folks felt this added more complexity and surface area, though.

What the methods should accept

1) The methods could accept AsRef<[u8]>. This would allow most types people would want to pass. However, it wouldn't allow passing a predicate function (e.g. b == b'/' || b == b'\\'). It also wouldn't allow passing a set of bytes to efficiently search for. For cases that make sense to support bytesets, we could add _byteset methods (e.g. find_byteset).

2) The methods could accept a sealed trait BytePattern, and we could implement BytePattern for specific types, including everything that you can get a &[u8] from, as well as predicates from u8 to bool, and a future ByteSet type or similar. This would have the advantage of being extensible and convenient, and not requiring e.g. a separate find_byteset method. This would have the disadvantage that every method accepting a BytePattern would have to support every case.

joshtriplett commented 1 week ago

Taking off my team hat, and speaking personally:

For where to put the methods, I'd advocate approach 1b (add as inherent with _bytes suffixes), and possibly approach 5 in the future (add unsuffixed methods to future BStr/BString types). However, I wouldn't object to approach 3 (add with unsuffixed names and support T), with the caveat that I agree with @BurntSushi's comments and I don't think the T case is worth writing or supporting in the standard library. I'd push back on approaches 2 and 4.

For how to define the methods, I'm fine with either approach. I think for simplicity it's easy enough to add find_bytes and find_byteset as separate methods. I think a sealed BytePattern trait would be more elegant, as long as going that route doesn't entail a massive amount of additional bikeshedding.

folkertdev commented 1 day ago

That broadly sounds good to me. Approach 1b covers the vast majority of use cases. I do think that when the signature is limited to just u8 reflecting that in the function name is the right call . I still don't really see the problem with approach 3, but I guess that comes down to a personal judgement of when something feels right for the standard library. Given the list of possible additions in the meeting notes it is a substantial amount of code.

However, it looks like no consensus was reached on a particular way forward. So, what is stopping this from sitting around for another 6 years like the original rustc issue?

BurntSushi commented 1 day ago

4. We could add dedicated wrapper types like BStr/BString, and only add the methods there. This would have the advantage of semantically identifying which types are "kinda UTF-8". Some folks felt that this would be less convenient, however, and observed that bstr itself has moved away from this approach.

To elaborate a bit more on this option, bstr 0.1 had this design. It had no extension traits and instead just defined new string types. I really liked this, and honestly, still do. But when I went to actually use it, it had really horrendous ergonomics because so many things want to use &[u8]. So you end up needing to convert quite a lot. And it was just overall annoying. Enough so that I scrapped that entire approach and moved to extension traits. I kept the BString and BStr types (that deref to the obvious types) around because they are useful as targets of trait impls (like for Serde and, my favorite, Debug). But they aren't the centerpiece of the crate like they were in 0.1. I elaborated more on this in the bstr issue tracker.

I'm open to the idea of adding BString and BStr to std, but probably in the same sense that bstr uses them (as a trait target that deref to Vec<u8> and [u8], respectively) rather than as a separate type where we implement a bunch of byte string methods.

I like 1b, and the shenanigans with BytePattern seem worth at least experimenting with. I'd like to accept the ACP on those grounds. I wasn't at the meeting so I don't have a good sense of how strongly folks felt about that. To be clear, I agree that using a _bytes suffix is annoying. If there's any objection to that, please chime in, otherwise I'll circle back around to formally accept this in a week or so.

I do sympathize with the notion that if we provided arbitrary subsequence search, then we could probably use better (i.e., more general) names. But I am very skeptical about adding those to std. While I might have a lot of experience with the &[u8]/&str case, I don't have a lot of experience with the general case. I'd in particular be worried about providing a sub-optimal API. I think the thing that concerns me the most is the alphabet size. Depending on how we design our API (e.g., std tends to not care about amortizing searcher construction and this can really matter), we might lock ourselves into an API that needs to build the searcher for every call, which can be pretty constraining in terms of which implementations you're allowed to use. Anyway, I don't mean this to be an exhaustive treatment, but there's enough questions here in my mind, and the use case is niche enough, that I think the general case is really better left to a domain expert building a crate for it.

the8472 commented 1 day ago

To be clear, I agree that using a _bytes suffix is annoying.

My position on this is that if the methods only take a plain needle, i.e. [T].contains([T]) then we can keep them generic (methods-3). If we take a impl BytePattern (accepts-2) then having suffixed methods (methods-1b) does make at least some sense.

I don't have a lot of experience with the general case. I'd in particular be worried about providing a sub-optimal API. I think the thing that concerns me the most is the alphabet size. Depending on how we design our API (e.g., std tends to not care about amortizing searcher construction and this can really matter), we might lock ourselves into an API that needs to build the searcher for every call, which can be pretty constraining in terms of which implementations you're allowed to use.

Agreed that we likely won't achieve optimal performance in the general case if we support just searching for any [T] where T: PartialEq. You can't get an alphabet size from that anyway. But std does not always have to provide the optimized-special-case-performance. Iterators also do leave perf on the table due to being too conservative (short-circuiting, side-effects, ...) and people write crates like iterator_ilp to squeeze out some more.

So providing a better-than-naive-but-not-perfect implementation can still be worthwhile if it provides enough convenience.

BurntSushi commented 1 day ago

So providing a better-than-naive-but-not-perfect implementation can still be worthwhile if it provides enough convenience.

I don't think enough convenience is provided by such an API. General subsequence search is incredibly niche, unlike iterators. I don't think I've reached for it once in my decades of programming at least. A cursory search of crates.io seems to support this. The only crates I could find providing such functionality are galil-seiferas and subslice (with the latter requiring T: Ord but the former only needing T: Eq). Neither of which appear to be commonly used.

It's worth noting that if we do go with methods-1b, then the naming still leaves room for the general case. It wouldn't be ideal, but it doesn't totally cut out that future expansion should we ever want to explore it. (Although I of course remain deeply skeptical.)

I'm kind of meh on the Pattern trait stuff we have in std (which is why I didn't go that route for bstr), but I get that it avoids a explosion of methods. So for that reason, and for reasons of consistency, I'd be in favor of doing something similar (that is, accepts-2) for byte string search too. So I think that at least aligns me and you on methods-1b and accepts-2.

joshtriplett commented 13 hours ago

We discussed this in today's @rust-lang/libs-api meeting, and we agreed that we should go forward with methods-1b (add byte-specific methods with _bytes suffixes) and try accepts-2 (try BytePattern for now, evaluate at stabilization time). We can consider methods-5 (unsuffixed methods on wrapper types) if and when we add BStr/BString wrapper types.

Note that BytePattern should be sealed.

joshtriplett commented 13 hours ago

Marking as accepted in that form.

joshtriplett commented 13 hours ago

@Amanieu wrote up a quick list of potential candidate methods:

In general, we should consider any reasonable method from bstr::ByteSlice.

tgross35 commented 12 hours ago
* `split_once_bytes`
* `rsplit_once_bytes`
* `trim_start_matches_bytes`
* `trim_end_matches_bytes`
* `strip_prefix_bytes`
* `strip_suffix_bytes`

Minor comment: I think these sound better with the wording flipped, split_bytes_once trim_bytes_start_matches strip_byte_prefix etc.

folkertdev commented 8 hours ago

I made the tracking issue https://github.com/rust-lang/rust/issues/134149, and I'll start the implementation work soon. Likely first getting the trait right with only a couple methods, then fleshing out the rest of the api separately.