rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
97.58k stars 12.62k forks source link

Tracking issue for string patterns #27721

Open alexcrichton opened 9 years ago

alexcrichton commented 9 years ago

(Link to original RFC: https://github.com/rust-lang/rfcs/pull/528)

This is a tracking issue for the unstable pattern feature in the standard library. We have many APIs which support the ability to search with any number of patterns generically within a string (e.g. substrings, characters, closures, etc), but implementing your own pattern (e.g. a regex) is not stable. It would be nice if these implementations could indeed be stable!

Some open questions are:

cc @Kimundi

bluss commented 9 years ago

String Patterns RFC tracking issue: #22477

Stabilization not only impacts implementing the pattern traits, but also of course detailed use of the Searcher trait, .next_match(), .next_reject() and so on.

jneem commented 9 years ago

I'm having trouble seeing the purpose of the next() method; in all the examples I look at, it's faster and cleaner to just implement next_match() and next_reject() individually. For example, these two functions can be implemented for CharEqSearcher as one-liners using Iterator::find. Moreover, if you want an optimized implementation of Searcher for an ASCII char then in implementing next() you need to choose between an implementation that returns rejections quickly and an implementation that skips quickly over the input (e.g. using SIMD) in order to quickly find a match.

bluss commented 9 years ago

@jneem the StrSearcher uses the same code for next() and next_match(), but specialized for each case, so that the next_match() case is much faster. It works well.

jneem commented 9 years ago

@bluss I think that helps make my point: AFAICT, all the users of Searcher only call next_match() and next_reject(). Therefore, the only purpose of next() AFAICT is to make implementing Searcher easier -- you only need to implement one function instead of two. But that benefit isn't born out in practice. StrSearcher implements two methods anyway, and it would be simpler to implement next_reject() instead of next(). CharEqSearcher implements next() only, but it would be simpler and cleaner to optimize if it implemented next_match() and next_reject() instead.

jneem commented 9 years ago

Oops, I missed one: Pattern uses Searcher::next() for the default implementation of is_prefix_of.

mkpankov commented 8 years ago

This is useful, I'd like to use it on stable.

bluss commented 8 years ago

cc @BurntSushi, Regex is the major pattern user and it would be great for everyone if it was stable because of that.

bluss commented 8 years ago

@BurntSushi Are the Pattern traits sufficient for Regex? Any design issues?

BurntSushi commented 8 years ago

I wrote the impl for Regex a while back, which does indeed only implement next: https://github.com/rust-lang-nursery/regex/blob/master/src/re.rs#L1104 I seem to recall that it was tricky to get the corner cases right, but that isn't too surprising. I otherwise found it pretty straight forward.

One thing that I don't think is representable with the Pattern API is fast retrieval of all matches from a suffix table. In particular, matches are reported as part of lexicographically sorted suffixes rather than order of occurrence in the string, so it can't satisfy the contract of Searcher without an additional cost. (I've already talked to @Kimundi about this, and I don't think it's a major issue. Suffix tables are a pretty niche data structure.)

aturon commented 8 years ago

Nominating for discussion. Not sure whether this is ready for stabilization, but I'd like for the team to dig into it a bit.

alexcrichton commented 8 years ago

Unfortunately the libs team didn't get a chance to talk about this in terms of stabilization for 1.8, but I'm going to leave the nominated tag as I think we should chat about this regardless.

Kimundi commented 8 years ago

Today I investigated a bit what would need to be done to generalize the Pattern API to arbitrary slice types. As part of that I also took a more pragmatic route to the involved types and interfaces based on these assumptions:

A very rough sketch of the result can be seen below. Note that this is only the Pattern traits itself, without actual integration in the std lib or comprehensive re implementation of the existing types.

https://github.com/Kimundi/pattern_api_sketch/blob/master/src/v5.rs

The core changes are:


Changes in regard to the open questions:

bluss commented 8 years ago

Is the focus of Pattern for slices going to be specific for &[u8] only? I think that's ok, I'm unsure how to really extend "substring search" further to generic &[T]. Technically, the two-way algorithm that we use for str::find(&str) can be made to work on any ordered alphabet, i.e. T: Ord is enough, but I don't know how performant or realistic this is.

Kimundi commented 8 years ago

@bluss: I only used &[u8] as a quick proof of concept that the traits work with different slice types. I'm assuming that there is some sensible way to make it work with generic slices.

The &[u8] case seems like a good candidate for a specialization impl though.

DanielKeep commented 8 years ago

I'm currently trying out some runtime scanners for scan-rules. Basically: I want users to be able to parse text based on Patterns (e.g. accept anything that matches this pattern, consume input until this pattern is matched, etc.). The current example is being able to do something like (where until accepts a Pattern):

let_scan!("before:after", (let word_0 <| until(":"), ":", let word_1: Everything));
assert_eq!(word_0, "before");
assert_eq!(word_1, "after");

The problem I'm having is that it's really painful to actually do this. The interface as it exists seems to assume that ownership of the pattern can be passed to the method which will use it, and as a result, can only be used exactly once. This doesn't make much sense to me. Searching for a pattern should not (in general) consume the pattern.

What I want is P: Patternfor<P: Pattern> &P: Pattern. Currently, it's only true for &str. If I had this, I could take ownership of the pattern from the caller, store it, and loan it to the underlying find calls. If the caller wants to use the pattern in multiple places, and it's expensive to construct, they can instead pass my function a borrow, which will also work.

The more I think about this, the more it comes down to the FnMut implementation. The only closure kind that would allow for the kind of interface I want is Fn... but that excludes closures that test based on accumulated state, though I wonder how often that's even desireable.

I can work around this by requiring Copy patterns, but again, callables are the most useful kind of pattern (until(char::is_whitespace), etc.), so it seems a deep shame to exclude them.

DanielKeep commented 8 years ago

Oh, another thing I just realised: there doesn't appear to be any way to find out how long a match is given a pattern and, say, str::starts_with.

bluss commented 8 years ago

You can use .next_reject for that.

withoutboats commented 8 years ago

Hey, I don't know a lot about this API, but I noticed that StrSearcher is not a double ended pattern, but CharSliceSearcher is. Is this an error? The explanation of why StrSearcher is not double ended seems to apply equally well to CharSliceSearcher:

(&str)::Searcher is not a DoubleEndedSearcher because the pattern "aa" in the haystack "aaa" matches as either "[aa]a" or "a[aa]", depending from which side it is searched.

bluss commented 8 years ago

@withoutboats A slice of chars represents a set of possibilities, so it's not like a string; either of the chars can be matched by themselves.

withoutboats commented 8 years ago

@bluss That makes sense. The semantics of CharSliceSearcher are not documented; I assumed the char slice was treated as an ordered sequence rather than a set.

Stebalien commented 8 years ago

Can we say that, after returning Done once, a Searcher must continue to return Done forever more? That is, can I make this assumption when implementing FusedIterator?

shahn commented 7 years ago

I'm a big fan of extending this to be more generic than just str, needed/wanted it for [u8] quite frequently. Unfortunately, there's an API inconsistency already - str::split takes a Pattern, whereas slice::split takes a predicate function that only looks at a single T in isolation.

RReverser commented 6 years ago

@shahn That shouldn't be a big problem - if Pattern is implemented for such function, then slice::split could support Pattern without breaking backwards compatibility.

haudan commented 6 years ago

I strongly agree with widening the API to slice and maybe Vec.

bstrie commented 6 years ago

What's the status here? The original RFC is quite old -- from 2014, before 1.0 -- do we need to completely revisit the design? What are the current blockers?

shepmaster commented 6 years ago

With SIMD stabilizing in 1.27, Jetscii will be available on stable, but its Pattern support won't — I never expected that SIMD would arrive before this! 😇

jeekobu commented 6 years ago

Would be nice to see these for OsStr, CStr, [u8] and similar, as mentioned a year ago. Non-UTF in Rust feels very much a second class citizen and this would be a start.

SimonSapin commented 6 years ago

@jeekobu https://github.com/rust-lang/rust/issues/49802 is accepted but not yet implemented.

Mingun commented 5 years ago

I think, that Searcher need skip method(s) that can be used to advance search positions without performing search. For example, proof-of-concept of inplace unescape requires this method to work correctly.

ExpHP commented 5 years ago

I'm looking at this and there doesn't seem to be any way to efficiently perform a search for a single needle in many haystacks? The only way to make a Searcher currently is by starting with a str for both the needle and the haystack. This leads to an expensive call to ToWaySearcher::new which performs computations that only seem to depend on the needle.

(see how expensive these calls are here: https://users.rust-lang.org/t/why-my-rust-code-is-2-times-slower-than-my-js-code/31189/8)

But it doesn't look like this is impossible with the current API. It looks to me like a new public type StrPattern could be added which implements Pattern and contains these precomputed results?

BurntSushi commented 5 years ago

Aho-Corasick (or similar) is probably the right answer there, but the bstr crate does provide a way to do what you want (including not needing to do utf8 validation, if that helps): https://docs.rs/bstr/0.2.7/bstr/struct.Finder.html

I do think it is a somewhat niche concern though. Consider that libc APIs like memmem don't permit this either.

ibraheemdev commented 3 years ago

Why is Pattern not implemented for String?

bstrie commented 3 years ago

It is implemented for &String, is there a reason to implement it for String as well?

Yam76 commented 3 years ago

Sorry if there is already a way to do this, but as part of this issue, can Pattern be implemented for &[&str] to avoid allocation? (IIRC slices avoid allocation) e.g.

let s = "opt=val";
let x = &["opt", "="];
assert!(s.starts_with(x));
jhpratt commented 3 years ago

That could conceivably work in either of two ways: what I suspect you want (one followed by the other) or one or the other. That by itself is probably reason enough to avoid that impl.

You'd still be able to avoid allocations by using .as_bytes(), though.

Yam76 commented 3 years ago

I think it's reasonable to prefer "one followed by the other" over "one or the other". "one or the other" can be done "relatively" cleanly by iterating over the strs, e.g.

let x = &["apt", "apc"];
assert_eq!(
x.map(|e| s.starts_with(e)).fold(|acc, x| acc || x), // wrote this quickly
s.starts_with(x) // "one or the other"
);

Meanwhile "one followed by the other" needs to use as_bytes (as you said) to avoid allocations, which is much more opaque (IMO).

As a further point, &[char] looks to see if any char in the slice is in the string rather than if the characters in the slice are in the string in order, so I don't know if conceivably working in two different ways is a good justification for not implementing, since &[char] chooses one anyways (though admittedly it chooses "one or the other").

bluss commented 3 years ago

One or the other seems rather more intuitive. Maybe there should be wrapper newtypes or an enum to separate the cases (it would make for more explicit code, albeit less discoverable). .contains(AllInOrder(["a", "b"])) or .contains(Any(["a", "b", "c"])) and so on.

With following in sequence there is also the question of - is it about following each other exactly (no gap between the matches) or not - for .contains() both alternatives could be useful - it's like contains all (any order), contains all (in order), contains any etc.

Yam76 commented 3 years ago

Wrapper newtypes sound good. They are probably more useful than enums in this case since the accepting function can distinguish newtypes at the type level and newtypes are more easily extendable, as the different cases of patterns don't seem to be super well enumerated. Maybe implementing some reasonably common ones like Or (one or the other), ExactOrder (one after the other, no gaps), and InOrder (one after the other, gaps), so users don't have to get their hands into unsafe Searcher would be good. Then adding some examples to Pattern to make them more discoverable.

This would let us have default implementations for types like &[char] while also getting rid of concerns about working in several different ways, since we can just disambiguate with the newtypes. It also opens up nesting, which could be useful. (Though I'll mention that we seem to be slowly approaching regex.)

BurntSushi commented 3 years ago

FWIW, I would probably be opposed to implementing the "or" variant in std on strings. The simple implementation of this is a performance footgun as soon as it gets beyond a few patterns. We would invariably find ourselves recommending people "not use it" unless the inputs are small. The alternative, to make it fast, is a lot of work that probably shouldn't live inside std: https://github.com/BurntSushi/aho-corasick/blob/master/DESIGN.md

Luro02 commented 3 years ago

This might be obvious for some of you, but it has not been mentioned yet:

Before stabilizing the API (seems like this might take some time) one should consider waiting for https://github.com/rust-lang/rust/issues/44265:

use core::str::pattern::{ReverseSearcher, Searcher};

pub trait Pattern {
    type Searcher<'a>: Searcher<'a>;

    // lifetime can be elided:
    fn into_searcher(self, haystack: &str) -> Self::Searcher<'_>;

    //

    fn is_contained_in(self, haystack: &str) -> bool;

    fn is_prefix_of(self, haystack: &str) -> bool;

    fn is_suffix_of<'a>(self, haystack: &'a str) -> bool
    where
        Self::Searcher<'a>: ReverseSearcher<'a>;

    fn strip_prefix_of(self, haystack: &str) -> Option<&'_ str>;

    fn strip_suffix_of<'a>(self, haystack: &'a str) -> Option<&'a str>
    where
        Self::Searcher<'a>: ReverseSearcher<'a>;
}

I think using an associated lifetime for the Searcher makes more sense, because it depends on the lifetime of the haystack, which is passed as a parameter to the function.

I can not really think of any practical benefits of this approach, except for lifetime elision and patterns without lifetimes:

pub struct SomePattern;

impl Pattern for SomePattern {
    type Searcher<'a> = SomeSearcher<'a>;

    fn into_searcher(self, haystack: &str) -> Self::Searcher<'_> {
        SomeSearcher(haystack)
    }
}

pub struct SomeSearcher<'a>(&'a str);

unsafe impl<'a> Searcher<'a> for SomeSearcher<'a> {
    fn haystack(&self) -> &'a str {
        self.0
    }

    fn next(&mut self) -> SearchStep {
        unimplemented!()
    }
}
TonalidadeHidrica commented 2 years ago

Pattern is implemented for FnMut(char) -> bool. Is it allowed that this function returns different values for the same input? Specifically, say that we want to strip at most three a's from a string, then currently we may prepare a counter so that it returns true for input a while the counter is less than 3 or otherwise false. Is it an intended use? Particularly, is it guaranteed that FnMut(char) -> bool function is called for each character in order?

jhpratt commented 2 years ago

@TonalidadeHidrica I wouldn't rely on that myself. To me that's not an idiomatic use case; I'd expect the function to be pure.

Luro02 commented 2 years ago

@TonalidadeHidrica I wouldn't rely on that myself. To me that's not an idiomatic use case; I'd expect the function to be pure.

Why has this trait been implemented for FnMut? If the function should be pure, then wouldn't it be better to only implement it for Fn?

From the docs of FnMut:

Use FnMut as a bound when you want to accept a parameter of function-like type and need to call it repeatedly, while allowing it to mutate state. If you don’t want the parameter to mutate state, use Fn as a bound; if you don’t need to call it repeatedly, use FnOnce.

https://doc.rust-lang.org/std/ops/trait.FnMut.html

Luro02 commented 2 years ago

This seems like something that should be documented in the docs of the Pattern trait.

TonalidadeHidrica commented 2 years ago

I agree that this should be documented. I guess why this function is Fn instead of FnMut is that we may sometimes use something like "cache" of a large database of chars, the update of which requires mutations. The issue of use of FnMut is somewhat similar to "Iterator::map accepts FnMut, then can I update an external counter during the iteration of map? No, it's discouraged."

Mingun commented 2 years ago

Would it be better to ban the mutation, and if it is needed, use the internal mutation (Cell, etc.)?

jhpratt commented 2 years ago

Not possible due to pack-compatibility. Pattern, while unstable, is exposed in some stable APIs.

lizelive commented 2 years ago

what is the holdup for making this stable?

ckaran commented 2 years ago

I'm a little confused by the API; the API docs for std::str::pattern::Searcher state that:

This trait provides methods for searching for non-overlapping matches of a pattern starting from the front (left) of a string.

I see the following possible interpretations of this, and I want to be sure which is in use to prevent any ambiguity in implementations.

Greedy approach

If you're looking for the string aa in the haystack aaaaa, then you'll always get a sequence like the following if you require a greedy search:

Match(0,2)
Match(2,4)
Reject(4,5)
Done

Starts at the start of the string, but skips some letters because why not?

Greedy is overrated. Let's skip the first letter and match on the rest!

Reject(0,1)
Match(1,3)
Match(3,5)
Done

Getting the most matches is so overrated, how about we skip some?

The API definition doesn't require that the maximal number of matches be returned, so we could just ignore some matching sub-strings.

Reject(0,1)
Reject(1,2)
Match(2,4)
Reject(4,5)
Done

Suggestions for documentation improvements.

I'd like to suggest that matching is always greedy and always maximal. Roughly the following pseudo-code (don't use this in production, it will overflow your stack, and finite state machines are faster anyways):

// `results` is empty when this function is first called.
// `usize` is 0 when first called.
fn string_matcher(pattern: &str, haystack: &str, results: &mut Vec<SearchStep>, index: usize) {
    if haystack.len() < pattern.len() {
        if haystack.len() > 0 {
            results.push(SearchStep::Reject(index, index + haystack.len()));
        }
        results.push(SearchStep::Done);
    } else if pattern == &haystack[0..pattern.len()] {
        results.push(SearchStep::Match(index, index + pattern.len()));
        string_matcher(
            pattern,
            &haystack[pattern.len()..],
            results,
            index + pattern.len(),
        );
    } else {
        results.push(SearchStep::Reject(index, index + 1));
        string_matcher(pattern, &haystack[1..], results, index + 1);
    }
}

playground

ckaran commented 2 years ago

Also, what about when you want overlapping matches? I can see cases where I would want all overlapping matches in addition to what the Searcher API currently provides.