rust-lang / libs-team

The home of the library team

`impl core::str::Pattern for [&str; N]` #395

Open Jules-Bertholet opened 3 weeks ago

Jules-Bertholet commented 3 weeks ago

Proposal

Problem statement

I want to:

Motivating examples or use cases

Solution sketch

Implement core::str::Pattern for [&str; N] and &[&str; N]. The corresponding Searchers will implement ReverseSearcher, but not DoubleEndedSearcher.
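
For illustration, here is the kind of usage this would enable, by analogy with the existing [char; N] impls (hypothetical, since the impl does not exist):

let haystack = "key: value; rest";
let first = haystack.find([": ", "; "]); // Some(3), the leftmost match of any needle
let fields: Vec<&str> = haystack.split([": ", "; "]).collect(); // ["key", "value", "rest"]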

In the future, once negative impls are stable, implement !Borrow<str> for char, and extend the Pattern impls to apply to arrays/slices of any impl Borrow<str>.

The implementation would use a non-allocating algorithm with O(len(haystack) * |needles|) time complexity (hopefully even in the worst case, though I will need to consider further whether this is achievable), as opposed to something more involved like aho-corasick. The exact algorithm may also change dynamically based on the length of the haystack and/or each needle.
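
For concreteness, a minimal sketch of a naïve approach with that complexity (illustrative only, not the proposed implementation): run the single-substring search once per needle and report the leftmost hit.

fn find_any(haystack: &str, needles: &[&str]) -> Option<(usize, usize)> {
    // One single-substring search per needle; the leftmost match wins,
    // with ties going to the needle listed first.
    let mut best: Option<(usize, usize)> = None;
    for (i, &needle) in needles.iter().enumerate() {
        if let Some(pos) = haystack.find(needle) {
            if best.map_or(true, |(b, _)| pos < b) {
                best = Some((pos, i));
            }
        }
    }
    best
}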

Alternatives

This functionality could be implemented as a crate on crates.io, but that would not allow usage with all the std::str APIs that accept a Pattern, like find and split.

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capacity becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

Second, if there's a concrete solution:

BurntSushi commented 3 weeks ago

The implementation would use a relatively simple algorithm, that is non-allocating and optimized for haystacks that are not too large, as opposed to something more involved like aho-corasick.

This seems like an enormous drawback and performance footgun. And not just "oh, it isn't SIMD optimized": by imposing the non-allocating restriction, you probably have a serious time complexity footgun here too. I'm not even sure what the best you can do is for multi-substring search without either allocation or some kind of limit on the size/number of patterns.

std also suffers from being unable to amortize single substring searcher construction. I believe your proposal here means that multi-substring search would suffer from the same problem, but I imagine it will be exacerbated, depending on the implementation.

pitaj commented 3 weeks ago

Worst case would be if the haystack is highly repetitive and most of the pattern strings are very similar. Something like

"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab".find(&[
    "aaaaaaaaaaaaaab",
    "aaaaaaaaaaaaaac",
    "aaaaaaaaaaaaaad",
    "aaaaaaaaaaaaaae",
    "z", // kills common prefix optimization
])

aho-corasick would handle that quite well, but it's hard to envision how anything could efficiently perform that search without allocating.

Jules-Bertholet commented 3 weeks ago

The time complexity would be O(# strings in needle) * (time complexity of single-&str matching). That's perfectly intuitive: it's what you'd expect from a naïve algorithm, and it's perfectly adequate if "# strings in needle" is small.

BurntSushi commented 3 weeks ago

The time complexity would be O(# strings in needle) * (time complexity of single-&str matching). That's perfectly intuitive: it's what you'd expect from a naïve algorithm, and it's perfectly adequate if "# strings in needle" is small.

I'm pretty sure I would never approve of this coming into the standard library. You will just never convince me that this proposal won't lead to folks being constantly surprised by how slow it is. The fact that you keep qualifying this with "assuming everything is small" isn't a mitigation from my perspective, because there is nothing in your proposal preventing folks from using it with inputs that aren't small.

Jules-Bertholet commented 3 weeks ago

there is nothing in your proposal preventing folks from using it with inputs that aren't small.

A note in the docs would do the job. This wouldn't be the only API in std that prioritizes broad compatibility and small code size over maximum performance for every possible input.

BurntSushi commented 3 weeks ago

I don't agree it would do the job. And I don't know of any precedent that is even remotely in the same league as the perf footgun this would introduce.

the8472 commented 3 weeks ago

And I don't know of any precedent that is even remotely in the same league as the perf footgun this would introduce.

Well, there's the provider API where I'm unhappy about the perf implications. But that's still unstable. That said, people do demand MOAR SPEED for practically any method in the standard library. So yeah, starting with a bad implementation and no chance for improvement does not look good.

Jules-Bertholet commented 3 weeks ago

One alternative would be to not have any implementation for slices, only arrays; that way, we can store O(needles) state in the searcher, which would permit using a better algorithm.
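
A rough sketch of what the array-only shape buys (names hypothetical; the actual Searcher plumbing is omitted): the array length N is a const generic, so per-needle state can live in fixed-size arrays on the stack.

struct ArraySearcher<'h, 'n, const N: usize> {
    haystack: &'h str,
    position: usize,
    needles: [&'n str; N],
    per_needle_state: [u64; N], // e.g. one rolling hash per needle, no allocation
}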

BurntSushi commented 3 weeks ago

Well, there's the provider API where I'm unhappy about the perf implications. But that's still unstable.

I wouldn't accept this as a precedent until I could see how the perf footgun actually compares. The perf footgun being proposed here is not just a small thing. It's enormous. Consider searching 100 strings in a 1GB haystack. That's 100 full scans of that haystack. That's 99 more than is actually necessary.

And if you clear that precedent hurdle, I'd also need to see what the alternatives are. Even if the perf footgun is enormous, it has to be considered in the context of what else is actually possible. In this case, the API being proposed has a very simple alternative: use a third party crate.
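
For reference, the third-party route also sidesteps the construction problem, since the automaton is built once up front; e.g. with the aho-corasick crate (API as of its 1.x releases):

use aho_corasick::AhoCorasick;

// Build (and pay for) the searcher once...
let ac = AhoCorasick::new(["apple", "maple", "snapple"]).unwrap();
// ...then run any number of searches against it.
let mat = ac.find("Nobody likes maple in their apple flavored Snapple.").unwrap();
assert_eq!(mat.start(), 13);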

that way, we can store O(needles) state in the searcher, which would permit using a better algorithm.

O(needles) state is not correct as far as I can tell. What you'd get is needles.map(|n| n.len()).sum() bytes of memory. So if an algorithm required, say, 2 * (sum of lengths of all needles), then you'd be out of luck. Anyway, I'd need to see the algorithm being proposed to fully evaluate the proposal.

And that is not the only hurdle to clear here. The other hurdle is the construction cost. We have that problem with single substring search today, but I speculate it would be worse with multi-substring search.

Jules-Bertholet commented 3 weeks ago

What you'd get is needles.map(|n| n.len()).sum() bytes of memory.

~Yes, that's what I meant.~ (Edit: no it's not, O(needles) is really what I am referring to) And yes, it's not good enough for many common algorithms.

If we can find a workable algorithm, one option would be to only use it if the number or length of the needles is above some threshold, and otherwise use the naïve solution (which would presumably be faster to construct).

Jules-Bertholet commented 3 weeks ago

I'd need to see the algorithm being proposed to fully evaluate the proposal.

I think our best bet would likely be Rabin-Karp, with (# of needles) rolling hashes. Average-case complexity would be O(length of haystack * # of needles), but worst-case time complexity would be O(length of haystack * sum of needle lengths), same as the naïve implementation. Because we have no access to randomness in core, the implementation would not be HashDOS-resistant.

We could try to only require one hash per set of needles with the same length, perhaps; but that will only be a benefit when many of the needles have the same length, so would have to be weighed against the impact on implementation complexity and construction cost.

One other potential optimization would be to use the naïve approach instead of Rabin-Karp for very small needles.
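
A rough sketch of the rolling-hash idea, with all per-needle state in fixed-size stack arrays to mirror the [&str; N] proposal (the hash choice is illustrative and the match semantics simplified; this is not a proposed std implementation):

fn rk_find_any<const N: usize>(haystack: &str, needles: [&str; N]) -> Option<(usize, usize)> {
    const BASE: u64 = 257;
    let hay = haystack.as_bytes();
    let mut target = [0u64; N]; // hash of each needle
    let mut window = [0u64; N]; // rolling hash of the current window, per needle
    let mut pow = [1u64; N];    // BASE^(len - 1) per needle, for removing old bytes
    for (j, n) in needles.iter().enumerate() {
        for &b in n.as_bytes() {
            target[j] = target[j].wrapping_mul(BASE).wrapping_add(b as u64);
        }
        for _ in 1..n.len() {
            pow[j] = pow[j].wrapping_mul(BASE);
        }
    }
    for (i, &b) in hay.iter().enumerate() {
        for (j, n) in needles.iter().enumerate() {
            let len = n.len();
            if len == 0 || len > hay.len() {
                continue; // empty needles would need their own special case
            }
            if i >= len {
                // Remove the byte that just left needle j's window.
                let gone = (hay[i - len] as u64).wrapping_mul(pow[j]);
                window[j] = window[j].wrapping_sub(gone);
            }
            window[j] = window[j].wrapping_mul(BASE).wrapping_add(b as u64);
            // On a hash hit, verify the bytes to rule out collisions.
            if i + 1 >= len && window[j] == target[j] && &hay[i + 1 - len..i + 1] == n.as_bytes() {
                // Note: this reports the earliest-*ending* match; the real
                // Pattern semantics (leftmost match) would need more care.
                return Some((i + 1 - len, j));
            }
        }
    }
    None
}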

I've updated the OP to reflect this.

BurntSushi commented 3 weeks ago

I would need to see an implementation because I'm not sure how you implement Rabin-Karp with needles.map(|n| n.len()).sum() bytes of memory.

I want to be clear though, that even if you solve the time complexity problem, I've pointed out other hurdles that IMO still make this proposal dead in the water. (In other words, you might not want to spend your time trying to come up with the implementation I'm asking for unless you're intrinsically motivated to discover it.)

Jules-Bertholet commented 3 weeks ago

I would need to see an implementation because I'm not sure how you implement Rabin-Karp with needles.map(|n| n.len()).sum() bytes of memory.

First of all, I realize I made a mistake/misinterpreted your comment above, sorry. I really did mean O(needles) of memory, not needles.map(|n| n.len()).sum() bytes. O(needles) is what we would get with a searcher for [&str; N], as said searcher would itself be parameterized by N.

And secondly, I am realizing that we may not need to use Rabin-Karp (with its bad worst case); the two-way search algorithm that the single-string search functions use should also be viable, with some adaptation. I'll need to think about it more, though.

I've pointed out other hurdles that IMO still makes this proposal dead in the water.

If you are referring to long initialization time: as I mentioned earlier, we can dynamically choose, based on the length of each needle (and possibly also the haystack), which search algorithm to use for that needle. So for short needles, we can use the naïve implementation with its fast construction time. (This is something the single-string search could be doing also.)
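
As a sketch of what that per-needle choice might look like (the cutoff is invented purely for illustration):

enum Strategy {
    Naive,       // negligible construction cost; fine for short needles
    RollingHash, // scales better on long needles, but costs more to set up
}

fn pick(needle: &str) -> Strategy {
    const CUTOFF: usize = 8; // hypothetical threshold
    if needle.len() <= CUTOFF {
        Strategy::Naive
    } else {
        Strategy::RollingHash
    }
}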

BurntSushi commented 3 weeks ago

If you are referring to long initialization time: as I mentioned earlier, we can dynamically choose, based on the length of each needle (and possibly also the haystack), which search algorithm to use for that needle. So for short needles, we can use the naïve implementation with its fast construction time. (This is something the single-string search could be doing also.)

I'm well aware of the strategy of dynamically choosing an algorithm. regex, aho-corasick and memchr all make use of that strategy extensively. But it doesn't fix the underlying problem. It just makes some cases better, depending on what you're doing. And the very existence of trying to "make a choice" in the first place will itself be a hit to search times.

Jules-Bertholet commented 3 weeks ago

And the very existence of trying to "make a choice" in the first place will itself be a hit to search times.

A hit that we are forced to take anyway, due to the empty-needle ("") special case.

BurntSushi commented 3 weeks ago

You're only forced to pay O(n) for that check because of the API proposed. But it is not intrinsic to the problem domain. If you build a searcher from a fixed set of needles and are allowed to use it with different haystacks, then the construction phase can do O(n) work on the needles once. This includes determining if there is an empty needle. Then at search time, it can do O(1) work to handle the special case of the empty needle. But in the API you've proposed, you're forced to do O(n) work to handle that case. The API boxes you into a severe performance ceiling. And this is just the empty needle case. For anything other than naive search, you will also have to pay some cost to build a searcher.
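
To illustrate the amortization being described, a purely hypothetical API shape (MultiStrSearcher is invented; nothing like it exists in std):

// Construction does the O(n) needle work once, including the
// empty-needle check...
let searcher = MultiStrSearcher::new(["foo", "bar", ""]);
for line in lines {
    // ...so each search pays O(1) for the empty-needle special case,
    // instead of rescanning the needle list on every call.
    if let Some(pos) = searcher.find(line) {
        // ...
    }
}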

My objection here is on two completely different grounds. The first is that the API itself imposes a ceiling on performance because every single search (other than in an iterator) is forced to rebuild whatever is needed for the search. The second is that even if we didn't care about that construction cost, the implementation being restricted to zero-allocation is significant and severely limits what you're able to do.

The way to resolve the first objection is to champion a new substring search API that permits amortizing searcher construction. I do not see any other possible choice here. I do not see myself approving multi-substring search in std that doesn't allow amortizing searcher construction. And this is a bit of a catch-22, because amortizing searcher construction will likely make the API surface area much bigger and more complicated, which will in turn make it harder to drive consensus. And now we're getting into, "why not just use the aho-corasick crate instead?"

The way to resolve the second objection is to find a way around the no-allocation restriction. You are chasing a path that involves using stack space proportional to the number of needles (which seems dubious to me) and dynamic switching of the algorithms employed, but I personally don't see a way to make that work in a way that would be appropriate for std. IMO, the more promising route here would be to figure out how to remove the no-alloc restriction. But I don't know the feasibility of that, because it involves the core/alloc/std split.

Jules-Bertholet commented 3 weeks ago

The way to resolve the first objection is to champion a new substring search API that permits amortizing searcher construction. I do not see any other possible choice here.

I agree, but I also don't see why it should be a blocker to making any progress at all. As you note, the searcher construction problem already exists for the single-substring case. And the extra cost per needle is no greater for the API I am proposing than for the existing Pattern implementation for a single &str needle.

Also, the initial heuristic for whether to use the naïve algorithm would likely be just a check of the needle length, so there would only be a risk of major slowdown when using many long needles. That is perfectly acceptable, just as it's perfectly acceptable that the existing Pattern impl for [char; N] is suboptimal for large N (it does N equality checks per character in the haystack, instead of using a set data structure).

IMO, the more promising route here would be to figure out how to remove the no-alloc restriction.

No, I want an implementation that doesn't allocate. I want the implementation of Pattern for [&str; N] to be to the implementation for &str what the implementation for [char; N] is to that for char, i.e. more or less "do what you would do for a single needle, but for this whole list of needles instead." It's an intuitive and useful API that nicely complements what the standard library already offers. aho-corasick is great, but it does something different, which is not what I need, and not analogous to the existing Pattern implementations already in core.

BurntSushi commented 3 weeks ago

It's an intuitive and useful API

With an enormous perf footgun.

I agree, but I also don't see why it should be a blocker to making any progress at all. As you note, the searcher construction problem already exists for the single-substring case.

Because I don't want to make it worse. And I speculate that it will be exacerbated for the multiple needle case.

Otherwise, it looks like we're going in circles.

It's an intuitive and useful API that nicely complements what the standard library already offers. aho-corasick is great, but it does something different, which is not what I need, and not analogous to the existing Pattern implementations already in core.

I don't see anything you've proposed here that can't be done by aho-corasick. (Other than perhaps reverse searching?)

I would suggest that you go and prototype what you want in a crate. Once that's done, feel free to ping my handle for a review. Otherwise, I'm unsubscribing from this thread.

Jules-Bertholet commented 3 weeks ago

I would suggest that you go and prototype what you want in a crate.

Agreed, will do