rust-lang / rust

Tracking issue for string patterns #27721

Open alexcrichton opened 9 years ago

alexcrichton commented 9 years ago

(Link to original RFC: https://github.com/rust-lang/rfcs/pull/528)

This is a tracking issue for the unstable pattern feature in the standard library. We have many APIs which support the ability to search with any number of patterns generically within a string (e.g. substrings, characters, closures, etc), but implementing your own pattern (e.g. a regex) is not stable. It would be nice if these implementations could indeed be stable!
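For context, here is a small sketch of what already works on stable: the same `str::find` call accepts several built-in pattern types. The function name `find_with_each_pattern` and the sample input are made up for illustration; only the `find` calls themselves are standard library API.

```rust
/// Runs `str::find` with four of the built-in pattern types.
/// (Illustrative helper; the name is not from the standard library.)
fn find_with_each_pattern(s: &str) -> [Option<usize>; 4] {
    [
        s.find("value"),                      // &str: substring search
        s.find(';'),                          // char: single character
        s.find(|c: char| c.is_ascii_digit()), // closure: FnMut(char) -> bool
        s.find(&['=', ';'][..]),              // &[char]: any of the listed chars
    ]
}
```

Called with `"key=value; other=42"`, this returns the byte offsets 4, 9, 17 and 3. Writing a new type that can be passed to `find` in the same way is the part that still requires the unstable `Pattern` trait.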

Some open questions are:

cc @Kimundi

BurntSushi commented 2 years ago

The docs could certainly be improved. I'm not sure if "greedy" or "maximal" are the right words.

Overlapping matches are a bit of a niche case, and I don't think there is a compelling reason for the standard library to support them. Overlapping searches are available in the aho-corasick crate.

ckaran commented 2 years ago

@BurntSushi the main reason to support the overlapping case is that the searcher can then be specified to return all matches, including the overlapping ones. The user is then responsible for deciding which of the overlapping matches are the interesting ones. If the searcher implements the Iterator trait, then you can use filtering to get the parts you want.
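To make the distinction concrete, here is a minimal sketch comparing the standard library's non-overlapping `match_indices` with a hand-rolled overlapping search. Both helper names are hypothetical, and the overlapping version simply restarts `find` one character after each match; it is not how any std searcher is implemented.

```rust
/// Start offsets as reported by std: matches never overlap.
fn non_overlapping_match_indices(haystack: &str, needle: &str) -> Vec<usize> {
    haystack.match_indices(needle).map(|(i, _)| i).collect()
}

/// Start offsets of every occurrence, including overlapping ones.
/// Hypothetical helper: restarts the search one char after each match start.
fn overlapping_match_indices(haystack: &str, needle: &str) -> Vec<usize> {
    let mut out = Vec::new();
    if needle.is_empty() {
        return out; // an empty needle would match everywhere; skip it here
    }
    let mut start = 0;
    while let Some(pos) = haystack[start..].find(needle) {
        let abs = start + pos;
        out.push(abs);
        // Step past the first char of this match so the next search
        // begins on a char boundary and can overlap the current match.
        let step = haystack[abs..].chars().next().map_or(1, char::len_utf8);
        start = abs + step;
    }
    out
}
```

For the haystack `"0101010101"` and the needle `"0101"`, the first function yields offsets 0 and 4, while the second yields 0, 2, 4 and 6.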

BurntSushi commented 2 years ago

I don't think that's worth doing and likely has deep performance implications.

ckaran commented 2 years ago

> I don't think that's worth doing

I disagree, though I do think that it should be completely separate from the current API (different function, different trait, whatever is deemed best)

> and likely has deep performance implications.

Hah! I agree 110% with you on this! And it's the reason why having it as a separate API is likely the best way to do it.

BurntSushi commented 2 years ago

Overlapping searches are way way way too niche to put into std. If you want to convince folks otherwise, I would recommend giving more compelling reasons for why it should be in std.

ckaran commented 2 years ago

The best example I can give you off the top of my head is very niche, and likely not applicable to str.

I sometimes have to decode streams of bytes coming in from a receiver that can make errors[^1] because of clock skew, mismatched oscillator frequencies, and noise in general. These can show up as bit flips, missing bits, or extra bits. Despite this, I want to know when a legitimate byte stream is starting. The normal (and fast) way is to define some kind of known pattern that signals that a frame is starting, which I'm going to call the start of frame pattern[^2]. To make your own life simple, this pattern is going to be chosen to be highly unlikely to occur by accident in your environment, but it's also really, really easy to look for. One example might be just to have a stream of bits like 0101010101 as your start pattern.

Now here is where things get interesting; while you could use some form of forward error correction (FEC) code to encode the start of frame pattern, continuously decoding all incoming bits to look for the pattern is energy intensive, which means battery life goes down. What you want to do is find the probable start of a frame, and then start the computationally (and therefore power) expensive process of decoding bits only when you are pretty sure you've found the start of a frame. So, you don't bother with proper FEC of the frame pattern. Instead, you make your pattern simple, and your pattern matcher will be just as simple. If it sees a pattern that looks like it could be a start of frame, you turn on your full FEC decoder and start decoding bits until you either decide that you made a mistake, or you have a frame (checksums, etc. come later).

The issue is that the noise I mentioned earlier can show up anywhere, including at the head of the start of frame pattern. So instead of looking for the full 0101010101 start of frame pattern, you might just look for 0101 in overlapping substrings, starting a new FEC decode task as soon as you match the pattern[^3]. Which is where you need the overlapping pattern search.

All of that makes good sense in a byte stream, and that is where the windows method can be helpful. Does any of this make sense for a UTF-8 encoded string that is not subject to errors in encoding? Probably not. But, this is the best I could come up with on the spur of the moment for a practical use case.
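The overlapping search described above can be sketched with `windows` on a slice. This is only an illustration under simplifying assumptions: each received bit is stored as one `u8` (0 or 1), and the function name `candidate_frame_starts` is invented for this example.

```rust
/// Returns every index at which `pattern` occurs in `bits`, including
/// overlapping occurrences. Each index is a candidate start of frame.
/// Assumption for this sketch: one `u8` per received bit.
fn candidate_frame_starts(bits: &[u8], pattern: &[u8]) -> Vec<usize> {
    if pattern.is_empty() {
        return Vec::new(); // `windows(0)` would panic
    }
    bits.windows(pattern.len())
        .enumerate()
        .filter(|(_, window)| *window == pattern)
        .map(|(index, _)| index)
        .collect()
}
```

With the received bits `[1, 0, 1, 0, 1, 0, 1, 1]` and the short pattern `[0, 1, 0, 1]`, the candidates are indices 1 and 3. Because `windows` slides by one element at a time, overlapping candidates come for free, which is exactly what a non-overlapping string search would skip.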

[^1]: 'Receiver' in this case might be hardware, like a radio receiver. If the receiver is in a noisy environment, then it's constantly receiving bits, including bits that are just noise.

[^2]: I'm skipping so many details and algorithms that can be used under various conditions, it isn't even funny. If you know that background, just fill in the details in your head, if you don't know them, just ignore them. I'm just trying to give a very simplified example here.

[^3]: The assumption is that since the noise could be anywhere including in the start of frame section, you might have to start correcting for errors that occurred right at the start of your actual byte string.

BurntSushi commented 2 years ago

Yeah I totally grant that there exist use cases for overlapping search. That's not really what I'm looking for, although I appreciate you outlining your use case. What I'm trying to get at here is that they are not frequent enough to be in std. Frequency isn't our only criterion, but I can't see any other reason why std should care at all about overlapping search. If you want it to be in std, you really need to answer the question, "why can't you use a crate for it?" with a specific reason for this particular problem. (i.e., Not general complaints like "I don't want to add dependencies to my project.")

ckaran commented 2 years ago

You're right, on all counts. I don't have a good enough reason for why it should be in std and not some crate, so I'm fine with it being dropped.

That said, I would like to see the documentation clarified on which non-overlapping patterns need to be returned. I'm fine with the docs stating that you can return an arbitrary set of non-overlapping matches, I just want it to be 100% clear as to what is expected of implementors.

Fishrock123 commented 1 year ago

Could Pattern have a flag that could be set on whether to use eq_ignore_ascii_case for its comparisons?

(note: i do not know how Pattern works so maybe it's not possible idk. but it would be very handy!)

BurntSushi commented 1 year ago

@Fishrock123 It is doable, but substring search algorithms are usually not amenable to being adapted straightforwardly to support case insensitivity. (This isn't true for all of them, but I think is likely true for Two-Way at least, which is the algorithm currently used for substring search.) So in order to support such a flag, you'd probably need to dispatch to an entirely different algorithm.

Another option is the regex crate. It's a bit of a beefy dependency for just a simple ASCII case insensitive search, but it will do it for you. The aho-corasick crate also supports ASCII case insensitivity. While Aho-Corasick is typically used for multi-substring search, it can of course be used with just one substring. aho-corasick is a less beefy dependency than regex.
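If neither crate is an option, a std-only fallback can be sketched as below. It is a naive search that compares every window with `eq_ignore_ascii_case`, so it runs in O(n·m) time and is not the Two-Way algorithm; the function name is hypothetical.

```rust
/// Byte offset of the first ASCII-case-insensitive occurrence of `needle`.
/// Naive sketch: checks every window of the haystack, O(n * m) worst case.
fn find_ignore_ascii_case(haystack: &str, needle: &str) -> Option<usize> {
    if needle.is_empty() {
        return Some(0); // mirrors `str::find("")`
    }
    let needle = needle.as_bytes();
    haystack
        .as_bytes()
        .windows(needle.len())
        .position(|window| window.eq_ignore_ascii_case(needle))
}
```

Only ASCII letters are folded; non-ASCII bytes must match exactly, so a search for `"STRASSE"` will not find `"straße"`.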

cehteh commented 1 year ago

This issue has been open for over 7 years now; I, and likely a lot of other people, would like to use this in stable.

Would it be possible to move the unstable attributes from the root into the API methods instead? Then at least one could export it in a stable way, as '&str' already does (in stable). Example (illustration only, I leave the working bits out):

use std::str::pattern::Pattern;

struct MyStr {
    /*...*/
}

impl MyStr {
    fn from(s: &str) -> Self {...}
    fn as_str(&self) -> &str {...}

    // Forwards any pattern type to `str::split_once`, like `&str` itself does.
    pub fn split_once<'a, P: Pattern<'a>>(&'a self, delimiter: P) -> Option<(MyStr, MyStr)> {
        match self.as_str().split_once(delimiter) {
            Some((a, b)) => Some((Self::from(a), Self::from(b))),
            None => None,
        }
    }
}

StragaSevera commented 1 year ago

I agree, it is sad to see such an issue being abandoned.

drmason13 commented 1 year ago

Before stabilizing the API (seems like this might take some time) one should consider waiting for #44265:

use core::str::pattern::{ReverseSearcher, Searcher};

pub trait Pattern {
    type Searcher<'a>: Searcher<'a>;
    // ...
}

Is GAT in its current state suitable for this? I know it has some limitations.

If it is, then I imagine it must be worth considering this API while we are still unstable?

Related to this, I'm writing a function that ultimately searches a &str, the obvious (to me) signature was:

fn until(pattern: impl Pattern);

But I guess it would need some lifetime generic with the current API.
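To see how a GAT-based design removes the lifetime from such signatures, here is a self-contained sketch that compiles on stable. Everything in it is hypothetical: `SketchPattern` and `SketchSearcher` are local stand-ins, not the real `core::str::pattern` traits, and `until` takes the haystack as an extra argument purely for the sake of the example.

```rust
/// Stand-in for a searcher: yields (start, end) byte ranges of matches.
trait SketchSearcher<'a> {
    fn next_match(&mut self) -> Option<(usize, usize)>;
}

/// Stand-in for a GAT-based pattern trait: the lifetime lives on the
/// associated type, not on the trait itself.
trait SketchPattern {
    type Searcher<'a>: SketchSearcher<'a>;
    fn into_searcher<'a>(self, haystack: &'a str) -> Self::Searcher<'a>;
}

struct CharSearcher<'a> {
    haystack: &'a str,
    needle: char,
    pos: usize, // always on a char boundary
}

impl<'a> SketchSearcher<'a> for CharSearcher<'a> {
    fn next_match(&mut self) -> Option<(usize, usize)> {
        let rel = self.haystack[self.pos..].find(self.needle)?;
        let start = self.pos + rel;
        let end = start + self.needle.len_utf8();
        self.pos = end;
        Some((start, end))
    }
}

impl SketchPattern for char {
    type Searcher<'a> = CharSearcher<'a>;
    fn into_searcher<'a>(self, haystack: &'a str) -> CharSearcher<'a> {
        CharSearcher { haystack, needle: self, pos: 0 }
    }
}

/// Returns the part of `haystack` before the first match, or all of it.
/// Note that `impl SketchPattern` needs no lifetime parameter here.
fn until<'a>(haystack: &'a str, pattern: impl SketchPattern) -> &'a str {
    match pattern.into_searcher(haystack).next_match() {
        Some((start, _)) => &haystack[..start],
        None => haystack,
    }
}
```

With this shape, `until("key=value", '=')` returns `"key"`. Whether the real trait can adopt the same structure depends on the GAT limitations asked about above.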

tgross35 commented 11 months ago

I think the summary here is that this API needs rework and somebody to champion it. There was some good discussion at #71780, including a rough proposal from @withoutboats in https://github.com/rust-lang/rust/pull/71780#issuecomment-688250368. This kinda sorta echoes @Luro02's outline in https://github.com/rust-lang/rust/issues/27721#issuecomment-828271222 (it seems like GATs provide us with a more ergonomic solution in any case)

Another thing to keep in mind is that slice patterns were removed (https://github.com/rust-lang/rust/pull/76901#issuecomment-880169952) but we may want some way to work with &[u8] byte strings. It is a bit of a pain point that many std APIs require a UTF-8 &str when it isn't always needed, meaning that there is a runtime cost for str::from_utf8 to do things without unsafe when you have mostly-but-maybe-not-completely UTF-8 sequences (e.g., OsStr / Read interaction)
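The pain point can be illustrated with a short sketch. The helper names are invented for this example: one path validates the whole buffer as UTF-8 before it can use `str::find`, the other searches the raw bytes directly with a naive window comparison.

```rust
/// Validating path: the entire buffer must be UTF-8 before `str::find`
/// can be used, so a single stray byte makes the search unavailable.
fn find_via_str(haystack: &[u8], needle: &str) -> Option<usize> {
    std::str::from_utf8(haystack).ok()?.find(needle)
}

/// Byte path: no validation pass, works on mostly-UTF-8 data too.
/// Naive window comparison, used here only to keep the sketch std-only.
fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}
```

For the input `b"abc\xFFkey=value"`, the validating path returns `None` because of the stray `0xFF` byte, while the byte path finds `key` at offset 4.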

So the next steps forward, probably:

  1. Somebody puts forth a design proposal. I don't think this needs to be an RFC since the concept was already accepted, but it has been so long that I think we just need a from-scratch design with justification and documented limitations. An ACP is probably a good place for this (ACPs are just an issue template at https://github.com/rust-lang/libs-team, link it here if you post one)
  2. Implement that proposal
  3. Revisit stabilization after it has been around for a while

It is unfortunate that we more or less have to go back to square one with stabilization, but there have been a lot of lessons learned and better ways to do things since the 2014 RFC (a decade!). Really this is probably just in need of somebody to take charge of the redesign and push everything forward.

cehteh commented 11 months ago

All I really asked for above is to stabilize the existence of the Pattern API; that would already address a lot of problems by removing unstable bits from the stable Rust stdlib API.

If the API/implementation behind it needs more work, that's OK. But honestly, after that many years and with many people relying on patterns, overly big changes would be quite surprising.

tgross35 commented 11 months ago

> All I really asked for above is to stabilize the existence of the Pattern API; that would already address a lot of problems by removing unstable bits from the stable Rust stdlib API.

That would of course be nice, but we don’t want to do that until knowing for sure that we won’t need to change generics from what there currently is (a single lifetime). Probably unlikely, but there’s no way of knowing without a concrete proposal.

> If the API/implementation behind it needs more work, that's OK. But honestly, after that many years and with many people relying on patterns, overly big changes would be quite surprising.

I think it’s the opposite: all the discussion here, the very long time with no stabilization, and the proposed replacements I linked in https://github.com/rust-lang/rust/issues/27721#issuecomment-1790121646 seem to indicate that nobody is happy enough with this API as-is. This feature needs a champion who is willing to experiment and push things along.

Phosphorus-M commented 7 months ago

Do we have some news about this feature?

Luro02 commented 7 months ago

> Do we have some news about this feature?

Please stop polluting the issue tracker. If there is an update, someone will link to this issue.

mqudsi commented 4 months ago

I realize there are a lot of issues with stabilizing the pattern feature itself, but may I suggest a possible middle ground? Would it be possible to detect particular hand-written search idioms and optimize (a subset of) them, possibly via some specific hint, or perhaps only when iter is used instead of a for loop, so that they internally use some version of TwoWaySearcher (or anything, really)?

There are no guarantees for compiler optimizations, there's no specific API we would have to stabilize, and we don't need to handle all the cases (neither all at once nor even eventually). It could still give a decent performance boost to certain code, and be an asset both in the short term (until a stable pattern matching API becomes available) and in the long term (code not using the pattern matching API could still benefit).

Not sure how doable this is in the technical sense, but at least on paper it might be worth considering?