rust-lang / regex

An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.

https://docs.rs/regex

Apache License 2.0

3.49k stars 438 forks source link

execute a regex on text streams #425

Open cessen opened 6 years ago

cessen commented 6 years ago

This is more-or-less a continuation of issue #25 (most of which is actually here).

Preface

I don't personally have an urgent need for this functionality, but I do think it would be useful and would make the regex crate even more powerful and flexible. I also have a motivating use-case that I didn't see mentioned in the previous issue.

More importantly, though, I think I have a reasonable design that would handle all the relevant use-cases for streaming regex--or at least would make the regex crate not the limiting/blocking factor. I don't have the time/energy to work on implementing it myself, so please take this proposal with the appropriate amount of salt. It's more of a thought and a "hey, I think this design might work", than anything else.

And most importantly: thanks so much to everyone who has put time and effort into contributing to the regex crate! It is no coincidence that it has become such staple of the Rust ecosystem. It's a great piece of software!

My use-case

I occasionally hack on a toy text editor project of mine, and this editor uses ropes as its in-memory text data structure. The relevant implication of this is that text in my editor is split over non-contiguous chunks of memory. Since the regex crate only works on contiguous strings, that means I can't use it to perform searches on text in my editor. (Unless, I suppose, I copy the text wholesale into a contiguous chunk of memory just to perform the search on that copy. But that seems overkill and wouldn't make sense for larger texts.)

Proposal

In the previous issue discussing this topic, the main problem noted was that the regex crate would have to allocate (e.g. a String) to return the contents of matches from an arbitrary stream. My proposed solution essentially amounts to: don't return the content of the match at all, and instead only return the byte offsets. It is then the responsibility of the client code to fetch the actual contents. For example, my editor would use its own rope APIs to fetch the contents (or replace them, or whatever), completely independent of the regex crate.

The current API that returns the contents along with offsets could (and probably should) still be included as a convenience for performing regex on contiguous slices. But the "raw" or "low level" API would only yield byte offsets, allowing for a wider range of use-cases.

Layered API

I'm imagining there would be three "layers" to the API, of increasing levels of convenience and decreasing levels of flexibility:

1. Feed chunks of bytes manually, handling matches as we go

let re = Regex::new("...");
let mut matcher = re::streaming_matcher();

for match in matcher.consume("Some bytes from a larger stream of data that") {
    my_data_source.do_something(match.start(), match.end());
    // Note: there is no match.as_str() for this API.
}

for match in matcher.consume(" we don't know how much there might be in total.") {
    // ...
}

2. Give regex an iterator that yields bytes

let re = Regex::new("...");

for match in re::find_iter_from_bytes_iter("Non-contiguous bytes that can be iterated over.".bytes()) {
    // Again, only match.start() and match.end() for this API.
    // ...
}

3. Give regex a slice, just like the current API

let re = Regex::new("...");

for match in re.find_iter("A directly passed slice of data.") {
    match.as_str(); // In this API you can get the str slice of the match.
    // ...
}

I'm of course not suggesting naming schemes here, or even the precise way that these API's should work. I'm just trying to illustrate the idea. :-)

Note that API 2 above addresses my use-case just fine. But API 1 provides even more flexibility for other use-cases.

Things this doesn't address

BurntSushi noted the following in the previous discussion (referencing Go's streaming regex support):

The most interesting bit there is that the regexp functions on readers finds precisely one match and returns byte indices into that stream. In the Go world, if I wanted to use that and actually retrieve the text, I'd write my own "middleware" reader to save text as its being read. Then I could use that after the first match to grab the matched text. Then reset the "middleware" reader and start searching again.

The problem with that approach: what if there's never a match? You'd end up piling the whole stream into memory (likely defeating the purpose of using a stream).

This proposal doesn't solve that problem, but rather side-steps it, making it the responsibility of the client code to decide how to handle it (or not). Practically speaking, this isn't actually an API problem but rather is a fundamental problem with unbounded streaming searches.

IMO, it doesn't make sense to keep this functionality out of the the regex crate because of this issue, because the issue is by its nature outside of the regex crate. The important thing is to design the API such that people can implement their own domain-specific solutions in the client code.

As an aside: API 1 above could be enhanced to provide the length of the longest potential match so far. For clarity of what I mean, here is an example of what that might look like and how it could be used:

// ...
let mut matcher = re::streaming_matcher();
let mut buffer = String::new();
let mut offset = 0;

loop {
    // Get next chunk of streaming data
    buffer.push(streaming_data_source.get_data());

    // Handle matches
    for match in matcher.consume(&buffer) {
        match_contents = &buffer[(match.start()-offset)..(match.end()-offset)];
        // ...
    }

    // Shrink buffer to smallest size that is still guaranteed to contain all data for
    // potential future matches.
    let pot_match_len = matcher.longest_potential_match_len();
    offset += buffer.len() - pot_match_len;
    buffer.truncate_from_front(pot_match_len);
}

That would allow client code to at least only hold onto the minimal amount of data. Nevertheless, that only mitigates the problem, since you can still have regex's that match unbounded amounts of data.

BurntSushi commented 6 years ago

Thanks for taking the time to write this up!

I'm afraid the issue tracker (and the issue you linked) isn't entirely up to date with my opinion on this particular feature. Basically, I agree that this would be useful to have. API design is certainly one aspect of this that needs to be solved, and I thank you for spending time on that. However, the API design is just the surface. The far more difficult aspect of providing a feature like this is the actual implementation work required. My estimation of the level of difficulty for this is roughly "a rewrite of every single regex engine." That's about as bad as it gets and it's why this particular feature hasn't happened yet. The problem is that the assumption that a regex engine has access to the entire buffer on which it can match is thoroughly baked in across all of the matching routines. For example, as a trivially simple thing to understand (but far from the only issue), consider that none of the existing engines have the ability to pause execution of a match at an arbitrary point, and then pick it back up again. Such a thing is necessary for a streaming match mode. (There is an interesting compromise point here. For example, the regex crate could accept an R: io::Read and simply search that until it finds a match. This would avoid needing the aforementioned pause/resume functionality, and I think it could be added without rewriting the regex crate, but it would be a significant complication IMO. However, there are problems with this approach, see below.)

Getting back to API design, I think we would want to take a very strong look at Hyperscan, whose documentation can be found here. This particular regex library is probably the fastest in the world, is built on finite automata, and specifically specializes in matching regexes on streams. You'll notice that an interesting knob on their API is the setting of whether to report the start of a match or not. This is important and particularly relevant to the regex crate. In particular, when running the DFA (the fastest regex engine), it is not possible to determine both the start and end of a match in a single pass in the general case. The first pass searches forwards through the string for the end of the match. The second pass then uses a reversed DFA and searches backwards in the text from the end of the match to find the start of the match. Cool, right? And it unfortunately requires that the entire match be buffered in memory. That throws a wrench in things, because now you need to handle the case of "what happens when a match is 2GB?" There's no universal right answer, so you need to expose knobs like Hyperscan does.

The start-of-match stuff throws a wrench in my idea above as well. Even if the regex crate provided an API to search an R: io::Read, it would in turn need to do one of the following things with respect to reporting match locations:

Only provide end of match indices.
Always provide start of match indices, at the cost of running a slower regex engine. (The NFA based engines can find the start and end of match in a single pass.) Note that this is the choice that Go's regex engine makes. Also note that "slow" here approximately means "an order of magnitude slower."
Always provide start of match indices, at the cost of buffering the entire R: io::Read into memory. (Which kind of defeats the purpose of providing this API at all.)

You can also see that RE2 doesn't have such an API either, mostly for the reasons I've already outlined here. That issue does note another problem, in particular, managing the state of the DFA, but that just falls under the bigger problem of "how do you pause/resume each regex engine" that I mentioned above.

For Go's regexp engine, there is someone working on adding a DFA, and even in their work, they defer to the NFA when given a rune reader to use with the DFA. In other words, even with the DFA, that particular implementation chooses (2).

What am I trying to say here? What I'm saying is that a number of people who specialize in this area (including myself) have come to the same conclusion: the feature you're asking for is freaking hard to implement at multiple levels, at least in a fast finite automaton based regex engine. So this isn't really something that we can just think ourselves out of here. The paths available are known, they are just a ton of work.

With that said, this is something that I think would be very cool to have. My plans for the future of regex revolve around one goal: "make it easier to add and maintain new optimizations." This leaves me yearning for a more refined internal architecture, which also incidentally amounts to a rewrite. In the course of doing this, my plan was to re-evaluate the streaming feature and see if I could figure out how to do it. However, unfortunately, the time scale on this project is probably best measured in years, so it won't help you any time soon.

If you need this feature yesterday, then this is the best thing I can come up with, if you're intent on sticking with pure Rust:

Convince yourself that you're OK with a slow regexp implementation.
Fork the regex crate and remove everything except for the Pike VM. Remove the DFA, the literal optimizations and the backtracker. Make it pass the test suite.
Modify the Pike VM to accept an R: io::Read.

Basically, this amounts to "make the same trade off as Go." You could lobby me to make this a feature of the regex crate (with the warning that it will always run slowly), but I'm not particularly inclined to do that because it is still quite a bit of work and I'd rather hold off on adding such things until I more thoroughly understand the problem domain. It's basically a matter of priorities. I don't want to spend a couple months of my time adding a feature that has known bad performance, with no easy route to fixing it.

Apologies for the bad news!

BurntSushi commented 6 years ago

@cessen What do you think about closing this ticket as a duplicate of #386?

cessen commented 6 years ago

@BurntSushi Wow, thanks so much for the thorough explanation! Indeed, I didn't realize there were so many other issues involved. The older discussion I looked at made it seem like it would mostly be easy, but it just wasn't clear how the API could work. So I appreciate you clarifying all this!

I think I was also perhaps mislead by my experience with the C++ standard library, e.g. http://en.cppreference.com/w/cpp/regex It doesn't (as far as I can tell) support streaming, but it does take an iterator rather than a slice to do the search. But indeed the iterator is bidrectional.

(As an aside, it seems slightly strange to me that the Rust stdlib doesn't have a bidirectional iterator trait. It does have double-ended, but that's different. Maybe I should write up a Rust RFC.)

As I said before, my own use-case is not urgent. It's only for a toy project anyway. No one is going to die or lose their job. ;-) But I appreciate your "if you need this yesterday" explanation. I may end up doing that at some point, if/when I get around to adding regex to my editor.

@cessen What do you think about closing this ticket as a duplicate of #386?

Ah! I feel quite silly, now! I didn't notice that ticket. Apologies.

Having said that, I think this ticket actually covers a super-set of #386. #386 is essentially just API 2 in this write-up, and doesn't cover the streaming case (API 1). And although my personal use-case only needs API 2, I was intentionally trying to cover the wider range of use-cases covered in #25.

So I guess it depends on how relevant you feel the streaming use-case is?

In any case, I won't be at all offended if you close this as duplicate. :-)

BurntSushi commented 6 years ago

Sounds good to me! I closed out #386. :-)

AlbertoGP commented 6 years ago

That was a greatly useful explanation for me. I'm writing a tool that needs string matching although not general regexps, and have been looking at three alternatives:

use the regex crate
build on the aho_corasick crate (I know it's used inside regex)
fully custom pattern matcher (have done this before, which is why I'm first checking whether I could use yours instead for this specific application)

The ideal for me would be a RegexSet that would give me not just the indexes of matched patterns but also the (start,end) indexes, or rather the (start,end) of the first match, and that could be paused/restarted to feed it the input in blocks, not all at once. [EDIT: just noticed that the "match first" RegexSet feature is already requested at #259]

I see now why all that is not possible at the moment, and even if implemented it would be slower than what we already have.

Thanks!

BurntSushi commented 6 years ago

I think I was also perhaps mislead by my experience with the C++ standard library, e.g. http://en.cppreference.com/w/cpp/regex It doesn't (as far as I can tell) support streaming, but it does take an iterator rather than a slice to do the search. But indeed the iterator is bidrectional.

I didn't respond to this before, but to be clear, C++'s standard library regex engine is backtracking AFAIK, which has a completely different set of trade offs in this space. If you asked me to do this feature request in a backtracking engine, my first thought to you would be "forget everything I just said." :)

cessen commented 6 years ago

@BurntSushi Yeah, that totally makes sense. Thanks!

Also, looking at pikevm.rs, my initial reaction (though I may be misunderstanding some things) is that it seems like turning the main matching loop into something that can be paused/resumed wouldn't be too much work.

Although I don't think I have the time/energy for it right now (I'm putting most of my spare time into Ropey), if I get around to it would you be offended if I took a crack at creating a separate crate to accommodate the use-cases outlined in this issue, using just the slower NFA engine from this crate? My intent in doing so would be twofold:

Give an immediate (even if not optimal) solution for people with these use-cases.
Better explore the API design space.

This would all be with the idea that these use-cases would be folded back into this crate whenever you get the time to tackle your bigger plans for it, and my fork would be then deprecated. But it would give a stop-gap solution in the mean time.

BurntSushi commented 6 years ago

@cessen I would not be offended. :) In fact, I suggested that exact idea above I think. You are right if course that the pikevm is more amenable to this change. The bounded backtracker might also be capable of handling it, but I haven't thought about it.

cessen commented 6 years ago

Yup! I was double-checking that you were okay with it being a published crate, rather than just an in-project-repo sort of thing. Thanks much!

If I get started on this at some point, I'll post here.

cessen commented 6 years ago

I've started poking at this in https://github.com/cessen/streaming_regex

So far I've ripped out everything except the PikeVM engine, and also ripped out the byte regex. My goal is to get the simple case working, and then I can start adding things back in (like the literal matcher, byte regex, etc.).

Now that I've gotten to this point, it's really obvious what you meant @BurntSushi , regarding everything needing to be re-done to make this possible!

The first thing I'm going to try is to rework the PikeVM engine so that it incrementally takes a byte at a time as input. I think this can work even in the unicode case by having a small four-byte buffer to collect the bytes of unicode scalar values, and only execute regex instructions once a full value has been collected. Hopefully that can be done relatively efficiently.

Once that's done, then building on that for incremental input will hopefully be relatively straight-forward (fingers crossed).

Does that sound like a reasonable approach, @BurntSushi ?

BurntSushi commented 6 years ago

@cessen Indeed it does! Exciting. :-)

cessen commented 6 years ago

@BurntSushi I'm getting to the point where I have some more solid API proposals, largely for internal API's at this point (e.g. the RegularExpression trait). Since I would like this work to be applicable to this crate at some point in the future (even if not directly pulled), I would love to discuss it with you. Should I do that here, or make an issue on my fork and ping you from there?

BurntSushi commented 6 years ago

I wouldn't bother honestly. I would do whatever is natural for you. I expect the current internals to be completely redone before something like streaming functionality could get merged.

I'm happy to be a sounding board of course!

cessen commented 6 years ago

Oh sure, I'm not expecting it to be mergeable. But I'm hoping that the architecture of this fork will be relevant to your rewrite in the future, so that any lessons learned can be applied. So I'd rather not go off in some direction that's completely tangential to (or fundamentally clashing with) what you have in mind for the future. If that makes sense?

(Edited to add the "not" before "expecting" above. Really bad typo...)

BurntSushi commented 6 years ago

Yeah I think so. I'm happy to give feedback as time permits. :)

cessen commented 6 years ago

Awesome, thanks! And, of course, I don't mean to squeeze you for time. Apologies for coming off a bit presumptuous earlier--I didn't mean it that way.

I think I'll post here, if that's okay. Let me know if you'd prefer elsewhere!

cessen commented 6 years ago

So, the first thing is a bit bigger-picture: I think we'll have a lot more wiggle-room to experiment with fast incremental scanning approaches if we keep the incremental APIs chunk-based rather than byte-based.

This basically amounts to changing API 2 in my original proposal above to take an iterator over byte slices instead of an iterator over bytes. I think that still covers all of the real use-cases. (And if someone really needs to use a byte iterator with the API, they can do their own buffering to pass larger chunks.)

The relevance to what I'm doing right now is that I'm thinking of changing the RegularExpression trait in re_trait.rs to be oriented around sequential chunk input. This mostly involves adding something like a is_last_chunk parameter to the various *_at() methods. It might look something like this:

pub trait RegularExpression {
    fn is_match_at(
        &self,
        chunk: &[u8],           // Analagous to `text` in your code
        offset_in_chunk: usize, // Analagous to `start` in your code
        is_last_chunk: bool,
    ) -> bool;

    fn find_at(
        &self,
        chunk: &[u8],
        offset_in_chunk: usize,
        is_last_chunk: bool,
    ) -> Option<(usize, usize)>;

    // etc...
}

Calling code would pass a chunk at a time, and indicate via is_last_chunk if it's the end of input. This approach puts the onus on implementors of RegularExpression to keep track of how many bytes from start-of-input have already been consumed, but I think that's pretty reasonable...?

The *_iter() methods would be passed iterators over text chunks, and Iterator::size_hint() would be used internally to know when we're at the last chunk:

pub trait RegularExpression {
    // The above stuff

    fn find_iter<I: Iterator<Item=&[u8]>> (
        self,
        text: I
    ) -> Matches<Self> {
        // ...
    }

    // etc...
}

One of the nice things about this approach is that a single contiguous text is just a special case: for find_at() et al. you just pass the text as a byte slice with is_last_chunk = true. And for the iterator methods, you just pass [text].iter().

Using DFA internally on contiguous text falls naturally out of this as well, since we can switch to DFA when we know we're on the last chunk. And this gives us plenty of room to experiment with approaches to faster incremental scanning.

(Incidentally, the reason I'm focusing on the RegularExpression trait here is that it seems to be the glue between the higher-level and lower-level APIs, so I'm thinking it will have a lot of influence over the implementations of both. But I could be wrong about that!)

BurntSushi commented 6 years ago

@cessen Everything you said sounds pretty sensible to me. Here are some random thoughts:

Using size_hint on iterators can't be assumed to be correct. You can use a peekable iterator (or equivalent).
You might consider alternatives from the is_last_chunk construction. For example, you could state that "passing an empty chunk indicates end of input."
I think consideration should be given to io::Read as well. I believe it's possible to write an adapter on top of your proposed API that takes an arbitrary io::Read and executes a search, right? If so, that's good. Just wanted to put it on your radar. :)

cessen commented 6 years ago

Thanks for the feedback!

As far as I understand from the documentation, size_hint can't be assumed correct for purposes of memory safety, but can be assumed correct for other purposes. Specifically, "[...]That said, the implementation should provide a correct estimation, because otherwise it would be a violation of the trait's protocol." Of course, I may be interpreting that incorrectly.
Yeah, I was thinking about that too. The down-side to that approach is then we don't know when we're on the last chunk while processing it, which means we lose the ability to e.g. switch to DFA for the last chunk. I agree that the is_last_chunk parameter is awkward and clunky, but I think the benefit may be worth it if it lets other things "just work". However, I'm certainly open to other approaches.
Absolutely! I think that's another benefit of the chunk-based approach, actually: it maps nicely to how io::Read::read() works. It should be pretty trivial to build an io::Read based API on top of API 1 in the OP proposal. In any case, to put you at ease: I also agree it's important to support that, and I'm definitely approaching things with that in mind.

BurntSushi commented 6 years ago

@cessen

The key part of the docs is this line: "The default implementation returns (0, None) which is correct for any iterator." The part you quoted is when the reported lower/upper bounds are wrong. But a 0 lower bound is never wrong and an unknown upper bound is also never wrong. An example of something that's wrong is returning 1 as a lower bound for an iterator over an empty slice.

cessen commented 6 years ago

Yes, that was my understanding as well. I didn't mean to imply that size_hint guarantees that we know when we're on the last chunk, but rather that when it's indicated that we can know, it's correct and we can take advantage of it.

cessen commented 6 years ago

Or, more specifically, if we reach the upper bound (if one is given) we can set is_last_chunk to true.

We would still have to do something like pass an empty chunk at the end when the upper bound isn't given or when we reach the end before the upper bound.

cessen commented 6 years ago

So, regardless of all of this, I think your suggestion to use the peekable iterator is a better idea. For some reason I thought not all iterators could be made peekable, but apparently they can. So that would be a great approach!

cessen commented 6 years ago

Quick status report:

The PikeVM engine now takes a single token at a time, and doesn't access the Input anymore. Essentially, that means that it can be built off of to do streaming input now, though for utf8 it has to be passed an entire scalar value at a time.

Next step is to get it to take just a single byte at a time.

One thing I've noticed is that I think the PikeVM can be made to handle Unicode and byte data use-cases at the same time, even within the same regex (assuming the syntax supports making a distinction). I'm guessing with the DFA that's not possible in a reasonable way, because it can only be in one state at a time...? It's not particularly relevant to the use-cases I'm trying to accommodate, but I think it's interesting nonetheless.

Edit: @BurntSushi Also, gotta say, all the unit tests you wrote are a life-saver! Would never have been able to get to this point without them!

BurntSushi commented 6 years ago

One thing I've noticed is that I think the PikeVM can be made to handle Unicode and byte data use-cases at the same time, even within the same regex (assuming the syntax supports making a distinction)

It does indeed. Regexes like .+(?-u:.+) (which matches any sequence of Unicode scalar values followed by any sequence bytes) are perfectly valid for the bytes::Regex type. All of the regex engines can handle this case. Right now, the opcodes include both Unicode specific opcodes and byte-specific opcodes. Technically, the Unicode specific opcodes are not necessary, but I believe they might be faster in some cases. Or at the very least, the full performance profile isn't well understood by me, so I've left them there.

I'm guessing with the DFA that's not possible in a reasonable way, because it can only be in one state at a time...? It's not particularly relevant to the use-cases I'm trying to accommodate, but I think it's interesting nonetheless.

The DFA is always byte-at-a-time.

Also, gotta say, all the unit tests you wrote are a life-saver! Would never have been able to get to this point without them!

:-)

cessen commented 6 years ago

The DFA is always byte-at-a-time.

Oh, interesting. Is the Regex/bytes::Regex divide purely for the benefit of the PikeVM, then? In which case, once I make it byte-based, I could merge the two?

BurntSushi commented 6 years ago

Yes. And probably the backtracker. The Unicode instructions also use less space, but that is a specious argument because if we removed them, we might be able to reduce the total number of copies of the program bytecode from 3 to 2. (I'm kind of just riffing here based off memory.)

sanmai-NL commented 6 years ago

@cessen: thanks for your great work! I was wondering how your project is progressing? Have you perhaps planned towards some milestone?

cessen commented 6 years ago

@sanmai-NL Sorry for my lack of reports! I haven't worked on it since my last comment in this thread, so that's still the current status. I didn't stop for any particular reason, other than getting distracted by other things in my life.

I don't know when I'll pick it back up again, so feel free to take the work I've done so far and build on it. I think I stopped at a pretty reasonable point, having gotten the PikeVM to work incrementally one token at a time. The next steps are basically:

Get the PikeVM to work one byte at a time (optional if you only want unicode regex, IIRC).
Build an incremental regex API on top of the PikeVM.

In fact, if/when I pick it back up again, I might reverse those steps anyway. Might be more satisfying to get something fully working first, and then generalize to byte regex after.

BurntSushi commented 5 years ago

For anyone watching this issue, I just released regex-automata, which is a lower level regex library. One of its features is the exposure of a DFA trait, which allows one to interact with a regex like you would with a traditional DFA. The search runtime is specifically lightweight, never allocates and never mutates. This makes it possible to write search routines on streaming input. However, while the crate doesn't yet expose any specific streaming APIs, they are possible to write in a first class way. Here is an example of the is_match search routine, which I think is fairly simple. In the future, I think adding streaming support directly to the crate would be a good idea, but it's a lot of work and wanted to have an opportunity for some iteration first.

Before getting too excited, you'll want to check out the list of differences between regex and regex-automata to see whether it's possible to use it. It's sadly probably not good enough for supporting a user facing text editor, for example, likely because of poor regex compile times.

cessen commented 5 years ago

Woah, awesome!

It's sadly probably not good enough for supporting a user facing text editor, for example, likely because of poor regex compile times.

Looking at the APIs, it looks like you have a "sparse" version that compiles faster and takes up less memory, but searches slower. Do you have any numbers on how much faster the sparse compiles are? Would they be usable in an interactive context?

BurntSushi commented 5 years ago

@cessen It would be helpful if you could point me to the place in the docs that mislead you to believe that a sparse DFA compiles more quickly. :-) A sparse DFA is internally built by first compiling a dense DFA and then converting that to a sparse DFA. Going to a sparse DFA would be quicker, but not an order of magnitude faster, and would increase code complexity.

Building out a full DFA where users might use large Unicode character classes without giving it a second thought is really inherently incompatible with fast compile times, as far as I can tell.

cessen commented 5 years ago

Oh, oops! Looking at the docs again, I have no idea how I got that idea in my head. Sorry about that.

This does make me think, though, that having a similarly low-level pike-style NFA regex library (which I assume would compile much faster) might be useful. I don't mean that as a suggestion for you, just musing to myself.

BurntSushi commented 5 years ago

This does make me think, though, that having a similarly low-level pike-style NFA regex library (which I assume would compile much faster) might be useful. I don't mean that as a suggestion for you, just musing to myself.

I do actually agree. I've intentionally never written down my game plan (mostly because I don't know how it will all shake out), but a lower level pike-style NFA library is actually on my list. I suspect that will be wind up being used as a compiler for both regex and regex-automata. Not 100% yet though. I don't just want this for the streaming use case, but I also want to put more focus on NFA generation itself and possibly more interesting uses of NFAs for matching. On top of that, it would be nice to have a regex-lite that has the same/similar API as regex, but with faster compile times but worse search performance.

BurntSushi commented 5 years ago

I've been working on a rewrite of the aho-corasick crate, and I've been thinking a bit more about this problem in the context of an Aho-Corasick automaton. It's a simpler use case, so I thought it might be nice to try my hand at it. Unfortunately, it's turned out to be fairly difficult.

More specifically, in my crate overhaul, I'm specifically trying to support standard Aho-Corasick match semantics (where a match is emitted as soon as it's detected), along with leftmost-first (what the regex crate uses) and leftmost-longest (what POSIX regex uses) match semantics. It turns out that supporting stream matching for the "standard" semantics is fairly simple since once a match is found, it can be emitted immediately. The search then restarts at the automaton's start state on the next byte.

The leftmost-{first,longest} match semantics, unfortunately, are quite a bit harder. In particular, it's possible for the automaton to move past the end of a match before detecting the match. For example, matching the regex a+z|a on a haystack of aaaaaaaaaaa using leftmost-first match semantics will actually cause the regex engine to consume all of the as in the input before yielding the first a as the final match. This is because if the string were aaaaaaaaaaaz instead, then the match would be the entire string since a+z has priority over a. In contrast, standard Aho-Corasick match semantics always takes the earliest match, so a match would be reported immediately after observing an a. Similar examples can be constructed for leftmost-longest match semantics.

One potential way of solving this is to declare that the size of any match can fit into memory (or more likely, impose a match length limit) and use a buffer as the automaton searches through the input. Contents preceding the end of the last match can be dropped, but any contents read into the buffer in the course of searching that appear after the last match need to be kept in the buffer for subsequent searching.

Mostly I just wanted to drop these thoughts here while they were fresh in my brain.

iago-lito commented 5 years ago

Hey. I'm late and new to the party. Having read this NFA post this morning and being myself enthusiastic about streaming regex searches, I keep.. feeling a bit lost among the variety of viewpoints around the regex world.

Maybe this thread is a good place to gather questions, discussion and resources on the topic, so I'm jumping in (stop me if it's not). For instance, I'm wondering:

I have this gut feeling that .* and .+ quantified wildcards in a stream regex must be lazy, or they can only nomatch (or remain stuck forever). As such, only .*? and .+? can be supported instead. Is this correct?
Would it make sense to define a new matching status, like a "ongoing match"? For instance, when the pattern abc+ is presented the stream:
- ab : I expect it to remain silent even though the stream "has a good smell"
- abc : it could remain silent and wait for a new character (for it may be another c). But why not throwing a "partial" or "ongoing match" event, just to warn the user that there is no doubt the pattern occurs, even though it is not entirely determined yet? What implications would this have on the underlying NFA. (more than 1 type of "matching" state I suppose)
- abcc : same.
- abccd: throw the final, true "match" again, or transform the ongoing match into an actual, "terminated match" (although it's 1 step late).
Since there seems to exist "leftmost-*" standards, how should the pattern a+?a+? react to stream aaa? (my brain hurted when I hit against this)

These are just pen&paper thoughts for now. I just expect feeling less alone because we can share them here :) My global feeling is that some strong structural, standard properties of regexes really need be rethought or redefined in the streaming context. It sure will be good to stick to (say) POSIX requirements as much as possible, but maybe a few modifications cannot be avoided.

BurntSushi commented 5 years ago

@iago-lito Yes, that line of thinking leads one to completely different match semantics. It's less about lazy vs. greedy, and more about reporting a match as soon as one is known. This is effectively what Hyperscan does---which was designed to support stream searching---and it's why match counts reported by Hyperscan differ quite a bit when compared against "traditional" regex engines. To drive home the point that lazy vs. greedy isn't really essence of this, consider cases like (foo){500,}, where you must see at least 500 occurrences of foo in a row in order to find the match. Similarly, consider the case of samwise|sam. In this regex engine, which uses leftmost-first semantics, this will match samwise in the text samwise. However, a match is known to occur as soon as sam is seen. Reporting a match at that point is effectively leftmost-shortest semantics, which obviously does not match leftmost-first (used by this regex lib, along with Perl compatible regexes) and also does not match leftmost-longest (used by POSIX regexes).

The next thing I'd like to do in this space is to look at how editors like Vim and Emacs implement regexes. Mostly because that seems to be the primary driving use case here: folks want stream searching in order to implement regexes on rope-like data structures.

iago-lito commented 5 years ago

I think you're right that it's one major driving use case indeed. My own toy project is rather focused on terminal input/output live parsing.. but the ultimate goal is script edition.
(Side note: maybe a clever handling of weird characters like \b would be desirable in any such use case i.e. where the input is produced by a human + keyboard. E.g. word should match worf\bd with no effort.)

The fact that leftmost-first is unconsistent with leftmost-shortest with stream data seems unavoidable. The fact that you need to wait 500 occurences of foo in order to declare a match of (foo){500,} seems unavoidable as well.

As a consequence, my intuition is that the very concept of "matching" could be (relaxed|extended) to something more (meaningless|rich) in terms of semantics. In my dream, patterns like samwise|sam and abc+ would just be translated to verbose automata that would only throw unquestionable events as the stream goes, and responsibility would be left to the user as to what to do with them. For instance with samwise|sam (say with g flag):

stream            | events / state (accessible to user)
------------------|-------------------------
-                 | nomatch
-s                | potential match instance (2 possibilities)
-sa               | potential match instance (2 possibilities)
-sam              | right match! (but maybe it's left)
-samw             | potential left instead of right
-samwi            | potential left instead of right
-samwis           | potential left instead of right + 2nd potential instance
-samwise          | left confirmed! right cancelled! + 2nd instance cancelled
-samwise-         | nomatch
-samwise-s        | potential match instance (2 possibilities)
-samwise-sa       | potential match instance (2 possibilities)
-samwise-sam      | right match! (but maybe it's left)
-samwise-samw     | potential left instead of right
-samwise-samwi    | potential left instead of right
-samwise-samwin   | left cancelled! right confirmed!
-samwise-samwino  | nomatch
-samwise-samwinot | nomatch

As for abc+ (with g flag):

stream            | events / state (accessible to user)
------------------|-------------------------
-                 | nomatch
-a                | potential match instance
-ac               | instance cancelled
-aca              | potential match instance
-acab             | potential match instance
-acabc            | match! instance undecided yet.
-acabc-           | instance decided! it was `abc`. nomatch.
-acabc-a          | potential match instance
-acabc-ab         | potential match instance
-acabc-abc        | match! instance undecided yet.
-acabc-abcc       | instance undecided yet
-acabc-abccc      | instance undecided yet
-acabc-abcccd     | instance decided! it was `abccc`. nomatch
-acabc-abcccd-    | nomatch

This looks like a lot of blurry information. But I trust that we cannot be more accurate on a stream. However, I also trust this information can be structured so that it would be reliable, easy to understand, to extend, fast to compute etc.

What do you think? This would come at the cost of more states types in NFAs and more static analysis of regexes in order to determine them (+ maybe reifying the concept of "pattern" vs. spawned match "instance"), but I'm sure it could be fun :P At least it would be honest for the user. IMO, the hardest part would be to make it flexible enough so that user would not pay performance costs for information they don't actually need.

(and it also means a lot of pen&paper first X)

BurntSushi commented 5 years ago

Honestly, my next step is to understand the use case better, which means looking at how this is done in vim and emacs. Your ideas seem neat, but they seem like they are more in the domain of a specialized regex engine, like Hyperscan.

iago-lito commented 5 years ago

Maybe they are. I will not beat about the Hyperscan bush any longer and get down to reading their doc :) Thank you for the feedback. I'll keep an eye around as I tinker. And thank you for the regex crate <3

ghost commented 4 years ago

I am late to the party too. I don't think I have the knowledge to write code for this, in fact I don't even know rust, I am just curious if this problem can be solved, becauser it is fun. I thought about the same naive approach you wrote in the starting post first, but that is just one constraint. I completely agree with BurntSushi that you might need to write a completely new engine to solve it. Now after a few hours I think that the whole problem boils down to choosing constraints.

Having a maximum size for the matching substring. For example you can do something like this with any regex engine: https://stackoverflow.com/a/19597011/607033 Or you can build it into the engine and dropping started match candidates by reaching the end of the buffer. I guess the bounded backtracker you mentioned applies the same constraint more or less. I guess in some cases no backtracking is needed, so it is a partial solution.
Running the engine through the regex at least two times as you proposed in your starting post. This can be solved by injecting a stream factory or by storing the entire data on the disk temporarily.

I think the API should support choosing from these constraints. I think the best solution would be setting memory usage limits and disk space usage limits or injecting a stream factory parameter, so the engine could decide which algorithm is the best fit for the given constraints. The stream factory might require writing a new engine, so probably dropping that for the sake of simplicity and extending the buffer with disk space is a good idea. What do you think?

BurntSushi commented 3 years ago

It looks like alacritty uses regex-automata to implement a form of matching on non-contiguous text.

BurntSushi commented 3 years ago

This PR by @archseer caused me to start thinking about this problem again: https://github.com/helix-editor/helix/pull/211

In particular, I'm thinking about a "cursor" abstraction that permits one to move forwards and backwards through the haystack, but otherwise does not generally offer random access. To make this concrete, I'd like to first start by explaining the current lowest level search API:

struct Match {
    start: usize,
    end: usize,
}

fn find(re: &Regex, context: &[u8], search_start: usize, search_end: usize) -> Option<Match>> { ... }

The key bit here is that re only searches at the bounds given by search_start and search_end inside of context. The reason for doing this is so that look-around assertions can be properly evaluated. For example, if you have the regex \bquux\b and want to see if there is a match at position 3 in fooquuxbar, then you might try this: re.find(&haystack[3..7]). But this would incorrectly report a match since the beginning and ending of a haystack are considered word boundaries. The key here is that when the caller does the slicing like this, it removes the ability of the searcher to correctly consider the surrounding context. So instead, we want to search like so: re.find(haystack, 3, 7). And that should indeed report no match.

There is no real way to drop this requirement because the regex engine requires the ability to compose things. For example, today, the lazy DFA is used to find most matches, and then the PikeVM is used to resolve capturing groups on only the span reported by the lazy DFA. But in order to do this correctly, the PikeVM needs to support the above API in order to correctly consider the surrounding context.

So with that said, I'd like to write down a way of abstracting over streams. We'll come back to the problem above. Here's a trait:

/// Cursor abstracts over the representation of a haystack that regex
/// engines use internally. In particular, it specifies the minimum
/// requirements that the regex engine needs in order to execute a
/// search.
///
/// When executing a search, the cursor should be positioned
/// immediately before the chunk at which to start the search. That is,
/// a forward search will begin by first calling `next`.
trait Cursor {
    type Error;

    /// Returns the current chunk of bytes for this cursor.
    fn chunk(&self) -> &[u8];

    /// Returns the offset corresponding to the start of the current
    /// chunk.
    fn pos(&self) -> usize;

    /// Advances the cursor to the next chunk of bytes, and if one
    /// exists, returns the true. When the haystack ends, this must
    /// return false and calls to `chunk` must return an empty slice.
    /// Subsequent calls must also return false and otherwise not cause
    /// the cursor to advance.
    ///
    /// In the case of an error, implementations should return false
    /// and ensure `error` returns the corresponding error.
    fn next(&mut self) -> bool;

    /// Advances the cursor to the previous chunk of bytes, and if one
    /// exists, returns the true. If the cursor is positioned at the
    /// beginning of the haystack, this must return false and calls to
    /// `chunk` must return an empty slice. Subsequent calls must also
    /// return false and otherwise not cause the cursor to advance.
    ///
    /// In the case of an error, implementations should return false
    /// and ensure `error` returns the corresponding error.
    fn prev(&mut self) -> bool;

    /// These report out-of-band errors if advancing the cursor failed.
    /// In general, the regex engine will not inspect these errors.
    /// Instead, callers are responsible for checking whether the
    /// cursor failed after a search or not.
    fn error(&self) -> Option<&Self::Error>;
    fn into_error(self) -> Option<Self::Error>;
}

This trait allows byte-at-a-time streams, but also permits implementations to feed chunks of bytes to the regex engine to reduce overhead and enable fast prefilter searches to work correctly. I believe it also permits implementing streams that do I/O by providing a mechanism to report errors, albeit out-of-band.

But there are some problems. I'm not sure if I can really make this work. Basically, my high level goal here is to write one search routine that works for the common case of "search a contiguous block of memory," but can also work on streams. If I can implement the search routines using the above interface, then of course, that code would support searching a contiguous block of memory: the cursor would just have three states: at the start (prev returned false), the bytes and at the end (next returned false). If I wind up having to duplicate all of the search routines, then I don't think that's a good maintenance-effort trade-off for me unfortunately.

So with that out of the way, here are some questions:

How are matches reported? Right now, the cursor is still kind of pretending like the bytes are all part of one slice. And in particular, requires one to provide the offset of the start of each chunk. But this offset isn't actually useful for anything inside the cursor or the search implementation, other than reporting the match location. But is this the correct way to do it? What about the chunk itself? If the match is in the ith chunk, then should that be reported too? (And if so, I see this as a problem, because now the chunk position ends up infecting the API of all search calls, even when searching contiguous memory.)
The regex engines sometimes want to look backwards in the haystack a few bytes. For example, to handle look-around assertions like ^ and \b. Using the API above, the regex engine would need to, e.g., call prev up to a max of 4 times in order to correctly handle \b, since the encoding of a Unicode codepoint might be 4 bytes and the minimum chunk size is 1. In order for the regex engine to get back to where it was, it actually has to keep track of the number of prev calls it made, and then make the corresponding number of next calls. That seems unwieldy.
In the case where DFAs are used, the end match location is first found via a forward search, and then the start of the match is found via a reverse search. But matches might span multiple chunks. So how do I execute the reverse search using the match offset reported by the forward search? I think the only way is to make sure the forward search guarantees the cursor is positioned such that the match occurs in the current chunk and we have to thread through a separate match offset that's relative to that particular chunk in order to know where to start the reverse search. The alternative here is to introduce more coupling at the search level I think, which I'd prefer not to do. (That is, instead of having one search routine for forwards and one for reverse and having them generally be blissfully unaware of each other, it might be more expedient to combine them.)

And then finally, there is the question of how to resolve the context question I posed at the beginning. You might think that the search API with a Cursor would look like this:

fn find<C: Cursor>(re: &Regex, haystack: C) -> Option<Match> { ... }

But now this loses the ability for the search to consider the surrounding context correctly. We can somewhat handle the context at the beginning of the search via the requirement that the cursor be positioned to where you want the search to begin. But that still isn't quite right, because you might not want to start the search at the beginning of the chunk. You might want to start it somewhere else inside the chunk. And of course, this doesn't address how to handle the context at the end of the search either.

So overall, this API for describing streams seems insufficient. At every turn we make, the search implementations seem to be begging for random access. And perhaps in some cases of Cursor, random access is indeed possible. But this greatly complicates the API depending on how exactly random access is communicated. e.g., Does it need to be a pair of (chunk_index, within_chunk_index), or can it just be a single absolute_byte_offset? Or maybe random access is just fundamentally incompatible with non-contiguous representations. I'm not sure.

If we abandon the idea of trying to adapt the regex engines to streams, and instead start with streams as our principal requirement, then some simplifications could be made I think:

Always use the PikeVM. Slower, but it sounds like this is what things like kakoune and vim do. (Or at least, what they do is simulate an NFA and that's what the PikeVM does.)
Drop the need for dealing with context. Instead, either all searches are over the full haystack, or you are OK with searches being incorrect on sub-haystacks. (I can imagine this being important for, e.g., wanting to search a range or selection of text in a text editor.)
Instead of trying to make the regex engines work on generic streams, fix the regex engine search routine to your specific non-contiguous representation. You might have additional powers that aren't appropriate to assume in the general case. Alternatively, make Cursor less general-purpose by targeting a common non-contiguous use case while leaving other use cases out in the cold.

While I was thinking about this in bed last night, I found the Cursor abstraction promising. But now, as I write this, I am becoming increasing convinced that folks will need to do their own bespoke implementations for this. And maybe a few years down the road, when a few of those implementations exist, we can potentially revisit this issue and see about trying to unify them.

But I do think there is one nice advantage here. Once #656 is done, at least for folks implementing text editors, I think you'll have a big advantage over text editors that roll their own regex engine from the parser on-up. Namely, you'll at least have a library that lets you build the NFA. At that point, all you have to do is write the search implementation, which is... comparatively little work when compared to the work required to get to that point in the first place. Namely, it is my goal with regex-automata to expose enough internals about each regex engine to make it possible to use it to write your own search implementation.

cessen commented 3 years ago

But now, as I write this, I am becoming increasing convinced that folks will need to do their own bespoke implementations for this. And maybe a few years down the road, when a few of those implementations exist, we can potentially revisit this issue and see about trying to unify them.

In general, I think this makes sense. Get some experience before committing the regex crate proper to any particular API/implementation. And the existence of regex-automata allows experimenting with what that might look like.

Having said that, I have a thought or two.

fn find<C: Cursor>(re: &Regex, haystack: C) -> Option<Match> { ... }
But now this loses the ability for the search to consider the surrounding context correctly.

Is that really true, though? Since the cursor can report it's current offset, something like this should still be doable I think:

fn find<C: Cursor>(re: &Regex, haystack: C, search_start: usize, search_end: usize) -> Option<Match>> { ... }

If search_start isn't in the current chunk, the find code can navigate to it as needed. That would result in inefficiencies for long non-contiguous text where search_start is very far from the current cursor position. But it would work correctly. And code that cares about the efficiency can make sure to pass a cursor that's close to search_start.

In order for the regex engine to get back to where it was, it actually has to keep track of the number of prev calls it made, and then make the corresponding number of next calls. That seems unwieldy.

There are a couple of ways I can think of to interpret this, one of which would make what I'm about to say redundant, so apologies if that's the case. But I'm pretty sure the regex engine doesn't have to track number of calls to prev()/next(), it just has to track the absolute offsets (independent of chunk). When it needs data at an offset that's not in the current chunk, then it calls prev()/next() as appropriate until it gets to the chunk it needs.

Having said that, efficiency-wise perhaps all that branching wouldn't be great for performance. But for an initial implementation I think(?) it would be pretty straightforward, and then further optimizations (e.g. doing a single chunk bounds-check for a whole set of operations) can be added incrementally to improve performance.

But maybe I'm missing something (wouldn't be the first time!).

But I do think there is one nice advantage here. Once #656 is done, at least for folks implementing text editors, I think you'll have a big advantage over text editors that roll their own regex engine from the parser on-up. Namely, you'll at least have a library that lets you build the NFA.

Honestly, I kind of feel like maybe this is just the way to go generally. I realize I'm essentially repeating something you already said, but yeah... text editors (and I'm sure other applications) often have their own unique requirements, so instead of trying to accommodate them directly in the regex crate with a cursor/stream API, it probably makes more sense to expand the number of engines provided in regex-automata so people can plug them together however they need.

So if you're going to do the work on that anyway, it's not totally clear to me that it's even worth changing the regex crate itself at that point.

Moreover, with those engines available, it wouldn't be hard for other people to put together separate (and narrower-scope) regex crates targeted at those various use-cases. And then regex itself can remain targeted at the more common case of contiguous text.

BurntSushi commented 3 years ago

@cessen Sounds good. I believe both of your points about the proposed Cursor API are correct. Although, it's still assuming that an absolute byte offset is a viable thing for a cursor to report. (I think it is, but I'm not sure.)

One thing that I should have been clearer about is that I was more-or-less talking about a Cursor API at the level of regex-automata, not regex. Unless there's some breakthrough in the design space here, yeah, I definitely do not see streaming APIs coming to regex proper. But I was trying to see if I could do it at the regex-automata level where somewhat more complex APIs are more readily justified.

cessen commented 3 years ago

Although, it's still assuming that an absolute byte offset is a viable thing for a cursor to report. (I think it is, but I'm not sure.)

I think at the very least a relative byte offset from some arbitrary point is reasonable to expect. And that ends up amounting to the same thing, because a Cursor can just start by reporting usize::MAX / 2 if it wants the flexibility to move in both directions from the first chunk it provides.

In other words, I think all it really requires from a Cursor is that it adds/subtracts the chunk lengths from a counter, which I think should be doable by anything...? And Cursors that do know their absolute offsets can still provide that.

One thing that I should have been clearer about is that I was more-or-less talking about a Cursor API at the level of regex-automata, not regex.

Ah, got it. Yeah, that's great too. :-)

8573 commented 3 years ago

In order for the regex engine to get back to where it was, it actually has to keep track of the number of prev calls it made, and then make the corresponding number of next calls. That seems unwieldy.

[…] I'm pretty sure the regex engine doesn't have to track number of calls to prev()/next(), it just has to track the absolute offsets (independent of chunk). When it needs data at an offset that's not in the current chunk, then it calls prev()/next() as appropriate until it gets to the chunk it needs.

Having said that, efficiency-wise perhaps all that branching wouldn't be great for performance.

I'm reminded of @Marwes's parsing library, which, if I recall correctly, handles backtracking by cloning and discarding its cursor-equivalent as necessary rather than having a single cursor that needs to be re-advanced after being rewound.

BurntSushi commented 3 years ago

@8573 Interesting. I guess the implementation of Cursor would need to specifically support that somehow? e.g., I'm thinking of how someone might implement this for std::fs::File and seek calls. I guess you would probably want to use pread instead and that would probably work nicely?

cessen commented 3 years ago

[...] which, if I recall correctly, handles backtracking by cloning and discarding its cursor-equivalent as necessary [...]

I'm not sure if requiring Cursors to implement Clone is a requirement we want to impose...?

I'm imagining, for example, a Cursor that's built on top of a network stream. It obviously wouldn't work for backtracking, and would therefore error out with e.g. the DFA engine when you want the full match range. But you'd ideally still like to use it with the NFA engine, or even the DFA engine when only testing for match existence.

BurntSushi commented 3 years ago

@cessen Even the NFA simulation requires looking backwards a few bytes though. The idea of a Cursor is that it doesn't support streams in the strictest sense, but instead provides just enough power to make most of the regex engines workable. If we wanted to support fully general streams, then there would be no Cursor trait I think. We would just accept a io::Read impl. In that case, I think you could make the NFA simulation work at least by buffering. Since look-behind is capped at a small number of bytes, that could be made to work for the PikeVM specifically I think. But with a Cursor, you get more power, and thus the ability to do backtracking or even reverse DFA searches.

(Of course both use cases are important. I was exploring the more limited approach with Cursor to see how far we could get there. A Cursor is essentially an abstraction over discontiguous strings, but where forward and backward iteration is possible, so, i.e., not a stream.)