rust-lang / regex

An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
https://docs.rs/regex
Apache License 2.0
3.53k stars 440 forks source link

implement one-pass NFA matcher #68

Closed BurntSushi closed 1 year ago

BurntSushi commented 9 years ago

It turns out that when a regex is deterministic, the NFA simulation can be implemented much more efficiently because it only needs to keep track of one set of capture groups (instead of a set of capture groups for every thread).

There are two components to adding this:

In terms of code, it would be nice if we could find a way to reuse code with the full NFA simulation.

This should be easier to implement than #66, and should boost the performance of a lot of regexes. (Of course, we should do both a one pass NFA and #66!)

SeanRBurton commented 9 years ago

Can you please elaborate on what exactly you mean by this, and perhaps point to some literature and/or examples?

BurntSushi commented 9 years ago

I'm not aware of any literature on this topic, and as far as I know, Russ Cox came up with it. He talks about it briefly here: https://swtch.com/~rsc/regexp/regexp3.html (CTRL-F for one-pass).

The definition in my OP is the key: a one-pass NFA can be used to evaluate a regex where, at any point during the evaluation of the machine, at most one transition can lead to a match. So if you know your machine has that property, you don't need to keep track of as much state as the full NFA simulation.

The last time I looked at implementing this, the tricky part I think was actually detecting one-pass regexes. It can probably be simplified by starting with a routine that reports a lot of false negatives (false positives cannot be allowed).

SeanRBurton commented 9 years ago

Thanks for the info; what I was missing is that it can only be used for anchored matches, so that we do not have to introduce a new thread of execution for each input character.

BurntSushi commented 9 years ago

Right. Unanchored regexes have a .*? put implicitly at the beginning, which immediately defeats the one-pass optimization.

jneem commented 9 years ago

FWIW, the regex-dfa crate makes it trivial to check whether the regex is one-pass: the DFA created by regex-dfa is minimized, and so it's one-pass iff it has no branches.

I'm not sure how to go from there to a one-pass NFA, though. Specifically, you'd need to get back the capture groups somehow.

ethanpailes commented 6 years ago

How does a 1-pass NFA simulation compare with an NFA simulation that does lookahead before spawing new threads at branch points? It seems like the advantage of a 1-pass simulation is that it will never spawn more than 1 thread, because there are no non-deterministic branches. Adding lookahead (or [previews][previews] to generalize even further) is an optimization that other engines sometimes do, and it seems to me like an engine that does lookahead at every branch will never spawn more than 1 thread for a 1-pass NFA.

I bring this up because I keep playing with the idea of trying to impliment previews or lookahead (no promises), but this 1-pass NFA thing also seems like a pretty easy optimization to do and I'm wondering if anyone has any insight about whether or not these optimizations are a 1-or-the-other sort of thing.

BurntSushi commented 6 years ago

It seems like the advantage of a 1-pass simulation is that it will never spawn more than 1 thread

Correct. Thereby eliminating the overhead of managing and tracking threads. A one-pass NFA implementation uses this fact to avoid creating threads in the first place.

and it seems to me like an engine that does lookahead at every branch will never spawn more than 1 thread for a 1-pass NFA.

I don't follow this. A one-pass NFA will never spawn more than one thread, period. Lookahead/preview have nothing to do with that.

Adding previews permits you to limit the creation of threads in the general case for a non-one-pass NFA.

(I suspect I am misunderstanding your query.)

ethanpailes commented 6 years ago

and it seems to me like an engine that does lookahead at every branch will never spawn more than 1 thread for a 1-pass NFA.

I should have said "1-pass regex". You are 100% right that a 1-pass NFA can't spawn new threads. Sorry for the confused communication.

To unpack my point a bit more using the examples that Russ Cox provides on his website. /([^x]*)x(.*)/ is one-pass. An engine that does lookahead before spawning new threads on a repetition will never spawn more than one thread for this regex because the 1-previews of /[^x]*/ and /x/ are disjoint. It seems like a 1-pass NFA would have to do basically the same thing, while being a bit less general, but I'm worried I'm missing something subtle.

BurntSushi commented 6 years ago

@ethanpailes Hmm. I'm still having trouble understanding the essence of your question. Note that the determination of whether a regex is one pass or not is made at compile time, and that information is then used to switch to a completely different implementation of the Pike VM that doesn't use threads at all. In that implementation there would be no previews/look-ahead, and in fact, there would be no point at all in using previews/look-ahead for that case since you would only ever be using one thread.

ethanpailes commented 6 years ago

Oh, now that you mention the PikeVM I realize that the lookahead version would have to do lots of extra work to copy capture group info around because it wouldn't know that it was only ever going to spawn 1 thread. I think my thinking was muddled by thinking more about the bounded backtracker. Thanks!

ethanpailes commented 6 years ago

So I've been quitely spinning my wheels on this for the last month or so, and after a few embarrassing false starts I think I've gotten something worth talking about in public. I'm a little confused by the terminology surrounding onepass matchers, because it seems to me that they are only applicable when a regex has no non-determinism, in which case a DFA can impliment capture groups just fine, so that's what I implimented. The code I linked is a pretty good proof of concept but there are lots of outstanding issues.

I'm not entirely sure what the execution planner should do with onepass dfas yet. Onepass DFAs are the most useful where we would otherwise have to use an NFA to extract capture groups, but there is a chance that they could do better than the lazy DFA because they should have lower warmup cost. When a regex has no non-determinism, there can never be an exponential blowup in space (no non-determinism means no compound states from the powerset construction), so I just AOT compiled the FSM.

Anyway, here's what happens when I ran all the benchmarks containing capture groups.

 name                           old.bench ns/iter     new.bench ns/iter     diff ns/iter   diff %  speedup 
 misc::short_haystack_1000000x  149,041 (53676 MB/s)  146,018 (54787 MB/s)        -3,023   -2.03%   x 1.02 
 misc::short_haystack_100000x   15,449 (51783 MB/s)   13,817 (57900 MB/s)         -1,632  -10.56%   x 1.12 
 misc::short_haystack_10000x    4,024 (19883 MB/s)    1,259 (63551 MB/s)          -2,765  -68.71%   x 3.20 
 misc::short_haystack_1000x     696 (11510 MB/s)      281 (28508 MB/s)              -415  -59.63%   x 2.48 
 misc::short_haystack_100x      452 (1794 MB/s)       192 (4223 MB/s)               -260  -57.52%   x 2.35 
 misc::short_haystack_10x       400 (227 MB/s)        157 (579 MB/s)                -243  -60.75%   x 2.55 
 misc::short_haystack_1x        391 (48 MB/s)         147 (129 MB/s)                -244  -62.40%   x 2.66 
 misc::short_haystack_20x       425 (402 MB/s)        174 (982 MB/s)                -251  -59.06%   x 2.44 
 misc::short_haystack_2x        405 (66 MB/s)         162 (166 MB/s)                -243  -60.00%   x 2.50 
 misc::short_haystack_30x       421 (596 MB/s)        180 (1394 MB/s)               -241  -57.24%   x 2.34 
 misc::short_haystack_3x        401 (87 MB/s)         149 (234 MB/s)                -252  -62.84%   x 2.69 
 misc::short_haystack_40x       432 (766 MB/s)        187 (1770 MB/s)               -245  -56.71%   x 2.31 
 misc::short_haystack_4x        395 (108 MB/s)        151 (284 MB/s)                -244  -61.77%   x 2.62 

The bigger versions show less speedup because there is a literal scan optimization happening, so as the input grows, less and less time is spent in the main matching engine.

ethanpailes commented 6 years ago

I know this is a pretty big wad of code and issues to cough up, so if you are interested in triage: comments about the EmptyLook stuff and pontification on execution planning would probably be the most helpful to me right now.

BurntSushi commented 6 years ago

That looks like awesome work! I'm not sure when I'll be able to review it, but it might be a good idea to just open a PR when you think it's in a reasonable mergeable state. It is a lot of code though, and unfortunately i won't be able to merge it until I'm confident that I understand it thoroughly.

I'm not sure i understand your question about EmptyLook though.

ethanpailes commented 6 years ago

I'm not sure i understand your question about EmptyLook though.

That's fine. I'll just implement EmptyLooks with an intermediary state. I figured there was about a 50% chance that I wasn't being entirely coherent about flags.

I fully expect this to take a long time to merge even after I open the PR. Considering that this is a whole new backend, getting things to a point where you are comfortable with all of the code is very important. A PR is probably a ways out though.

BurntSushi commented 6 years ago

Sounds good! The way flags are handled in the DFA is incredibly finnicky. They have been a source of bugs. The idea is to build them into the DFA, but the code isn't particularly forthcoming. And any time I need to tweak it for a bug fix or something, I need to re-learn how it works. So... simplifications are welcome. :-)

A key part of this will be testing. One potentially problematic area here is that it's not clear how well the existing test suite will cover the new matcher. Today, we kind of get away with things because each backend can handle a very large subset of the tests, and the tests are actually parameterized on the different backends. This might be a little trickier to pull off with a backend that works on fewer regexes. If it's possible to follow the pattern that other backends use, but skip tests that aren't applicable, then that might help. If we end up skipping too many, then we know we need better coverage.

ethanpailes commented 6 years ago

And any time I need to tweak it for a bug fix or something, I need to re-learn how it works. So... simplifications are welcome. :-)

I did get kinda scared off when I tried to grok that part of the DFA code. I'm glad I'm not the only one who finds it tricky :-)

One potentially problematic area here is that it's not clear how well the existing test suite will cover the new matcher. Today, we kind of get away with things because each backend can handle a very large subset of the tests, and the tests are actually parameterized on the different backends. This might be a little trickier to pull off with a backend that works on fewer regexes. If it's possible to follow the pattern that other backends use, but skip tests that aren't applicable, then that might help.

That's a good point. For now I just set the execution planner to fall back on an existing NFA simulation if the regex to be executed is not onepass, but I've made no effort to figure out how many tests involve onepass regex. I'll definitely try to get a count of that at some point. If not enough test cases light up I think re2 has some automatic testing ideas ripe for stealing, but let's burn that bridge when we get to it.

ethanpailes commented 6 years ago

@BurntSushi, while I was working on this I began to worry about alternations containing empty branches introducing subtle sources of non-determinism[1], so I wrote a few test cases. To my delight, it seems that they are currently not supported, so I don't have to think about it. I don't want to rely on that remaining the case long term though. It seems like there are two options. (1) commit to just keeping this restriction. The only time you would need empty branches over an optional alternative is if you want to very carfully control the precidence of the empty branch, which seems super niche. (2) Just remain mindful that the onepass matcher (specifically the bit that checks a regex to see if it is onepass) will need some TLC when this restriction gets lifted. I don't think the choice has any impact on my near term actions, so this comment is mostly just a way to air a potential footgun in public.

[1]: (e1|e2|)e3 == (e1|e2)?e3. In both cases you have to check for non-determinism between all three of the expressions.

ethanpailes commented 6 years ago

I've gotten to the point where all tests are green except for this one. It looks like the right answer is a zero-width capture right past the end of the input, but I don't understand what is going on with that \u{7EF5E} unicode value very well or why it should be special. Right now I'm just implimenting unicode support by using the .bytes() flag on the compiler and then just translating the resulting embedded DFAs, so I don't know why this case is special. Obviously I'm not asking you to do my debugging for me, but it might save me some time if I could understand what the test case is trying to do a bit better.

BurntSushi commented 6 years ago

w.r.t. to empty alternates, it was definitely my intention to allow them with the regex-syntax overhaul. It wasn't until I actually let them through the compiler that I realized the current compiler didn't know how to handle them. But it is definitely possible. I didn't want to spend the time teaching the compiler this, so I just banned them until a refactoring fixes it.

Empty alternates are a somewhat niche feature, so if the onepass DFA can't handle them or it's hard to handle them, then simply using that as a criterion to exclude the onepass DFA sounds great to me. Please note though that there are other ways for empty alternates to manifest themselves in a way that the compiler can handle. e.g., I think (e1|e2|())e3 might be allowed today. So you'll want to check that case I think.

I've gotten to the point where all tests are green except for this one. It looks like the right answer is a zero-width capture right past the end of the input, but I don't understand what is going on with that \u{7EF5E} unicode value very well or why it should be special.

I think the specific codepoint is a red herring, but rather, the important bit is to exercise \B in the face of a codepoint encoded by multiple code units. \B has been a persistent source of bugs, and its behavior with and without Unicode support enabled is really subtle. I don't have time to think through it right now, but \B is pretty rarely used in my experience, and if you don't want to think through it (at least initially anyway), it is more than acceptable to reject onepass if the pattern contains a \B.

ethanpailes commented 6 years ago

Please note though that there are other ways for empty alternates to manifest themselves in a way that the compiler can handle. e.g., I think (e1|e2|())e3 might be allowed today. So you'll want to check that case I think.

As I was migrating the code over from my research branch to make a pretty PR I realized just this and implemented some more principled checks for regex which might accept the emptystring (rather than just special casing an empty branch or a captured empty branch). I think this should mean that adding support for empty alternatives won't cause any trouble. Sorry for the noise.

Thanks for unpacking what is going on in that test case for me! Hopefully it will help me when I try to dig into what is going wrong.

BurntSushi commented 2 years ago

@ethanpailes OK, so I've finally implemented a one-pass engine.

Docs: https://burntsushi.net/stuff/tmp-do-not-link-me/regex-automata/regex_automata/dfa/onepass/struct.DFA.html

Implementation (WIP branch, do not use): https://github.com/BurntSushi/regex-automata/blob/51a69e2520e989d4788eadf848910b58b897e0cd/src/dfa/onepass.rs

I want to say that I'm sorry. I think my instincts led you down a very bad path. I think my biggest mistake was pushing you toward an up-front analysis pass on the HIR to determine whether the regex was one-pass or not. The big red flag here was that this resulted in things like \w not being one-pass. But in reality, every character class should be one-pass. The issue is that the naive way of building a UTF-8 automaton from a character class results in a construction that is not one-pass. So in order to do this analysis part correctly, you actually have to go out and build the UTF-8 automaton or do some other sophisticated thing at the rune level or just declare that all classes are one-pass I guess.

The other part I think I was wrong about was using a DFA. We do indeed want a DFA here. Since it's limited to at most the number of states in the NFA, its memory usage shouldn't be too bad. And we can put caps on how much memory it's allowed to use. Moreover, I think it really needs to be a DFA to achieve a compelling speed advantage.

Anyway, sorry for steering you down the wrong path. I had started going down that path too, but the more I looked at and started working on the problem the less sense it made.

In the end, I ended up with an implementation that is pretty close to RE2's. It's different in some ways (it supports searching for multiple patterns), but the essential shape of the implementation is very similar.

ethanpailes commented 2 years ago

At this point it's been long enough since I implemented it that I can't remember if I implemented the filter in terms of the HIR or the bytecodes first. I think it makes sense to do everything in terms of the bytecodes so you don't have to worry about utf8 as much. I remember that the biggest difference between my implementation and RE2s was that mine had variable width state entries in the DFA which made it more complicated but let it support more capture groups. Did you wind up doing fixed sized states or variable sized?

BurntSushi commented 2 years ago

Did you wind up doing fixed sized states or variable sized?

Fixed size. Although I used a u64 for each transition instead of a u32 like RE2. So, we use a bit more memory but can support up to something like a dozen capture groups. (RE2 can only support something like a half-dozen I think?) The regex crate also has more look-around assertions, so it ends up needing more bits for that too. The benefit is of course that fixed size states give us faster transitions. But yeah, a sparse variant could also be added if there are some compelling use cases for it.

ethanpailes commented 2 years ago

Yeah I think that approach makes more sense. Getting the variable sized states to work was a pain. I think I was being too completionist about that.