rust-lang / regex

An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
https://docs.rs/regex
Apache License 2.0

implement a DFA matcher #66

Closed BurntSushi closed 8 years ago

BurntSushi commented 9 years ago

One of the reasons why RE2/C++ is so fast is because it has two implementations of regex matching: a limited DFA matcher (no sub-capture support) and a full NFA simulation. This crate has the latter, but not the former.

Adding a DFA matcher should be an implementation detail and shouldn't require any public facing changes.

This is a pretty involved project. I hope to find time to do this some day, but if someone else wants to tackle it, I'd be happy to help mentor it. (Part of this will be figuring out how to handle the regex! macro. Do we replicate the DFA there too like we do the NFA?)

WillEngler commented 9 years ago

I'll do it!

@BurntSushi recommends that I look into an existing DFA engine by @carllerche.

I'll be starting by getting familiar with the code in this repo and RE2.

carllerche commented 9 years ago

Feel free to ping me on IRC or wherever if you want to talk about carllerche/automaton

WillEngler commented 9 years ago

Here's a summary of some research I've done and my intended next steps.

TL;DR

The algorithms that @carllerche uses in carllerche/automaton make a different time/space tradeoff than RE2 and rust-lang/regex have made so far. For now, I recommend that we keep modeling rust-lang/regex after RE2 and use its method of lazy DFA construction rather than incorporate Carl's work. However, using Carl's work to construct a DFA at compile time with the regex! macro could be a killer future enhancement.

A Comparison of rust-lang/regex, RE2, and carllerche/automaton

I consulted The Dragon Book and Russ Cox's blog series on implementing RE2 for background on algorithms for constructing and simulating automata. I'll refer to algorithms by numbers given to them in the Dragon Book for anyone who wants to take a closer look.

Next Steps

I say we keep modeling rust-lang/regex after RE2 and implement lazy DFA construction rather than ahead-of-time DFA construction. I see this as taking two big steps:

  1. Implement DFA simulation.
  2. Work out when to call on the DFA simulator and when to call on the NFA simulator depending on the question the client is asking (Is there a match? Where is the match? etc.).
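
To illustrate step 2, here's a minimal sketch of the dispatch (hypothetical names, not this crate's API):

```rust
// Pick the cheapest engine that can answer the caller's question.
#[derive(Debug, PartialEq)]
enum Engine {
    Dfa, // match-only: no submatch boundaries
    Nfa, // full simulation: tracks capture groups
}

enum Question {
    IsMatch,  // "is there a match?"
    Bounds,   // "where does the match start and end?"
    Captures, // "where are the submatch boundaries?"
}

fn choose_engine(q: &Question) -> Engine {
    match q {
        // A DFA can answer existence, and a second (reverse) pass can
        // recover the overall match bounds without tracking captures.
        Question::IsMatch | Question::Bounds => Engine::Dfa,
        // Submatch boundaries require the NFA simulation.
        Question::Captures => Engine::Nfa,
    }
}

fn main() {
    assert_eq!(choose_engine(&Question::IsMatch), Engine::Dfa);
    assert_eq!(choose_engine(&Question::Captures), Engine::Nfa);
}
```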

Compilation of a regex into VM instructions would be unchanged.

However, the ahead-of-time approach could still be useful. Constructing a DFA at compile time with the regex! macro would be crazy cool. But I think that can wait :smile:.

Questions

@carllerche: I know you and @BurntSushi were talking about folding in carllerche/automaton. Am I misunderstanding anything? Does this seem reasonable to you?

Also, many thanks to @BurntSushi for being so helpful to this Rust newbie.

BurntSushi commented 9 years ago

@WillEngler Sounds good to me! w.r.t. regex!, that can be separately engineered, but its future is certainly exciting! :-)

(My new branch has some basic support for picking a matching engine on-the-fly, but the DFA is a bit more complicated since it can be beneficial to run the DFA first and then the NFA to find submatch boundaries. But this doesn't have to be in the first pass! A simple match-only DFA seems like a good initial goal to shoot for.)

carllerche commented 9 years ago

@WillEngler sounds accurate. My goal was to optimize runtime at the expense of compile time, on the assumption that it is easy to build the regex once at boot, so trading compile time for execution time is desirable.

WillEngler commented 9 years ago

Great! Then I'll get to work on a match-only DFA.

WillEngler commented 9 years ago

DFA Sitrep

I don't have anything to show for this week. If last week was about learning the theory of DFA construction, this week was about learning the implementation details. I traced and annotated Russ Cox's educational bounded-memory DFA implementation and his production implementation. I also followed @BurntSushi's major refactoring (#91) as I learned more about rust-lang/regex's internals.

Expect to see some action in WillEngler/regex this week.

(Is this proper Github etiquette? I thought it would be helpful to give an update because I haven't shipped anything this week.)

BurntSushi commented 9 years ago

@WillEngler Updates are really good, particularly when you're still getting your bearings. If you share info with others, then you can benefit from their experience! In any case, sounds like you're on the right track. :D

BurntSushi commented 9 years ago

@WillEngler Oh, and please don't feel pressure to "ship" stuff!

WillEngler commented 9 years ago

@BurntSushi Thanks, I needed that :smile:. I saw you contribute so many great additions so quickly in the refactor and I got a little intimidated. But as a beginner, it's unreasonable to expect myself to move that fast.

Onwards!

WillEngler commented 9 years ago

@BurntSushi I'm going down the rabbit hole of UTF-8 encoding and would like to make sure I'm going in the right direction.

As I understand it, each DFA state will need a collection of pointers to other DFA states in order to form a graph. We want to know the number of pointers when we allocate the DFA state. Obviously, one pointer per UTF-8 codepoint isn't viable.

RE2's approach is to maintain a 256-bit bitmap during compilation that records the boundaries between ranges of byte values that are treated identically. Then, when constructing the DFA engine, it creates a mapping from [0, 255] -> [0, num_character_classes]. This map is used to pick the right pointer when walking/lazily constructing the graph. Here's an example from Russ Cox:

A regexp looking for a[abc] would have four byte ranges -- 0 to 'a'-1, 'a', 'b' to 'c', and 'c' + 1 to 0xFF.
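
To make that concrete, here's a minimal sketch (hypothetical helper; not RE2's or this crate's code) that builds the [0, 255] -> class map from the sorted start bytes of those ranges:

```rust
/// Build a map from each byte value to a small class index, so a DFA state
/// needs one transition per class instead of one per byte. `range_starts`
/// lists the first byte of each range, sorted ascending, starting with 0x00.
fn byte_class_map(range_starts: &[u8]) -> [u8; 256] {
    let mut map = [0u8; 256];
    let mut class = 0u8;
    let mut next = 1; // range_starts[0] == 0x00 opens class 0
    for (b, slot) in map.iter_mut().enumerate() {
        if next < range_starts.len() && b as u8 == range_starts[next] {
            class += 1;
            next += 1;
        }
        *slot = class;
    }
    map
}

fn main() {
    // The four ranges for `a[abc]`: [0x00, 'a'-1], ['a'], ['b', 'c'], ['d', 0xFF].
    let map = byte_class_map(&[0x00, b'a', b'b', b'd']);
    assert_eq!(map[b'a' as usize], 1);
    assert_eq!(map[b'c' as usize], 2); // 'b' and 'c' share a class
    assert_eq!(map[0xF0], 3);
}
```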

RE2 gets away with only caring about 256 values by compiling multi-byte UTF-8 codepoints into mini-automata. (I think. I haven't scrutinized that part of the code enough to fully grasp it.)

Given Rust's approach to Unicode, it doesn't seem natural to compile codepoints down to their bytes. So here is the approach I'm imagining: keep RE2's equivalence-class idea, but apply it to Unicode codepoints instead of bytes, so that each DFA state holds one pointer per range of codepoints that the regex treats identically.

I have one major reservation. In RE2, the number of possible character ranges is bounded above at 256. In the approach I described, the upper bound is the number of Unicode codepoints. Are there use cases that would have many (say, > 1000) character classes and cause the size of the pointer array to blow up? And overall, does this approach seem reasonable?

BurntSushi commented 9 years ago

@WillEngler I was actually thinking about this problem a couple mornings ago, but I hadn't gotten a chance to write up my thoughts. Thank you for bringing it up!

My first thought is this: we shouldn't let the way Rust handles strings influence how matching engines are written. In particular, if it is easier to write a faster engine byte-by-byte instead of character-by-character, then we should do that. If we expose the same interface that we do today, then we can assume UTF-8 input (which is rather convenient). Doing this will, as you pointed out, require changing a fair bit so that UTF-8 decoding is built into the automaton generated by the compiler. The RE2 code to do this doesn't look that bad. I think this also means that we could collapse the Char and Ranges instructions into one, after some small re-factoring on how case insensitivity is handled.
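
As a rough illustration, a collapsed byte-oriented instruction could look something like this (hypothetical and much simplified; not the crate's actual instruction set):

```rust
// Every character test becomes a byte-range test; a multi-byte UTF-8
// character compiles to a chain of ByteRange instructions, and a class
// spanning several encoding lengths compiles to Splits over such chains.
enum Inst {
    ByteRange { lo: u8, hi: u8, goto: usize }, // consume one byte in [lo, hi]
    Split(usize, usize),                       // try both alternatives
    Match,                                     // accept
}
```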

My second thought is that this could potentially introduce some problems in corner cases. For example, consider the \b instruction. It requires one character look-behind. In a byte matching engine, this could be up to 4 bytes. It is Unicode aware in this crate, but it is not in RE2. I would like to preserve this property, not only because it seems correct but because a regression on Unicode support would not be nice. I don't know if it is problematic to implement this in the DFA. If it is, then I would choose "DFAs cannot be used with regexes that have a \b" over "change \b to not be Unicode aware." I'm not entirely sure whether this plays nicely with existing matching engines either. My hope is that it can just do what it does today: decode the previous character at the current input position. If the previous character is not valid UTF-8, then that instruction would simply fail at that position.
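
If it helps make that concrete, here's a sketch of such a decode attempt, assuming byte-offset positions (hypothetical helper, not this crate's code):

```rust
// Decode the character ending at byte offset `at`, or None if the bytes
// leading up to `at` are not valid UTF-8 (in which case \b would fail there).
fn prev_char(haystack: &[u8], at: usize) -> Option<char> {
    // A UTF-8 continuation byte looks like 0b10xx_xxxx; the start byte of
    // the previous character is at most 4 bytes back.
    let start = (at.saturating_sub(4)..at)
        .rev()
        .find(|&i| haystack[i] & 0b1100_0000 != 0b1000_0000)?;
    std::str::from_utf8(&haystack[start..at]).ok()?.chars().next()
}

fn main() {
    let s = "aé本".as_bytes(); // 1-, 2-, and 3-byte characters
    assert_eq!(prev_char(s, 1), Some('a'));
    assert_eq!(prev_char(s, 3), Some('é'));
    assert_eq!(prev_char(s, 6), Some('本'));
    assert_eq!(prev_char(s, 2), None); // position splits the 'é' encoding
}
```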

My third thought is that you should not take my thoughts for gospel. I am not saying, "to get a DFA implemented, we must switch to byte matching engines." :-) For example, if it is easier to ignore \b for now, then we can simply bar the DFA from running on any regexes with a \b instruction.

OK, with that out of the way, I'll actually respond to your thoughts. :-)

As I understand it, each DFA state will need a collection of pointers to other DFA states in order to form a graph. We want to know the number of pointers when we allocate the DFA state. Obviously, one pointer per UTF-8 codepoint isn't viable.

Yes! One thing I would consider if I were you is the choice of whether to actually use a pointer or not. My suspicion is that using pointers might require use of unsafe (or Rc) to manage multiple ownership, since you'll have multiple paths to the same state in the DFA. Instead, you might consider storing the states sequentially in memory (e.g., Vec<State>), and refer to them by index. This preserves a nice single-ownership scheme. (I used this representation in my Aho-Corasick implementation.) Another nice property of this is that you might not need the "class" indirection because you could use a much smaller "pointer" type. e.g., You could declare that the DFA is limited to 65,536 states, which would yield an overhead of 512 bytes per state, which, to be fair, is still more than the ~256 bytes per state used in Cox's formulation. (This is better than Cox's worst case, but I can't imagine the worst case happening too often.)
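
To illustrate the index-based layout (hypothetical types; a real match-only DFA also needs lazy state construction and more):

```rust
const DEAD: u16 = u16::MAX; // sentinel: no transition on this byte

struct State {
    // One transition per input byte. u16 indices cap the DFA at 65,535
    // usable states but cost only 512 bytes per state.
    next: [u16; 256],
    is_match: bool,
}

struct Dfa {
    states: Vec<State>, // single owner; states refer to each other by index
    start: u16,
}

impl Dfa {
    fn is_match(&self, haystack: &[u8]) -> bool {
        let mut s = self.start;
        if self.states[s as usize].is_match {
            return true; // e.g. a regex matching the empty string
        }
        for &b in haystack {
            s = self.states[s as usize].next[b as usize];
            if s == DEAD {
                return false;
            }
            if self.states[s as usize].is_match {
                return true;
            }
        }
        false
    }
}

fn main() {
    // Tiny hand-built DFA: does the input contain the byte b'a'?
    let mut searching = [0u16; 256];
    searching[b'a' as usize] = 1;
    let dfa = Dfa {
        states: vec![
            State { next: searching, is_match: false },
            State { next: [1u16; 256], is_match: true },
        ],
        start: 0,
    };
    assert!(dfa.is_match(b"xyza"));
    assert!(!dfa.is_match(b"xyz"));
}
```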

On an unrelated note... When I first published regex, I had the phrase "UTF-8 codepoint" in the public docs in several places. Shortly thereafter, someone submitted a pull request changing this to "Unicode codepoint." I hadn't realized the distinction until that point, but the correction was useful because it forced me to think more carefully about Unicode the abstraction and Unicode the implementation. (To get really pedantic, the regex crate operates on Unicode scalar values, since Rust strings do not permit surrogate codepoints!) Because I found this distinction useful, I'd like to pay it forward to you. :-)

RE2 gets away with only caring about 256 values by compiling multi-byte UTF-8 codepoints into mini-automata. (I think. I haven't scrutinized that part of the code enough to fully grasp it.)

This is my understanding as well. I understand it at a high level. I think I'd need to work through an example or two before I'd really get it.

I have one major reservation. In RE2, the number of possible character ranges is bounded above at 256. In the approach I described, the upper bound is the number of Unicode codepoints. Are there use cases that would have many (say, > 1000) character classes and cause the size of the pointer array to blow up? And overall, does this approach seem reasonable?

Hmm. So I think terminology is getting a bit conflated here. The term "character class" has a well-defined meaning in regex land, but here, I think you're using it to refer to a specific implementation detail. (How about, "state character class"?) If so, I think that is a problem. e.g., Consider \pL[a\pL]. In that regex, every character in the Unicode class \pL will need its own state character class. This is quite large!

I haven't been able to come up with a way to work around this without either moving to byte matching or making unacceptable sacrifices in memory usage. :-/

(Go's regexp package---which uses the same character-based engines that we use---has a nice trick similar to this in its one-pass NFA engine. It's effectively the same as your state character class idea, but it's pushed down into each state. So in the main matching loop, it has to do a binary search over the classes---which are ranges of bytes---in order to get the next state. However, it is only able to do this because it has determined that the regex is one pass. Namely, given a position in the input, there is at most 1 possible next state that can lead to a match.)
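
A sketch of that per-state lookup (assumed layout; Go's actual representation differs in the details):

```rust
// Each state stores sorted, disjoint byte ranges with a target state per
// range; together the ranges cover 0..=255, so the lookup always lands.
struct OnePassState {
    // (last_byte_of_range, next_state), sorted by last byte; range i covers
    // everything above the previous range's last byte.
    ranges: Vec<(u8, usize)>,
}

fn next_state(state: &OnePassState, b: u8) -> usize {
    // Binary search for the first range whose end is >= b.
    let i = state.ranges.partition_point(|&(end, _)| end < b);
    state.ranges[i].1
}

fn main() {
    // Three ranges: 0x00-0x60 -> state 0, 0x61-0x7A ('a'-'z') -> 1, rest -> 0.
    let st = OnePassState { ranges: vec![(0x60, 0), (0x7A, 1), (0xFF, 0)] };
    assert_eq!(next_state(&st, b'a'), 1);
    assert_eq!(next_state(&st, b'!'), 0);
}
```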

WillEngler commented 9 years ago

@BurntSushi A lot to chew over here!

For example, consider the \b instruction. It requires one character look-behind. In a byte matching engine, this could be up to 4 bytes. It is Unicode aware in this crate, but it is not in RE2.

I hadn't considered that. The Unicode-unaware behavior seems really unintuitive. I had to go look at RE2 again to convince myself that it would act so strangely.

I would like to preserve this property, not only because it seems correct but because a regression on Unicode support would not be nice. I don't know if it is problematic to implement this in the DFA. If it is, then I would choose "DFAs cannot be used with regexes that have a \b" over "change \b to not be Unicode aware." I'm not entirely sure whether this plays nicely with existing matching engines either. My hope is that it can just do what it does today: decode the previous character at the current input position. If the previous character is not valid UTF-8, then that instruction would simply fail at that position.

I'm in complete agreement.

Hmm. So I think terminology is getting a bit conflated here. The term "character class" has a well-defined meaning in regex land, but here, I think you're using it to refer to a specific implementation detail. (How about, "state character class"?) If so, I think that is a problem. e.g., Consider \pL[a\pL]. In that regex, every character in the Unicode class \pL will need its own state character class. This is quite large!

Overloading "character class" is definitely a bad idea :) . \pL[a\pL] does indeed blow up the idea of state character classes, but not as badly as you claim (I think). As I understand it, there would only be one state character class per consecutive run of letters. So the uppercase Cyrillic letters Я and Д would be in the same state character class because they belong to a run of consecutive Unicode codepoints that this regex treats identically. The one exception is that a would be a separate state character class from b-z. In any event, that's still an enormous number of state character classes (see RSC's diagram of the mini-automaton that comes from compiling \p{Lu}). But the distinction isn't entirely academic, because the same reasoning applies to "byte ranges", RSC's term for groups of bytes that are treated identically in a regex and thus receive one DFA pointer per range.

My third thought is that you should not take my thoughts for gospel.

Of course :)

I am not saying, "to get a DFA implemented, we must switch to byte matching engines." :-) For example, if it is easier to ignore \b for now, then we can simply bar the DFA from running on any regexes with a \b instruction.

On that note, here is how I think I ought to go forward.

  1. I experiment with byte matching as implemented in re2/compile.cc. At first, I blithely ignore the fact that \b won't behave as it should.
  2. Once I have something working, I find a way to make look-arounds Unicode aware again. I don't think that will be too hard (famous last words). I hope that after implementing the naive version, it will be clear how to work it in.

It will be instructive for me to make an experimental feature in the compiler so that I can understand that part of the crate better. Also, it helps that this will be a task with a comparatively low surface area. I can have a small victory to feel chuffed about before tackling the DFA engine proper.

Sound reasonable?

Miscellaneous thoughts:

... the correction was useful because it forced me to think more carefully about Unicode the abstraction and Unicode the implementation ...

Thanks for paying that forward! I always appreciate being corrected on terminology.

Go's regexp package---which uses the same character based engines that we use---has a nice trick similar to this in its one pass NFA engine.

Interesting. I'll have to take a look. The Go implementation might be easier for me to use as a reference than the C++ implementation.

Yes! One thing I would consider if I were you is the choice of whether to actually use a pointer or not. My suspicion is that using pointers might require use of unsafe (or Rc) to manage multiple ownership, since you'll have multiple paths to the same state in the DFA. Instead, you might consider storing the states sequentially in memory (e.g., Vec<State>), and refer to them by index.

Thanks for that suggestion! Tips on how to write effective Rust are always welcome.

BurntSushi commented 9 years ago

I think everything sounds good. If you wanted to try out your state transition character class idea before byte based matching engines, then I'd support that too. (This will get you to DFA-land faster.)

OK, so here are my thoughts on byte based matching engines.

  1. Most of the work is going to be in the compiler. Borrowing ideas from RE2 is good.
  2. There will be some work in the existing matching engines. My hope is that you can replace OneChar and CharRange with ByteRange (same as RE2). I'm worried that \b may still be tricky, but it does seem like my idea above (attempting a decoding) should work.
  3. Case insensitivity is going to jump up and bite you. Right now, all character classes are normalized to their simple case foldings. For example, given (?i)[a-zA-Z], it will be translated to [a-z] with the casei flag set. To match, the input character has to be case folded first. In order to support byte ranges, you can't operate at the Unicode character level, so you'll need to change this normalization process so that all cases are represented. In other words, case folding becomes part of the automaton! It will use more memory, but I'm not that concerned about it. (I frankly shouldn't have designed it this way in the first place.)
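
For concreteness, here's a minimal sketch of what (3) implies, with hypothetical types and an ASCII-only shortcut standing in for the Unicode simple case-folding tables:

```rust
#[derive(Debug, PartialEq)]
struct ClassRange {
    start: char,
    end: char, // inclusive
}

// Expand a class so every case variant is an explicit range, letting the
// casei flag (and match-time folding) disappear from the automaton.
fn fold_into_class(ranges: &[ClassRange]) -> Vec<ClassRange> {
    let mut out = Vec::new();
    for r in ranges {
        out.push(ClassRange { start: r.start, end: r.end });
        // ASCII lowercase only; real code needs the full folding tables.
        if r.start.is_ascii_lowercase() && r.end.is_ascii_lowercase() {
            out.push(ClassRange {
                start: r.start.to_ascii_uppercase(),
                end: r.end.to_ascii_uppercase(),
            });
        }
    }
    // A real implementation would also sort and merge overlapping ranges.
    out
}

fn main() {
    // (?i)[a-z] becomes the two explicit ranges a-z and A-Z.
    let folded = fold_into_class(&[ClassRange { start: 'a', end: 'z' }]);
    assert_eq!(
        folded,
        vec![
            ClassRange { start: 'a', end: 'z' },
            ClassRange { start: 'A', end: 'Z' },
        ]
    );
}
```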

You might choose to ignore (3) and hope that I can do it in a timely manner. :-)

WillEngler commented 9 years ago

Sounds good! Ideally, I'll handle case insensitivity myself. But as for doing things in a timely manner, I'm more likely to be the slow one :)

golddranks commented 9 years ago

Btw, have you checked out the "Regen" regex engine? Homepage: http://sinya8282.github.io/Regen/ Github: https://github.com/sinya8282/regen

It uses JIT compilation and parallelization, but according to this paper [1], even WITHOUT them it manages to outperform RE2 by a considerable margin.

[1] http://www.is.titech.ac.jp/~sassa/papers-written/shinya-regen-postergenkou-PROSYM12.pdf (Sorry, Japanese only, but the abstract, graphs and numbers are understandable.)

WillEngler commented 9 years ago

@golddranks No, this is the first I've heard of it. Thanks for the links!

I would need to take a lot of time to understand their approach because I don't know the first thing about JIT compilation. So I'm going to make a note to myself to revisit this later.

BurntSushi commented 9 years ago

@golddranks Thanks! But I fear that without an English description of the work they've done, it will be hard to benefit from it. JITing and parallelization seem like optimizations that come after initial work on a DFA is done. :-)

WillEngler commented 9 years ago

Progress Report

I'm chugging away at the byte matching engine. The trickiest part has been deciphering laconic bit-twiddling code like this to figure out what approach RE2 took. I've made some decent progress. I have pieces of a module that can take in a character range and spit out byte ranges that are ready to become instructions. So far, that means splitting ranges along the length of their UTF-8 encoding and further splitting them so that each range shares the same leading bytes. It doesn't work yet. I still need to write up tests and do some serious debugging. And some design changes are probably in order.
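
To illustrate the second splitting step, here's a sketch of a hypothetical helper (two-byte encodings only; not the module described above) that cuts a scalar range so each piece shares a single leading byte:

```rust
// For two-byte UTF-8 (U+0080..=U+07FF), scalars with the same value >> 6
// share their leading byte (0xC0 | (value >> 6)), so cut at those boundaries.
fn split_two_byte(start: u32, end: u32) -> Vec<(u32, u32)> {
    assert!((0x80..=0x7FF).contains(&start) && start <= end && end <= 0x7FF);
    let mut out = Vec::new();
    let mut lo = start;
    while lo <= end {
        // Last scalar value that still has the same leading byte as `lo`.
        let hi = ((lo >> 6) << 6) | 0b11_1111;
        out.push((lo, hi.min(end)));
        lo = hi + 1;
    }
    out
}

fn main() {
    // U+0080..=U+0100 splits into C2-led, C3-led, and C4-led pieces, each of
    // which can then compile to a leading byte plus one trailing byte range.
    assert_eq!(
        split_two_byte(0x80, 0x100),
        vec![(0x80, 0xBF), (0xC0, 0xFF), (0x100, 0x100)]
    );
}
```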

There is much left to do after that! Here is the order I want to tackle things:

  1. Fitting a new instruction type into the main compiler loop.
  2. Getting lookarounds to work.
  3. Case folding.
  4. Optimizations. For example, RE2 caches byte suffixes, but I've put that off for now.
BurntSushi commented 9 years ago

@WillEngler Looks great so far! Case folding could get tricky, as you'll need to make modifications to regex-syntax I think. (You will cause existing tests to fail, which will have to be fixed.)

Otherwise, yes, putting off optimizations like caching is good. Make it work. Make it right. Make it fast. :-)

BurntSushi commented 9 years ago

@WillEngler It looks like rsc also wrote a DFA in Go with byte instructions like in RE2. Here's the UTF-8 automaton code: https://github.com/google/codesearch/blob/master/regexp/utf.go

WillEngler commented 9 years ago

@BurntSushi Thanks for bringing that up. I've been banging my head against the C++ version, so it'll be helpful to see it from a different perspective.

WillEngler commented 9 years ago

@BurntSushi I said I was hoping to get

  1. Fitting a new instruction type into the main compiler loop
  2. Getting lookarounds to work.

working by this Sunday. That was pure hubris. Here's my true status:

In RE2, rune* compilation is mostly handled in a big, recursive function with a lot of side effects that mutate the compiler. I was following that approach, but I had a hard time reasoning about all of the state transformations. And when I thought about how I would write tests for it, I cried a little.

So I'm taking a different tack now. The module I'm working on exposes one main function, RuneRange::to_br_tree, which compiles a rune range to an intermediate form, a BRTree (ByteRangeTree). A BRTree is a binary tree where each node either contains a ByteRange or a split into two branches. So it's pretty close to what the compiled instructions will look like.

As a result, the rune compiler module consists mostly of pure and almost-pure functions that should (hopefully) be straightforward for me to test and debug. Also, I like how it keeps the innards of compiler.rs away from rune compilation. The downside, of course, is that the indirection of BRTree sacrifices performance. If the performance hit proves to be intolerable, it shouldn't be too tough to do it rsc's way later.
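
A sketch of the shape being described (hypothetical definitions; the real module's types may differ):

```rust
struct ByteRange {
    start: u8,
    end: u8, // inclusive
}

// A binary tree whose leaves are byte ranges and whose interior nodes are
// splits; a range may carry the tree matching the remaining bytes of a
// multi-byte UTF-8 encoding.
enum BRTree {
    Range(ByteRange, Option<Box<BRTree>>),
    Split(Box<BRTree>, Box<BRTree>),
}

// E.g. the scalar range U+0080..=U+00FF encodes as [C2-C3][80-BF], i.e.
// Range(C2-C3, Some(Box::new(Range(80-BF, None)))).
```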

Now I'm working to get rune -> BRTree working and presentable. When I finish, I think that will be a nice milestone for a code review (if you would be so kind @BurntSushi).

(*) Ken Thompson and friends like calling Unicode codepoints "runes." I like it, but it isn't commonly used. If you think it will cause more confusion than it's worth, I can avoid "rune" in the codebase.

BurntSushi commented 9 years ago

Sounds great. I've also found RE2's compiler difficult to follow. Pulling out the byte range stuff so it can be tested sounds fantastic. As long as performance stays reasonable (even on large Unicode classes), then I'm happy. Make sure to test it on debug builds too. When I first introduced regex-syntax, the case folding code was pretty reasonable on release builds, but it was gratuitously slow on debug builds. So I had to optimize it a bit, but it's still a little slow.

RE rune: I like rune too (being a gopher myself), but the Rust ecosystem unfortunately doesn't use that terminology, so it's probably best to avoid it in the code. Something like cp, or cpoint or codepoint or point or scalar should be fine. (Ignoring the subtle distinction between codepoint and scalar value... sigh)

WillEngler commented 9 years ago

Thanks, I wouldn't have thought to give special attention to debug builds. And roger, I'll get rid of rune.

WillEngler commented 9 years ago

@BurntSushi: Update: getting really close to having a working ScalarRange -> byte_range::Tree compiler.

The structure of the new module is about where I want it. The one big roadblock left is in some helper functions I've written to test it. As a QuickCheck property, I assert that the transformation from a ScalarRange to a byte_range::Tree is invertible. That is, given a ScalarRange, you can convert it to a Tree and then convert the Tree back to the same ScalarRange. QuickCheck is finding examples where my Tree -> ScalarRange function is overflowing the stack. Fortunately, ScalarRange -> Tree (the important direction) isn't causing trouble. Nor is QuickCheck finding examples where the property doesn't hold (so far). Also, unit test coverage is decent.
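
For reference, here's a self-contained sketch of that property in quickcheck style. The `to_tree`/`to_scalar_range` below are stand-ins (the "tree" is just the range split at UTF-8 length boundaries), not the real module's functions:

```rust
use quickcheck::quickcheck;

// Stand-in for the real ScalarRange -> Tree compiler: split a scalar-value
// range at the boundaries where the UTF-8 encoded length changes.
fn to_tree(start: u32, end: u32) -> Vec<(u32, u32)> {
    const LAST_OF_LEN: [u32; 4] = [0x7F, 0x7FF, 0xFFFF, 0x10FFFF];
    let mut out = Vec::new();
    let mut lo = start;
    for &max in &LAST_OF_LEN {
        if lo > end {
            break;
        }
        if lo <= max {
            out.push((lo, end.min(max)));
            lo = end.min(max) + 1;
        }
    }
    out
}

// Stand-in for the inverse: glue the pieces back into one range.
fn to_scalar_range(tree: &[(u32, u32)]) -> Option<(u32, u32)> {
    Some((tree.first()?.0, tree.last()?.1))
}

quickcheck! {
    // The property: ScalarRange -> Tree -> ScalarRange is the identity.
    fn roundtrip(a: char, b: char) -> bool {
        let (s, e) = if a <= b { (a as u32, b as u32) } else { (b as u32, a as u32) };
        to_scalar_range(&to_tree(s, e)) == Some((s, e))
    }
}
```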

BurntSushi commented 9 years ago

@WillEngler Awesome! For Tree -> ScalarRange, you might want to try using an explicit stack if you care enough about passing that test.

Hmm, looking at the code, the Tree data type is scaring me a little. Is that intended to be embedded in Inst? Or is it an intermediate data structure? If the former, my hope was to encode the tree in terms of existing instructions (like Split, which I think is how RE2 does it). In particular, it would replace both the Char and Ranges instructions. If the latter, then that is unfortunately pretty heavyweight, but I think I'd be willing to live with it for now.

BurntSushi commented 9 years ago

@WillEngler Do you have a sense for how big Tree gets?

WillEngler commented 9 years ago

@BurntSushi Thanks for the feedback.

For Tree -> ScalarRange, you might want to try using an explicit stack if you care enough about passing that test.

Yep, I got that done this afternoon. Now Tree::to_scalar_range() is big and ugly, but it works.

the Tree data type is scaring me a little.

Hehehe

Is that intended to be embedded in Inst? Or is it an intermediate data structure? If the former, my hope was to encode the tree in terms of existing instructions (like Split, which is I think how RE2 does it). In particular, it would replace both the Char and Ranges instructions. If the latter, then that is unfortunately pretty heavyweight, but I think I'd be willing to live with it for now.

I intended it as an intermediate data structure. Trying to look at it with fresh eyes, I see what you mean about it being heavyweight. I like the idea of encoding it directly in terms of the existing instructions rather than defining an entirely new type that the compiler needs to worry about and then break down. I'm pretty sure you suggested that before, but it's only really clicking now that I'm knee-deep in the problem.

I would much rather make big changes than go for a solution that is merely "liveable". So if it sounds good by you, I'd like to gut Tree and encode directly into the existing instructions. It won't have been effort wasted; I'm now confident that the fiddly bit-twiddling logic works as intended.

Do you have a sense for how big Tree gets?

It can get real big. More detailed answer forthcoming, but for now I'll say that suffix caching will help a lot.

BurntSushi commented 9 years ago

It can get real big.

I see. I think that's enough of a reason to say, "Yeah, don't use it as an intermediate data structure." (Unless suffix caching significantly bounds the size of the tree.)

With that said, you may find it more satisfying to mush on, get the instruction set updated and work on the matching engines. Then circle back around and fix up the Tree code. Up to you though. :-)

WillEngler commented 9 years ago

I see. I think that's enough of a reason to say, "Yeah, don't use it as an intermediate data structure."

I'm with you. I got wrapped up and couldn't see the forest for the... well, you know.

With that said, you may find it more satisfying to mush on...

As much as I love a good mush, it would drive me nuts to leave this module in a sloppy state. So I'll put off matching for a bit longer.

BurntSushi commented 9 years ago

@WillEngler As a heads up, I've made some progress on this issue. In particular, a crate for generating UTF-8 ranges from scalar values: http://burntsushi.net/rustdoc/utf8_ranges/. I also have a basic DFA construction working (using utf8-ranges), although it doesn't have zero-width assertions or match priority (because they aren't really necessary in this particular context): https://github.com/BurntSushi/fst/blob/master/src/regex/dfa.rs --- Still lots of work to do before it's in good enough shape for regex though.

Also, I know @jneem has been doing a lot of work in this space, including possibly getting word boundaries working in a DFA. I'm not sure what their plans are though!

jneem commented 9 years ago

Yes, I do have word boundaries working in a DFA (and my proof-of-concept fork integrating it into this crate passes all the tests). But I'm taking the offline-compilation approach (maybe similar to the one of @carllerche), and my compilation times are pretty terrible right now, especially after incorporating the utf8_ranges crate to go from a char-based machine to a byte-based machine.

WillEngler commented 9 years ago

Thanks for the heads-up @BurntSushi! (And good timing: I've settled in after some big life changes and I'm able to carve out time for Rust again) I'll check out the progress that you and @jneem have made and try to figure out where I can be of help.

WillEngler commented 9 years ago

Nope, I was wrong. I still don't have the time to make an impact here. Thanks a million @BurntSushi for guiding me. I regret I didn't contribute much in return. I will return to Rust some day!

BurntSushi commented 9 years ago

@WillEngler No regrets, please. :-) You actually made quite a large impact. At least for me personally, I learned a lot from our exchanges because you forced me to give more thought to DFAs than I had previously! I look forward to working with you in the future!