Closed. GoogleCodeExporter closed this issue 9 years ago.
Some more examples of how merging prefixes/suffixes can reduce NFA size, and of
how to optimize the DFA itself:
http://www.owlnet.rice.edu/~comp412/Lectures/L06DFAMin-1up.pdf
Some more examples of REs that RE2 could minimize/optimize but currently does
not:
"(?:a|b)*abb" -> "^b*aa*bb" (11 -> 6 states, minimal would be 4 states; prog
10->12)
"^a(?:b|c)" (4 states, minimal would be just 2)
"^r0|r1|r2|r3|r4|r5|r6|r7|r8|r9" -> "^r[0-9]" (7 -> 4 states, minimal is just
2; prog: 34 -> 5)
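As a rough illustration of the last rewrite (r0|r1|...|r9 into r[0-9]), here is a minimal Python sketch. It is not RE2's code: `factor_common_prefix` is a hypothetical helper, and it assumes every alternative is a plain literal string. It only rewrites when all alternatives share a prefix and differ in exactly one trailing character, so the rewrite cannot change which alternative is preferred.

```python
import os
import re


def factor_common_prefix(alternatives):
    """Rewrite lit1|lit2|... as prefix[chars] when that is clearly safe.

    Hypothetical helper for illustration only. `alternatives` is a list of
    literal strings (not regex fragments).
    """
    # Longest string that every alternative starts with.
    prefix = os.path.commonprefix(alternatives)
    tails = [alt[len(prefix):] for alt in alternatives]

    # Only handle the simple case: a shared prefix plus one extra character.
    if not prefix or any(len(tail) != 1 for tail in tails):
        return "|".join(re.escape(alt) for alt in alternatives)

    # Collapse the differing characters into a single character class.
    chars = "".join(re.escape(tail) for tail in sorted(set(tails)))
    return re.escape(prefix) + "[" + chars + "]"


if __name__ == "__main__":
    print(factor_common_prefix(["r%d" % i for i in range(10)]))
    # prints: r[0123456789]
```

The output uses an explicit character list rather than a range such as [0-9]; collapsing consecutive characters into ranges would be a further step that this sketch leaves out.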
I understand that RE2 tries to avoid building the full DFA (and instead uses the
DFA as a cache), which is good, since in general the DFA could be exponential in
size.
However, if the DFA is small enough, RE2 could try to build it fully and minimize it.
The PDF I linked above describes an interesting idea: instead of building a DFA
and then minimizing it (using Hopcroft's algorithm), you could run the subset
construction twice, reversing the automaton before each pass (Brzozowski's
algorithm). The result is a minimal DFA. Of course, that minimal DFA can still be
exponential in size, but you could stop as soon as you detect exponential blowup
or reach a memory limit.
Original comment by edwinto...@gmail.com
on 23 Jun 2010 at 9:14
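The "subset construction twice" idea from the comment above is usually known as Brzozowski's algorithm: reverse the automaton, determinize, reverse again, determinize again. The following is a minimal Python sketch of it, not anything taken from RE2. The NFA encoding (a dict from (state, symbol) to a set of target states, with no epsilon edges), the function names, and the `max_states` cutoff are all assumptions made for illustration.

```python
from collections import deque


def reverse(trans, starts, accepts):
    """Flip every edge and swap the start and accept sets."""
    rev = {}
    for (src, sym), dsts in trans.items():
        for dst in dsts:
            rev.setdefault((dst, sym), set()).add(src)
    return rev, set(accepts), set(starts)


def determinize(trans, starts, accepts, alphabet, max_states=1000):
    """Subset construction over reachable subsets only.

    The empty subset (dead state) is left out, so the result is a partial DFA.
    Raises MemoryError once more than `max_states` subsets are needed.
    """
    start = frozenset(starts)
    ids = {start: 0}          # subset -> DFA state number
    dfa = {}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        for sym in alphabet:
            nxt = frozenset(d for s in cur for d in trans.get((s, sym), ()))
            if not nxt:
                continue
            if nxt not in ids:
                if len(ids) >= max_states:
                    raise MemoryError("DFA too large; giving up")
                ids[nxt] = len(ids)
                queue.append(nxt)
            dfa[(ids[cur], sym)] = {ids[nxt]}
    dfa_accepts = {i for subset, i in ids.items() if subset & set(accepts)}
    return dfa, {0}, dfa_accepts, len(ids)


def brzozowski(trans, starts, accepts, alphabet):
    """Reverse, determinize, reverse, determinize."""
    t, s, a = reverse(trans, starts, accepts)
    t, s, a, _ = determinize(t, s, a, alphabet)
    t, s, a = reverse(t, s, a)
    return determinize(t, s, a, alphabet)


if __name__ == "__main__":
    # Hand-built NFA for (a|b)*abb, treated as a whole-string match.
    nfa = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
    dfa, dfa_starts, dfa_accepts, n = brzozowski(nfa, {0}, {3}, "ab")
    print(n)  # prints: 4
```

For this input the result has 4 states, which agrees with the minimal size quoted for "(?:a|b)*abb" earlier in the thread. Note that the cutoff applies to both determinization passes, and the intermediate DFA from the first pass can be large even when the final one is small.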
I don't believe RE2 will do anything that involves building the entire DFA.
It's just too expensive and too hard to draw the line.
I do agree that common prefixes should be factored out.
I wrote some code to do that a few weeks ago but still need
to clean it up a bit before I can integrate it into the standard
RE2 distribution.
It's also important to remember that for RE2's purposes, the
path taken to find the match matters. common1|common2
can be optimized into common[12], but common1|.*foo.*|common2
cannot be optimized, because doing so would lose the information
about which match is preferred.
Russ
Original comment by rsc@google.com
on 24 Jun 2010 at 5:30
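The point about match preference in the comment above can be seen with a small experiment. This sketch uses Python's `re` module as a stand-in, on the assumption that its leftmost-first alternation behaves like RE2's default (non-longest-match) mode for these patterns; it does not call RE2 itself. The helper name `first_match` and the sample inputs are made up for illustration.

```python
import re


def first_match(pattern, text):
    """Return the text matched at the start of `text`, or None."""
    m = re.match(pattern, text)
    return m.group() if m else None


if __name__ == "__main__":
    # Safe rewrite: the two alternatives are adjacent, so merging them
    # cannot change which one wins.
    for text in ["common1", "common2 foo", "common3"]:
        assert first_match(r"common1|common2", text) == first_match(r"common[12]", text)

    # Unsafe rewrite: .*foo.* sits between the two alternatives, so moving
    # common2 ahead of it changes the preferred match.
    text = "common2 foo"
    print(first_match(r"common1|.*foo.*|common2", text))  # prints: common2 foo
    print(first_match(r"common[12]|.*foo.*", text))       # prints: common2
```

In the original pattern, common1 fails on this input and .*foo.* is tried before common2, so the whole string is matched. After merging, common[12] is tried first and wins with a shorter match, which is exactly the lost preference information described above.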
I think you will find that with the recent changes
common prefixes are handled much better.
Original comment by rsc@swtch.com
on 16 Jul 2010 at 4:57
Original issue reported on code.google.com by
edwinto...@gmail.com
on 23 Jun 2010 at 8:24