enhancement: should optimize common prefix

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Compile these 2 regexes: "common1|common2", "common(?:1|2)"
2. They should be equivalent to "common[1-2]"

What is the expected output? What do you see instead?
They should have same prog size and full DFA size.
The DFA is the same, but the prog size is larger for the first one.

What version of the product are you using? On what operating system?
7898bbcaf58f tip, Debian Linux 2.6.34 x86_64

Please provide any additional information below.
I think that Regexp::Simplify(), or Prog::Optimize should factor out
common prefix/common suffixes to reduce prog size.
For such a small regex as in this example it won't affect the speed much,
but for larger regexes it might.

original regex: common1|common2
re.Regexp()->ToString(): common1|common2
prog:
15. alt -> 1 | 8
1. byte [63-63] -> 2
8. byte [63-63] -> 9
2. byte [6f-6f] -> 3
9. byte [6f-6f] -> 10
3. byte [6d-6d] -> 4
10. byte [6d-6d] -> 11
4. byte [6d-6d] -> 5
11. byte [6d-6d] -> 12
5. byte [6f-6f] -> 6
12. byte [6f-6f] -> 13
6. byte [6e-6e] -> 7
13. byte [6e-6e] -> 14
7. byte [31-31] -> 16
14. byte [32-32] -> 16
16. match!

prog size: 19, full DFA states: 9
original regex: common(?:1|2)
re.Regexp()->ToString(): common[1-2]
prog:
1. byte [63-63] -> 2
2. byte [6f-6f] -> 3
3. byte [6d-6d] -> 4
4. byte [6d-6d] -> 5
5. byte [6f-6f] -> 6
6. byte [6e-6e] -> 7
7. byte [31-32] -> 8
8. match!

prog size: 11, full DFA states: 9

Original issue reported on code.google.com by edwinto...@gmail.com on 23 Jun 2010 at 8:24

Attachments:

t.cpp

GoogleCodeExporter commented 9 years ago

Some more examples on how merging prefixes/suffixes can reduce NFA size, and on 
optimizing the DFA itself:
http://www.owlnet.rice.edu/~comp412/Lectures/L06DFAMin-1up.pdf

Some more examples of REs that could be minimized/optimized by RE2 but they are 
not:
"(?:a|b)*abb" -> "^b*aa*bb" (11 -> 6 states, minimal would be 4 states; prog 
10->12)
"^a(?:b|c)" (4 states, minimal would be just 2)
"^r0|r1|r2|r3|r4|r5|r6|r7|r8|r9" -> "^r[0-9]" (7 -> 4 states, minimal is just 
2; prog: 34 -> 5)

I understand that RE2 tries to avoid building the full DFA (and instead use the 
DFA as a cache), which is good as in general the DFA could be exponential in 
size.
However if the DFA is small enough it could try to build it and minimize it.

The PDF I linked above describes an interesting idea: instead of building a DFA 
and then minimizing it (using Hopcroft's algorithm) you could use subset 
construction to build the DFA, and do it twice. You would get a minimal DFA. Of 
course this doesn't mean that the resulting DFA cannot be exponential in size, 
but you could stop if you detect exponential blowup/reach memory limit.

Original comment by edwinto...@gmail.com on 23 Jun 2010 at 9:14

GoogleCodeExporter commented 9 years ago

I don't believe RE2 will do anything that involves building the entire DFA.
It's just too expensive and too hard to draw the line.

I do agree that common prefixes should be factored out. 
I wrote some code to do that a few weeks ago but still need
to clean it up a bit before I can integrate it into the standard
RE2 distribution.

It's also important to remember that for RE2's purposes, the
path taken to find the match matters.  common1|common2 
can be optimized into common[12], but common1|.*foo.*|common2
cannot be optimized, because doing so would lose the information
about which match is preferred.

Russ

Original comment by rsc@google.com on 24 Jun 2010 at 5:30

GoogleCodeExporter commented 9 years ago

I think you will find that with the recent changes 
common prefixes are handled much better.

Original comment by rsc@swtch.com on 16 Jul 2010 at 4:57

Changed state: Fixed

tst2005googlecode / re2

enhancement: should optimize common prefix #21