skvadrik / re2c

Lexer generator for C, C++, Go and Rust.
https://re2c.org
Other
1.09k stars 169 forks source link

very slow DFA construction (resulting in a very large DFA) #128

Closed skvadrik closed 8 years ago

skvadrik commented 8 years ago

Consider the following code (slow.re):

/*!re2c
    [ac]{0,12} [a] [ac]{0,12} {}
*/

It takes re2c a long time to compile this code and the generated lexer is very large:

$ time re2c slow.re > slow.c && stat -c '%s' slow.c

real    0m2.764s
user    0m2.743s
sys     0m0.022s
1384716

Increasing repetition counter (any of the two, or both):

/*!re2c
    [ac]{0,14} [a] [ac]{0,14} {}
*/

Causes insane increase of time and size:

$ time re2c slow.re > slow.c && stat -c '%s' slow.c

real    1m54.837s
user    1m54.733s
sys     0m0.120s
5627102

If we minimize the generated DFA, it would get reasonably small. (I experimented with minimization a bit; the number of states is reduced by several orders of magnitude and the minimization itself takes reasonable time, which suggests that the slowdown might be caused by inefficiency in FSA determinisation).

skvadrik commented 8 years ago

Assuming the same example:

/*!re2c
    [ac]{0,14} [a] [ac]{0,14} {}
*/

Commit https://github.com/skvadrik/re2c/commit/b98997b6fed91b6c37b1517a1aae096fb8c770fc reduces the time of DFA construction from ~115s to ~1.2s:

$ time ./re2c slow.re > slow.c && stat -c '%s' slow.c

real    0m1.139s
user    0m1.073s
sys     0m0.065s
5646703

The generated DFA is still too big and should be minimized.

skvadrik commented 8 years ago

DFA size is reduced by commit https://github.com/skvadrik/re2c/commit/1136f2fb007461fc0c720fb6a1a9f58efed691e1 that adds DFA minimization.

$ time ./re2c slow.re > slow.c && stat -c '%s' slow.c

real    0m0.732s
user    0m0.684s
sys     0m0.048s
15078
skvadrik commented 8 years ago

The intermediate DFA (the one that is constructed by NFA determinization and further reduced by DFA minimization) is still large. We'll need incremental minimization to fully solve pathological cases like this. Closing the bug until someone encounters a real-world example.