jflex-de / jflex

The fast scanner generator for Java™ with full Unicode support
http://jflex.de
Other
586 stars 114 forks source link

Nested Bounded Repetitions causes "Illegal Argument Exception" #952

Closed hellotommmy closed 1 year ago

hellotommmy commented 2 years ago

Running jflex on the following code snippet,

%%

%class search
%standalone

%% 
((a|b )*b.{10}){3} {System.out.printf("*** found match");}

And I get the error:

147972 states before minimization, 79107 states in minimized DFA
Old file "search.java" saved as "search.java~"
Writing code to "search.java"

Unexpected exception encountered. This indicates a bug in JFlex.
Please consider filing an issue at http://github.com/jflex-de/jflex/issues/new

character value expected
java.lang.IllegalArgumentException: character value expected
        at jflex.generator.PackEmitter.emitUC(PackEmitter.java:105)
        at jflex.generator.CountEmitter.emit(CountEmitter.java:116)
        at jflex.generator.Emitter.emitDynamicInit(Emitter.java:530)
        at jflex.generator.Emitter.emit(Emitter.java:1369)
        at jflex.generator.LexGenerator.generate(LexGenerator.java:115)
        at jflex.Main.generate(Main.java:320)
        at jflex.Main.main(Main.java:336)

As suggested, I report this error.

Can I also ask a question together with this issue report: I understand that jflex uses the Thompson construction to create an NFA first, and then determinises it, and minimises the DFA. I am wondering if these are the main methods for making jflex fast, or are there any other optimizations involved? Are these methods documented somewhere other than the user manual? Can someone give a pointer if possible?

Thanks a lot!

lsf37 commented 2 years ago

Thanks for the report, will look into it.

That looks like a very high number of states for this small expression. Is this the output for the reduced spec or for a larger spec?

The speed is based in a large part on the execution engine being based on a DFA. It doesn't matter too much by which way we arrive at the DFA (it does matter for generator speed, but not runtime speed).

The other part of the speed is that the DFA implementation in the runtime engine is reasonably highly optimised. Those latter kinds of optimisations did make a difference (e.g. compared to the earlier tool JLex), and are not really documented much. They are mostly based on things like trying to make it easy for the JVM to keep values in registers and in the cache, which are all things that can easily go out of date as JVMs develop, so it is quite possible that there is additional optimisation possible, because I haven't done much to the runtime engine in the last few years.

lsf37 commented 1 year ago

To answer my own question from above: this is the minimal spec, it does indeed generate a DFA with 79k states. This is likely the cause of the exception, i.e. there is something going wrong in the table output.

lsf37 commented 1 year ago

What is wrong in the table output is that the table is encoded as a Java string to circumvent java class size restrictions (other methods would have failed earlier). Java strings have 16 bit characters, which means that state values > 0xFFFF are outside the permitted value range. This implicitly limits the max scanner size to 64k states.

Usually 64k states is very hard to hit, but DFA size is known to be potentially exponential in the regexp size, and this is such a (rare) case (the mix of * and longish fixed-length repetition blows up the size).

It's probably better to try to rewrite the expression to lead to a smaller DFA, but it'd be reasonably easy to extend the encoding such that larger sizes are possible without major performance impact.