jflex-de / jflex

The fast scanner generator for Java™ with full Unicode support
http://jflex.de

Inconsistent rules matching #1100

Closed nd closed 1 year ago

nd commented 1 year ago

I have a case where the lexer behaves differently depending on the input, and I think this might be a bug.

The lexer has an option (hashStartsComment) to treat '#' either as a separate token or as the start of an end-of-line comment.

When # is encountered and it should start a comment, I push it back so that it is included in the comment, and enter a state for the comment.

What I observe is that the lexer works fine for lines like #comment, but hangs on a line that contains only a lone #.

My understanding is that after I push the # back, the same rule for "#" matches again, which causes an endless loop. This is fine.

But why doesn't it enter the endless loop for a #comment line? It looks like a bug.

package de.jflex.example.standalone;

%%

%public
%class Subst
%standalone

%unicode

%{
  // tokens:
  static int HASH = 1;
  static int COMMENT = 2;
  static int CHAR = 3;

  boolean hashStartsComment = true;
%}

%state CONFIGURABLE_EOL_COMMENT

%%

"#"                                  { if (hashStartsComment) {yypushback(1); yybegin(CONFIGURABLE_EOL_COMMENT);} else { return HASH; } }
<CONFIGURABLE_EOL_COMMENT> .[^\r\n]* { yybegin(YYINITIAL); return COMMENT; }
.                                    { return CHAR; }

lsf37 commented 1 year ago

It looks to me like this is exactly the correct behaviour.

According to this spec, when the scanner encounters a line with a single # input, the first rule matches, the # is pushed back, and the state CONFIGURABLE_EOL_COMMENT is entered, and nothing is returned, so the scanner continues to match input. The next input is again #, because it was pushed back into the stream. Because we are in state CONFIGURABLE_EOL_COMMENT, two rules can now match. For a line with a single #, both have the same match length, so the rule with higher priority (earlier in the file) is the one that is chosen. This rule again pushes back #, etc.

To break the cycle, you can either put the <CONFIGURABLE_EOL_COMMENT> rule first, or guard the "#" rule with <YYINITIAL> so that it is not available in state <CONFIGURABLE_EOL_COMMENT>.
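The selection logic described above (longest match wins, ties go to the rule that appears earlier in the spec) can be sketched in plain Java. This is only a toy model, not JFlex's implementation: the class `RulePriorityDemo`, its `Rule` type, and the methods `pick`, `original`, and `guarded` are hypothetical names made up for illustration, and `java.util.regex` stands in for the generated DFA on the assumption that greedy matching yields the longest match for these three simple patterns.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy model of rule selection: longest match wins, ties go to the earlier rule. */
class RulePriorityDemo {
    static final String YYINITIAL = "YYINITIAL";
    static final String COMMENT = "CONFIGURABLE_EOL_COMMENT";

    /** One lexer rule (hypothetical helper type, not part of JFlex). */
    static final class Rule {
        final String state;     // null: rule has no state prefix, so it is active in every inclusive state
        final Pattern pattern;

        Rule(String state, String regex) {
            this.state = state;
            this.pattern = Pattern.compile(regex);
        }
    }

    /** Returns the index of the rule chosen at the start of input, or -1 if none matches. */
    static int pick(List<Rule> rules, String state, String input) {
        int best = -1;
        int bestLen = 0;
        for (int i = 0; i < rules.size(); i++) {
            Rule r = rules.get(i);
            if (r.state != null && !r.state.equals(state)) {
                continue; // rule is guarded by a different state
            }
            Matcher m = r.pattern.matcher(input);
            // Strict '>' keeps the earlier rule when match lengths are equal.
            if (m.lookingAt() && m.end() > bestLen) {
                best = i;
                bestLen = m.end();
            }
        }
        return best;
    }

    /** Rules in the order of the spec from the question. */
    static List<Rule> original() {
        return Arrays.asList(
            new Rule(null, "#"),
            new Rule(COMMENT, ".[^\\r\\n]*"),
            new Rule(null, "."));
    }

    /** Same rules, with the "#" rule guarded by <YYINITIAL>. */
    static List<Rule> guarded() {
        return Arrays.asList(
            new Rule(YYINITIAL, "#"),
            new Rule(COMMENT, ".[^\\r\\n]*"),
            new Rule(null, "."));
    }

    public static void main(String[] args) {
        // After yypushback(1) the scanner is in the comment state and sees the # again.
        System.out.println(pick(original(), COMMENT, "#"));        // 0: the "#" rule again, so it loops
        System.out.println(pick(original(), COMMENT, "#comment")); // 1: the comment rule has the longer match
        System.out.println(pick(guarded(), COMMENT, "#"));         // 1: the "#" rule is no longer active here
    }
}
```

In this model the lone `#` selects rule 0 again in the comment state because all candidate matches have length 1, while `#comment` selects the comment rule because its match is longer. Guarding the `"#"` rule removes it from the candidates in the comment state, which is the second of the two fixes suggested above.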

nd commented 1 year ago

Thanks, I didn't know that match length matters. It indeed explains the behavior.