softdevteam / grmtools

Rust grammar tool libraries and binaries

error when using fake UMINUS token in %prec directive #399

Closed yuanweixin closed 1 year ago

yuanweixin commented 1 year ago

In yacc, it is possible to declare a fake token with the highest precedence,

/* ... above are other precedence decls with lower prec */
%left UMINUS

then in the grammar, use it to give highest precedence to a unary minus expression:

"-" exp %prec UMINUS

When I try this with grmtools, it complains that "%prec not followed by token name". If I try to fix it by introducing the token with %token UMINUS, I get:

Error: these tokens are referenced in the grammar but not defined in the lexer:
  UMINUS

So it seems the intent here is to force a 1-1 correspondence between the lexer's and parser's sets of tokens. But that leaves no leeway for this feature.

I checked the test cases for the prec overrides, and they don't seem to cover this usage.

Since this is not documented in the yacc-compatibility section, I assume this is a bug. And I'd argue it should be a feature, because without it I would have to restructure my grammar. Yacc compatibility is one of the reasons that drew me to grmtools.

Complete example: .l file

%x COMMENT

%%
[\n] ;
[ \t\r\n] ;
import "IMPORT"
primitive "PRIMITIVE"
class "CLASS"
extends "EXTENDS"
method "METHOD"
new "NEW"
type "TYPE"
var "VAR"
function "FUNCTION"
break "BREAK"
of "OF"
end "END"
in "IN"
nil "NIL"
let "LET"
do "DO"
to "TO"
for "FOR"
while "WHILE"
else "ELSE"
then "THEN"
if "IF"
array "ARRAY"
:= "ASSIGN"
\| "OR"
& "AND"
>= "GE"
\<= "LE"
> "GT"
\< "LT"
= "EQ"
\<> "NEQ"
/ "DIVIDE"
\* "TIMES"
- "MINUS"
\+ "PLUS"
\. "DOT"
\} "RBRACE"
\{ "LBRACE"
\] "RBRACK"
\[ "LBRACK"
\) "RPAREN"
\( "LPAREN"
: "COLON"
; "SEMICOLON"
, "COMMA"
"(?:\\\\|\\"|\\a|\\b|\\f|\\n|\\r|\\t|\\v|\\[0-3][0-7][0-7]|\\x[A-Fa-f0-9][A-Fa-f0-9]|[[:print:]])*" "STRING"
<COMMENT,INITIAL>/\*    <+COMMENT>;
<COMMENT>.              ;
<COMMENT>\n             ;
<COMMENT>\*/            <-COMMENT>;
[0-9]+ "INT"
([a-zA-Z][a-zA-Z_0-9]*|_main) "ID"

.y file

%start program
%right "OF"
%nonassoc "DO" "THEN"
%nonassoc "ELSE"
%nonassoc "ASSIGN"
%left "OR"
%left "AND"
%nonassoc "EQ" "NEQ" "LT" "LE" "GT" "GE"
%left "PLUS" "MINUS"
%left "TIMES" "DIVIDE"
%left "UMINUS"

%%
program :
    exp
  | chunks
  ;

/* === Expressions. === */
exps :
    exp 
    | exps "SEMICOLON" exp 
    ;

exp :
  /* Literals. */
    "NIL"
  | "INT"
  | "STRING"

  /* Array and record creations. */
  | type_id "LBRACK" exp "RBRACK" "OF" exp
  | type_id "LBRACE" field_value_list "RBRACE"

  /* Variables, field, elements of an array. */
  | lvalue

  /* Function call. */
  | "ID" "LPAREN" args "RPAREN"

  /* Operations. */
  | "MINUS" exp %prec "UMINUS"
  | exp "OR" exp
  | exp "AND" exp 
  | exp "EQ" exp 
  | exp "NEQ" exp 
  | exp "LT" exp 
  | exp "LE" exp 
  | exp "GT" exp 
  | exp "GE" exp 
  | exp "PLUS" exp 
  | exp "MINUS" exp 
  | exp "TIMES" exp 
  | exp "DIVIDE" exp 
  | "LPAREN" exps "RPAREN"

  /* Assignment. */
  | lvalue "ASSIGN" exp

  /* Control structures. */
  | "IF" exp "THEN" exp
  | "IF" exp "THEN" exp "ELSE" exp
  | "WHILE" exp "DO" exp
  | "FOR" "ID" "ASSIGN" exp "TO" exp "DO" exp
  | "BREAK"
  | "LET" chunks "IN" exps "END"
  ;

field_value_list : /* Empty */
    | "ID" "EQ" exp 
    | field_value_list "COMMA" "ID" "EQ" exp 
    ;

args : /* Empty */
    | args_helper
    ;

args_helper :
    exp 
    | args_helper "COMMA" exp
    ;

lvalue :
    "ID"
  /* Record field access. */
  | lvalue "DOT" "ID"
  /* Array subscript. */
  | lvalue "LBRACK" exp "RBRACK"
  ;

/* === Chunks of declarations. === */
chunks : /* Empty */
  | chunk
  | chunks chunk
  ;

chunk :
  tydec 
  | fundec
  | vardec
  | "IMPORT" "STRING"
  ;

/* Variable declaration. */
vardec : 
    "VAR" "ID" "ASSIGN" exp 
    | "VAR" "ID" "COLON" type_id "ASSIGN" exp 
    ;

/* Type declaration. */
tydec : "TYPE" "ID" "EQ" ty 
;

/* Function declaration. */
fundec :
    "FUNCTION" "ID" "LPAREN" tyfields "RPAREN" "EQ" exp
    "FUNCTION" "ID" "LPAREN" tyfields "RPAREN" "COLON" type_id "EQ" exp
  | "PRIMITIVE" "ID" "LPAREN" tyfields "RPAREN" 
  | "PRIMITIVE" "ID" "LPAREN" tyfields "RPAREN" "COLON" type_id 
  ;

/* === Types. === */
ty :
   /* Type alias. */
     type_id
   /* Record type definition. */
   | "LBRACE" tyfields "RBRACE"
   /* Array type definition. */
   | "ARRAY" "OF" type_id
   ;

tyfields : /* Empty */
    | tyfields_helper
    ;

tyfields_helper :
    "ID" "COLON" type_id 
    | tyfields_helper "COMMA" "ID" "COLON" type_id 
    ;

type_id : "ID" ;
%%
ratmice commented 1 year ago

One thought that comes to mind is to include a rule in your lex file such as [^.] "UMINUS", probably making sure that it is at the end, which should in theory give you a token that never matches anything.

Edit: The reason I say "in theory" is that I believe there is a slight difference between lex's and grmtools' treatment of . with respect to newlines: grmtools includes newlines in ., while lex does not. So I believe it would never match anything on grmtools, but on lex it might match a newline unless newlines are already matched elsewhere, as is the case here where you match whitespace early in the lex file. Given those stipulations, I believe it should behave the same between grmtools and lex.

yuanweixin commented 1 year ago

One thought that comes to mind is to include a rule in your lex file such as [^.] "UMINUS", probably making sure that it is at the end, which should in theory give you a token that never matches anything.

Good idea. I am so stuck in the yacc mindset that it is hard for me to switch and think outside the box.

For posterity, I had to do 2 things for the error to go away: first put [^.] "UMINUS" in the .l file, then add a %token "UMINUS" declaration in the .y file.

Edit: The reason I say "in theory" is that I believe there is a slight difference between lex's and grmtools' treatment of . with respect to newlines: grmtools includes newlines in ., while lex does not.

It looks like the grmtools lexer is implemented as a loop over the list of regexes that keeps track of the longest match. And the Rust regex docs say . matches "any character except new line (includes new line with s flag)", so [^.] should match a newline but no other characters. So, since my .l file already matches [ \t\r\n], it should be safe to use [^.] as a catch-all "never match" rule.

Edit: For posterity, I had to do 2 things for the error to go away: first put - "UMINUS" in the .l file, then add a %token "UMINUS" declaration in the .y file.
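
For illustration, here is a minimal Rust sketch of the longest-match loop described above. This is not grmtools' actual code; it just assumes lex-style behaviour where, for equal-length matches, the earliest rule in the list wins:

use regex::Regex;

// Try each rule's (anchored) regex at the start of the remaining input and
// keep the rule with the longest match; ties go to the earliest rule.
fn longest_match(rules: &[(Regex, &'static str)], input: &str) -> Option<(usize, &'static str)> {
    let mut best: Option<(usize, &'static str)> = None;
    for (re, tok) in rules {
        if let Some(m) = re.find(input) {
            // Strictly-greater comparison: an equal-length match never
            // displaces an earlier rule.
            if best.map_or(true, |(len, _)| m.end() > len) {
                best = Some((m.end(), *tok));
            }
        }
    }
    best
}

fn main() {
    let rules = vec![
        (Regex::new(r"^\+").unwrap(), "PLUS"),
        (Regex::new(r"^-").unwrap(), "MINUS"),
        (Regex::new(r"^[0-9]+").unwrap(), "INT"),
        (Regex::new(r"^-").unwrap(), "UMINUS"), // later duplicate pattern: never wins the tie
    ];
    assert_eq!(longest_match(&rules, "-1"), Some((1, "MINUS")));
    assert_eq!(longest_match(&rules, "10-2"), Some((2, "INT")));
}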

ratmice commented 1 year ago

It looks like the grmtools lexer is implemented as a loop over the list of regexes that keeps track of the longest match. And the Rust regex docs say . matches "any character except new line (includes new line with s flag)", so [^.] should match a newline but no other characters. So, since my .l file already matches [ \t\r\n], it should be safe to use [^.] as a catch-all "never match" rule.

Whether dot accepts newlines is configurable in regex via https://docs.rs/regex/latest/regex/struct.RegexBuilder.html#method.dot_matches_new_line which is called here: https://github.com/softdevteam/grmtools/blob/master/lrlex/src/lib/lexer.rs#L59. So I really think it should match no characters on grmtools, even if you do not have a token matching newline.
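
As a quick standalone check of what that builder option changes (this is just the regex crate on its own, not grmtools code):

use regex::RegexBuilder;

fn main() {
    // By default `.` does not match a newline...
    let dot_default = RegexBuilder::new(".").build().unwrap();
    assert!(!dot_default.is_match("\n"));

    // ...but it does once dot_matches_new_line(true) is set, which is the
    // option the lrlex call linked above sets.
    let dot_newline = RegexBuilder::new(".")
        .dot_matches_new_line(true)
        .build()
        .unwrap();
    assert!(dot_newline.is_match("\n"));
}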

I didn't see this listed in the lex compatibility section of the docs (and I'm working solely off of memory for lex's dot/newline behavior), but maybe this is something that should be documented there. I'll try to test lex out and then open a PR for the docs change if my memory is correct. In that PR perhaps we should investigate whether there is any difficulty in adding a CTLexerBuilder config option for the dot/newline behavior.

ratmice commented 1 year ago

Testing flex confirmed that [.] does not match a newline; however, it also seems to show that [^.], rather than matching no characters, matches any character including newlines, instead of only newlines. This seems like odd/unexpected behavior to me, so at the moment I don't see a good way with flex to get a token which never matches. Perhaps I'm making a silly mistake?

%%
[^.]    { printf("match: %s %d\n", yytext, yytext[0] == '\n'); }
%%
int yywrap(void) { return 1; }
int main(int argc, char **argv) { yylex(); return 0; }
ltratt commented 1 year ago

AFAIK, lex/flex predate the "semi-standardisation" of regexes that occurred after Perl-style regexes started to dominate. It's why things like vim, Unix's re_format, and so on have what seem to us to be "weird" regex formats, each a little different.

I did hope -- though I know I won't get around to it! -- that someone might implement a "proper" lex/flex compatibility mode for lrlex that implements that particular style of regexes. It's not that I think it's a particularly good format, but it would mean that we could move from "we do whatever Rust's regex engine does" to "we run lex files the same as any other lex".

In the specific case we have here, I'm a bit neutral about whether . should match newlines or not. I lean towards thinking that both behaviours have surprises, and that breaking backwards compatibility just to opt into another surprise might not be worth it, but I can be convinced otherwise.

yuanweixin commented 1 year ago

Testing flex confirmed that [.] does not match a newline; however, it also seems to show that [^.], rather than matching no characters, matches any character including newlines, instead of only newlines. This seems like odd/unexpected behavior to me, so at the moment I don't see a good way with flex to get a token which never matches. Perhaps I'm making a silly mistake?

%%
[^.]    { printf("match: %s %d\n", yytext, yytext[0] == '\n'); }
%%
int yywrap(void) { return 1; }
int main(int argc, char **argv) { yylex(); return 0; }

I think . is treated as a normal character inside character classes. For flex you could specify [^\x00-\xFF] to negate everything.
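
For what it's worth, a quick check with the Rust regex crate on its own (not via grmtools) agrees that . inside a character class is a literal dot:

use regex::Regex;

fn main() {
    // Inside a character class, `.` is just the literal '.' character.
    let dot_class = Regex::new(r"^[.]$").unwrap();
    assert!(dot_class.is_match("."));
    assert!(!dot_class.is_match("a"));
}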

In my usage, instead of using a catch-all like [^.], I think I can just use - for UMINUS: since - is already matched by the earlier MINUS rule, UMINUS should never match.

ratmice commented 1 year ago

@yuanweixin Nice! That sounds like the best option and avoids all of these compatibility woes.

@ltratt Yeah, I probably won't get around to implementing a fully lex-compatible regex engine either (I'm fond of being able to just use Rust's regex instead). I'm not really too keen on changing the default newline behavior either. I guess I was just entertaining the idea of exposing RegexBuilder options to CTLexerBuilder, which could increase compatibility in this regard without much effort. But with the above seemingly sorted, there might not be anyone who needs it. Anyway, I'll happily post a patch for that if it seems useful.

ltratt commented 1 year ago

I guess I was just entertaining the idea of exposing RegexBuilder options to CTLexerBuilder.

That could be interesting!

ltratt commented 1 year ago

Closing this in favour of #400.