softdevteam / grmtools

Rust grammar tool libraries and binaries
Other
494 stars 32 forks source link

Trying to replicate `tinylang.yy` and `tinylang.l` syntax with `nimbleparse` #418

Closed jymchng closed 10 months ago

jymchng commented 10 months ago

Hi, as an exercise to learn more on how to use this excellent crate, I am writing to replicate tinylang.yy and tinylang.l from https://github.com/PacktPublishing/Learn-LLVM-12/tree/master/Chapter04/bisonflex.

Here is my attempt:

tinylang.l

%%
[a-zA-Z_0-9]* "ID"
[0-9] "DIGIT"
[0-9]+ "DIGITS"
[0-9A-F] "HEXDIGIT"
[0-9A-F]+ "HEXDIGITS"
[ \t\r] "SPACE"

\+ "PLUS"
- "MINUS"
\* "MULT"
:= "COLONEQUAL"
. "PERIOD"
, "COMMA"
; "SEMI"
: "COLON"
= "EQUAL"
# "HEX"
\<= "LTEQ"
\>= "GTEQ"
\< "LT"
> "GT"
\( "L_PAREN"
\) "R_PAREN"
AND "AND"
BEGIN "BEGIN"
CONST "CONST"
DIV "DIV"
DO "DO"
END "END"
ELSE "ELSE"
FROM "FROM"
IF "IF"
IMPORT "IMPORT"
MOD "MOD"
MODULE "MODULE"
NOT "NOT"
OR "OR"
PROCEDURE "PROCEDURE"
RETURN "RETURN"
THEN "THEN"
VAR "VAR"
WHILE "WHILE"

tinylang.y

%start goal

%epp ID "<id>"
%epp DIGIT "<digit>"
%epp HEXDIGIT "<hexdigit>"
%epp DIGITS "<digits>"
%epp HEXDIGITS "<hexdigits>"

%epp PLUS "+"
%epp MINUS "-"
%epp MULT "*"
%epp COLONEQUAL ":="
%epp PERIOD "."
%epp COMMA ","
%epp SEMI ";"
%epp COLON ":"
%epp EQUAL "="
%epp HEX "#"
%epp LTEQ "<="
%epp GTEQ ">="
%epp LT "<"
%epp GT ">"

%epp AND "AND"
%epp BEGIN "BEGIN"
%epp CONST "CONST"
%epp DIV "DIV"
%epp DO "DO"
%epp END "END"
%epp ELSE "ELSE"
%epp FROM "FROM"
%epp IF "IF"
%epp IMPORT "IMPORT"
%epp MOD "MOD"
%epp MODULE "MODULE"
%epp NOT "NOT"
%epp OR "OR"
%epp PROCEDURE "PROCEDURE"
%epp RETURN "RETURN"
%epp THEN "THEN"
%epp VAR "VAR"
%epp WHILE "WHILE"

%%
goal: compilationUnit;
compilationUnit
  : "MODULE" "ID" "SEMI" imports block "ID" "PERIOD" ;
imports : %empty | import imports ;
import
  : "FROM" "ID" "IMPORT" identList "SEMI"
  | "IMPORT" identList "SEMI" ;
block : declarations blockBody "END" ;
declarations : %empty | declaration declarations ;
declaration
  : "CONST" constantDeclaration
  | "VAR" variableDeclaration
  | procedureDeclaration "SEMI" ;
constantDeclaration
  : %empty
  | "ID" "EQUAL" expression "SEMI" constantDeclaration ;
variableDeclaration
  : %empty
  | identList "COLON" qualident "SEMI" variableDeclaration ;
procedureDeclaration
  : "PROCEDURE" "ID" formalParameters "SEMI"
    block "ID" ;
formalParameters
  : %empty | "L_PAREN" formalParameterList "R_PAREN" resultType ;
formalParameterList
  : %empty | formalParameter "SEMI" formalParameterList ;
formalParameter
  : varParam identList "COLON" qualident ;
varParam : %empty | "VAR" ;
resultType : %empty | "COLON" qualident ;
blockBody : %empty | "BEGIN" statementSequence ;
statementSequence
  : statement "SEMI" statementSequence | statement ;
statement
  : qualident "COLONEQUAL" expression
  | qualident actualParams
  | ifStatement | whileStatement | "RETURN" returnExpr ;
actualParams  : %empty | "L_PAREN" actualParamList "R_PAREN" ;
actualParamList : %empty | expList ;
returnExpr : %empty | expression ;
ifStatement
  : "IF" expression "THEN" statementSequence
    elseStatement "END" ;
elseStatement
  : %empty | "ELSE" statementSequence ;
whileStatement
  : "WHILE" expression "DO" statementSequence "END" ;
expList
  : expression "COMMA" expList | expression ;
expression
  : prefixedExpression relation prefixedExpression
  | prefixedExpression ;
relation
  : "EQUAL" | "HASH" | "LESS" | "LESSEQUAL" | "GREATER" | "GREATEREQUAL" ;
prefixedExpression : prefixOperator simpleExpression ;
simpleExpression
  : term addOperator simpleExpression | term ;
prefixOperator : %empty | "PLUS" | "MINUS";
addOperator : "PLUS" | "MINUS" | "OR" ;
term : factor mulOperator term | factor ;
mulOperator : "STAR" | "SLASH" | "DIV" | "MOD" | "AND" ;
factor
  : "DIGITS" | "L_PAREN" expression "R_PAREN" | "NOT" factor
  | qualident actualParams ;
qualident : "ID" "PERIOD" qualident | "ID" ;
identList : "ID" "COMMA" identList | "ID" ;

I tried to match directly to what is given in the Learn LLVM 12 repository.

When I ran this command nimbleparse -y original src/tinylang.l src/tinylang.y src/gcd.mod

this is the error(s) I got, each run can yield different errors lol

src/tinylang.y: [Error] Unknown token 'MULT' in %epp declaration at line 1 column 1
src/tinylang.y: [Error] Unknown token 'HEXDIGIT' in %epp declaration at line 1 column 1
src/tinylang.y: [Error] Unknown token 'LT' in %epp declaration at line 1 column 1

gcd.mod is:

MODULE Gcd;

VAR x: INTEGER;

PROCEDURE GCD(a, b: INTEGER) : INTEGER;
VAR t: INTEGER;
BEGIN
  IF b = 0 THEN
    RETURN a;
  END;
  WHILE b # 0 DO
    t := a MOD b;
    a := b;
    b := t;
  END;
  RETURN a;
END GCD;

END Gcd.
ltratt commented 10 months ago

I think the error is because MULT/HEXDIGIT et al. don't appear in the grammar itself. However, the error is definitely confusing because the %epp is suppressing the better error message. If I get rid of the %epp lines I get this:

Warning: these tokens are defined in the lexer but not referenced in the
grammar:
  DIGIT
  GT
  GTEQ
  HEX
  HEXDIGIT
  HEXDIGITS
  LT
  LTEQ
  MULT
  SPACE
Error: these tokens are referenced in the grammar but not defined in the lexer:
  GREATER
  GREATEREQUAL
  HASH
  LESS
  LESSEQUAL
  SLASH
  STAR
ratmice commented 10 months ago

this is the error(s) I got, each run can yield different errors lol

Hmm, I knew this was the case with error recovery that it's errors were non-deterministic, but I wasn't aware of it in a case such as this. One other thing we could look at here is whether we should try and emit both the %epp errors, and the 'unreferenced in the lexer/grammar' warnings here. When I was adding output of multiple errors, I was somewhat conservative in that I was trying to avoid adding new cascading failures.

ltratt commented 10 months ago

One other thing we could look at here is whether we should try and emit both the %epp errors, and the 'unreferenced in the lexer/grammar' warnings here

This would definitely be good, but it's definitely some work.

jymchng commented 10 months ago

I think the error is because MULT/HEXDIGIT et al. don't appear in the grammar itself. However, the error is definitely confusing because the %epp is suppressing the better error message. If I get rid of the %epp lines I get this:

Warning: these tokens are defined in the lexer but not referenced in the
grammar:
  DIGIT
  GT
  GTEQ
  HEX
  HEXDIGIT
  HEXDIGITS
  LT
  LTEQ
  MULT
  SPACE
Error: these tokens are referenced in the grammar but not defined in the lexer:
  GREATER
  GREATEREQUAL
  HASH
  LESS
  LESSEQUAL
  SLASH
  STAR

@ltratt Thank you sir for your response.

For the first error,

> Warning: these tokens are defined in the lexer but not referenced in the
> grammar:
>   DIGIT

How can show me how can I define DIGIT in tinylang.l since it is a regex? The Java7 example wasn't helpful to me. Thank you.

ratmice commented 10 months ago

How can show me how can I define DIGIT in tinylang.l since it is a regex? The Java7 example wasn't helpful to me. Thank you.

These errors are a bit difficult to give general repair advice about. Here tinylang.l produces a token DIGIT, but the parser has no rules for how to consume one, comparing the linked lexer definitions:

[0-9] "DIGIT"
[0-9]+ "DIGITS"

Here, we have inlined the usage of DIGIT on the left hand side, in the original this was

digit    [0-9]
{digit}+      return tinylang::Parser::token::T_integer_literal;

While doing so we have added to the lexer the ability to produce the DIGIT token, it may never produce such a token depending on match precedence. But this token never existed in the original tinylang.l, and has nothing in tinylang.yy which prepares the parser to consume it, subsequently the lexer could produce a token which the parser does not expect.

So in this case I believe it makes sense to just remove the [0-9] "DIGIT" rule from the lexer. But in general the solution for either of these types of errors could involve modifying either or both of the lex and yacc files. E.g. when I typo the token in both.

hope that helps.