goccmack / gocc

Parser / Scanner Generator

Brace use in productions. #108

Open kfsone opened 3 years ago

kfsone commented 3 years ago

Tripped myself up on a subtle difference between token and production definitions :)

identifier: 'a'-'z' { 'a'-'z' };

File: Enum;
Enum: "enum" identifier "{" { identifier } "}";

->

Parse error: Error: {(13) { @ 4:29, expected one of: | ; g_sdt_lit prodId tokId string_lit

which is the unquoted open brace:

Enum: "enum" identifier "{" { identifier } "}";
                            ^
12345678901234567890123456789

Obviously, I need to use the alternate structure here, but I'm curious: wouldn't it make sense to get the same effect by allowing '{' in productions as well?
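For readers unfamiliar with the "alternate structure": one common way to express zero-or-more repetition in an LR grammar is a left-recursive helper production. The Go sketch below only prints such a rewrite; `repeatRule` and the helper name `IdentList` are hypothetical names chosen for illustration, not anything gocc provides.

```go
package main

import "fmt"

// repeatRule returns a left-recursive helper production that accepts
// zero or more occurrences of item, i.e. the effect of "{ item }".
// Hypothetical helper for illustration; not part of gocc.
func repeatRule(helper, item string) string {
	return fmt.Sprintf("%s: empty | %s %s;", helper, helper, item)
}

func main() {
	// The Enum production refers to the helper instead of using braces.
	fmt.Println(`Enum: "enum" identifier "{" IdentList "}";`)
	fmt.Println(repeatRule("IdentList", "identifier"))
}
```

The second printed line is `IdentList: empty | IdentList identifier;`, which can stand in for the unquoted `{ identifier }` group.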

awalterschulze commented 3 years ago

Good question :)

Adding a zero-or-more operator to the parser production rules was actually considered, but it made the syntax-directed translation (SDT) rules more complicated to specify, so it was left out for the sake of simplicity.

kfsone commented 3 years ago

Aye, in the process of converting my grammar over from my homebrew parser, I found the absence of square brackets a little more frustrating. It felt like something that ought to be feasible as syntactic sugar, i.e. writing:

    Element: Value [ "," ];

could be treated as

    Element: Value "," | Value;

awalterschulze commented 3 years ago

This is a great example of where the SDT rules would be awkward. Try including SDT rules in your examples; maybe I am wrong.

kfsone commented 3 years ago

[edit: after reading the gocc2.bnf I'm guessing 'sdt' specifically refers to the <<...>> directives; I'll write a follow-up]

Sure, something like this?

List: Element | List Element;
Element: Value "," | Value;

I'll try to swing back to this and look at the code to see whether the implementation I have in mind is feasible, but

R: P [ n ];

would effectively be internally mapped to

R: __R0 | __R1;
__R0: P n;
__R1: P;

// so my example
List: Element | List Element;
Element: Value "," | Value;

// becomes
List: Element | List Element;
Element: Value [ "," ];

// produces the same result as
List: Element | List Element;
Element: __Element0 | __Element1;
__Element0: Value ",";
__Element1: Value;
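The rewrite described above can be sketched in Go. This is a minimal illustration, not gocc's implementation: the production body is assumed to be a pre-split slice of symbols, `expandOptional` and `render` are hypothetical helpers, and only a single, non-nested `[ ... ]` group is handled. For brevity it emits the two alternatives inline rather than through `__R0`/`__R1` helper rules.

```go
package main

import (
	"fmt"
	"strings"
)

// expandOptional rewrites a body containing one non-nested "[ ... ]" group
// into two alternatives: one with the group's symbols, one without them.
// Hypothetical helper for illustration; not part of gocc.
func expandOptional(body []string) [][]string {
	open, closeIdx := -1, -1
	for i, sym := range body {
		if sym == "[" && open < 0 {
			open = i
		} else if sym == "]" && open >= 0 {
			closeIdx = i
			break
		}
	}
	if open < 0 || closeIdx < 0 {
		// No optional group: the body is its own single alternative.
		return [][]string{body}
	}

	with := append([]string{}, body[:open]...)
	with = append(with, body[open+1:closeIdx]...)
	with = append(with, body[closeIdx+1:]...)

	without := append([]string{}, body[:open]...)
	without = append(without, body[closeIdx+1:]...)
	if len(without) == 0 {
		// gocc spells an empty alternative with the keyword "empty".
		without = []string{"empty"}
	}
	return [][]string{with, without}
}

// render formats the alternatives as a single gocc-style production.
func render(name string, alts [][]string) string {
	parts := make([]string, len(alts))
	for i, alt := range alts {
		parts[i] = strings.Join(alt, " ")
	}
	return name + ": " + strings.Join(parts, " | ") + ";"
}

func main() {
	body := []string{"Value", "[", `","`, "]"}
	fmt.Println(render("Element", expandOptional(body)))
	// Prints: Element: Value "," | Value;
}
```

Note that this sketch says nothing about SDT attributes: the two alternatives have different symbol counts, which is exactly where the difficulty lies.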

Pardon my oafishness: I'm self-taught, and aside from toy parsers/compilers for small DSLs I haven't worked on a real parser in anger since I wrote a MUD language and engine, where the compiler produced an abstract grammar that the engine subsequently used to drive a bottom-up parser to interpret player input ('plant the big plant in the little plant pot and pot the little plant with the big potted plant' [spot the catch :)]).

kfsone commented 3 years ago

After reading the gocc2, I think you're referring to trying to capture the "optional" field in a production:

ClassDef: "class" identifier OptionalParent Body << ast.NewClass($1, $2, $3) >>;
OptionalParent : ":" identifier | empty;

vs

ClassDef: "class" identifier [ Parent ] Body << ast.NewClass($1, $??, $??) >>;

If "[...]" is replaced with a logical substitute, then "[ Parent ]" would remain $2 regardless, it would just have a nil value when none was provided, so it would still be treated exactly as

ClassDef : "class" identifier __optional__Parent Body  <<  ast.NewClass($1, $2, $3) >>;

__optional__Parent : Parent << $0, nil >> | empty << nil, nil >>;
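On the Go side, an action written against that substitute might look like the sketch below. It is only an illustration under stated assumptions: `Attrib` is redeclared locally to mirror the `interface{}` attribute type gocc passes to SDT actions, plain strings stand in for the token and AST values a real parser would pass, and `Class` and `NewClass` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Attrib mirrors the interface{} attribute type used by gocc SDT actions.
// It is redeclared here so the sketch is self-contained.
type Attrib interface{}

// Class is a hypothetical AST node for the ClassDef production.
type Class struct {
	Name   string
	Parent string // empty when no parent was given
	Body   string
}

// NewClass is a hypothetical action for << ast.NewClass($1, $2, $3) >>.
// $2 is nil whenever the optional Parent was absent, so the argument
// positions are the same in both cases.
func NewClass(name, parent, body Attrib) (Attrib, error) {
	n, ok := name.(string)
	if !ok {
		return nil, errors.New("class name must be a string")
	}
	b, ok := body.(string)
	if !ok {
		return nil, errors.New("class body must be a string")
	}
	c := &Class{Name: n, Body: b}
	if parent != nil {
		p, ok := parent.(string)
		if !ok {
			return nil, errors.New("parent must be a string")
		}
		c.Parent = p
	}
	return c, nil
}

func main() {
	withParent, _ := NewClass("Dog", "Animal", "{}")
	noParent, _ := NewClass("Animal", nil, "{}")
	fmt.Println(withParent.(*Class).Parent == "Animal") // parent supplied
	fmt.Println(noParent.(*Class).Parent == "")         // parent omitted
}
```

The point of the sketch is that the action needs a single nil check rather than a separate code path per alternative.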

The precedent for this is "anonymous terminals", where gocc allows

ClassDef: "class" ...

instead of requiring

class_keyword: "class";

ClassDef: class_keyword identifier ...

kfsone commented 3 years ago

I can see cases where a naive approach would cause problems:

// looking at you, Guido.
import : "import" [ identifier string_lit | string_lit "as" identifier ];

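obvious but flawed workarounds:

  • pad the attrib count to match worst case, let the user figure out the context themselves: lots of surprises for beginners :(
  • require each branch have the same attrib count: will have an arbitrary feel and still surprise users with order of params,

or:

  • disallow | in Lexical []s: it's a small but incredibly useful convenience for a lot of super-common cases, the effect on attribs is relatively predictable for learners.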

awalterschulze commented 3 years ago

> I'm guessing 'sdt' specifically refers to the <<...>> directives

Yes exactly

awalterschulze commented 3 years ago

> After reading the gocc2, I think you're referring to trying to capture the "optional" field in a production:
>
> ClassDef: "class" identifier OptionalParent Body << ast.NewClass($1, $2, $3) >>;
> OptionalParent : ":" identifier | empty;
>
> vs
>
> ClassDef: "class" identifier [ Parent ] Body << ast.NewClass($1, $??, $??) >>;
>
> If "[...]" is replaced with a logical substitute, then "[ Parent ]" would remain $2 regardless, it would just have a nil value when none was provided, so it would still be treated exactly as
>
> ClassDef : "class" identifier __optional__Parent Body  <<  ast.NewClass($1, $2, $3) >>;
>
> __optional__Parent : Parent << $0, nil >> | empty << nil, nil >>;
>
> The precedent for this is "anonymous terminals", where gocc allows
>
> ClassDef: "class" ...
>
> instead of requiring
>
> class_keyword: "class";
>
> ClassDef: class_keyword identifier ...

I think this might work for the optional case, not sure about all implications, but at least SDT rules look nice.

awalterschulze commented 3 years ago

> I can see cases where a naive approach would cause problems:
>
> // looking at you, Guido.
> import : "import" [ identifier string_lit | string_lit "as" identifier ];
>
> obvious but flawed workarounds:
>
>   • pad the attrib count to match worst case, let the user figure out the context themselves: lots of surprises for beginners :(
>   • require each branch have the same attrib count: will have an arbitrary feel and still surprise users with order of params,
>
> or:
>
>   • disallow | in Lexical []s: it's a small but incredibly useful convenience for a lot of super-common cases, the effect on attribs is relatively predictable for learners.

Yes, I think | is already only allowed at the top level in the parser part of the BNF, so this shouldn't be a problem.
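To make the attribute-count concern concrete, here is a small Go sketch. It assumes, purely for illustration, that every symbol in a branch contributes exactly one attribute; `branchArities` and `uniform` are hypothetical helpers and not part of gocc.

```go
package main

import "fmt"

// branchArities returns the number of symbols in each "|"-separated branch
// of an optional group, assuming one attribute per symbol.
// Hypothetical helper for illustration; not part of gocc.
func branchArities(group []string) []int {
	arities := []int{}
	count := 0
	for _, sym := range group {
		if sym == "|" {
			arities = append(arities, count)
			count = 0
			continue
		}
		count++
	}
	return append(arities, count)
}

// uniform reports whether every branch yields the same number of attributes.
func uniform(arities []int) bool {
	for _, a := range arities {
		if a != arities[0] {
			return false
		}
	}
	return true
}

func main() {
	// The group from: import : "import" [ identifier string_lit | string_lit "as" identifier ];
	group := []string{"identifier", "string_lit", "|", "string_lit", `"as"`, "identifier"}
	arities := branchArities(group)
	fmt.Println(arities, uniform(arities))
	// Prints: [2 3] false
}
```

With branch sizes of 2 and 3, a positional reference such as $1 would mean a different symbol depending on which branch matched, which is the surprise the workarounds above try to avoid. Restricting | to the top level sidesteps the question entirely.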