usethesource / rascal

The implementation of the Rascal meta-programming language (including interpreter, type checker, parser generator, compiler and JVM based run-time system)
http://www.rascal-mpl.org
Other
399 stars 77 forks source link

Getting rid of implode #470

Closed tvdstorm closed 8 years ago

tvdstorm commented 10 years ago

Here's an idea to get rid of implode, with only minor penalty (I think).

Implode transforms parse trees (PTs) to abstract syntax trees (ASTs). To do this, it requires an ADT describing the AST structure. Implode is often complained about and for good reason. The following problems can be observed:

To solve these problems, I suggest the following idea. Get rid of implode and the separate ADT. The grammar defines a single type, just like now, but values of the type have three literal representations:

  1. Concrete syntax in backquotes.
  2. Tree notation with appls and prods
  3. Prefix notation like in add(lhs, rhs) (let's call it "Grammar AST" or GAST notation)

The 3rd item is new: it means you can construct "abstract" concrete syntax trees. As a result, the grammar is enough to allow you to program with ASTs. Concrete and GAST representations can be arbitrarily mixed and nested.

This design has one big consequence: whenever a parse tree is serialized, and a GAST sub-tree is encountered, we need to explode the tree. IOW we might have to invent layout and/or separators. Currently there is no restriction on either Layout or separators, so it's not always possible to do that. Consequently, either you'll get a runtime error ("can't invent layout/sep"), or we impose the restriction statically when using a GAST where no layout/sep can be inferred. Note: this is not about inferring intended layout/seps, just syntactically valid layout/seps. If you need to tweak layout, either you don't use GASTs, or you have to write a pretty printer like all AST people do.

Some observations:

Cons:

Possible issues:

Feedback welcome!

P.S. Isn't it ironic that one might get rid of implode by introducing explode?

tvdstorm commented 10 years ago

Just realized: GASTs are really the dual of concrete syntax patterns/expressions:

Concrete syntax -> statically insert appl-trees through parsing GASTs -> statically insert appl-trees through explode.

So I think there's a story here.

PaulKlint commented 10 years ago

I like the gist of your proposal but we have to work out the details. Complete duality between concrete and abstract views on the same tree is appealing. We also have to work out the efficiency implications: ASTs were supposed to be more efficient. As long as we can do all the required transformations at compile time, we are ok I guess.

We could even generalize a bit further and allow concrete patterns to match abstract trees, provided that the types of the subtrees are "compatible".

tvdstorm commented 10 years ago

We could even generalize a bit further and allow concrete patterns to match abstract trees, provided that the types of the subtrees are "compatible".

I don't like "compatible" in scare-quotes. In my proposal GASTs will match old-skool concrete trees, if that's what you mean. Like so:

add(`1`, `2`) := `1 + 2`

=> True

jurgenvinju commented 10 years ago

I think it would be a good idea to do this and a challenging one at the same time.

On Mon, Dec 30, 2013 at 11:28 PM, Tijs van der Storm < notifications@github.com> wrote:

Here's an idea to get rid of implode, with only minor penalty (I think).

Implode transforms parse trees (PTs) to abstract syntax trees (ASTs). To do this, it requires an ADT describing the AST structure. Implode is often complained about and for good reason. The following problems can be observed:

  • Most of the information in that ADT, is also in the grammar.
  • Implode often (mostly) implements a one-to-one mapping, which feels wasteful and superfluous
  • The ADT should use type and constructor names corresponding to the non-terminals and production labels in the grammar, which is brittle.
  • Module sophistry is required to prevent non-terminal types from clashing with ADT types.

To solve these problems, I suggest the following idea. Get rid of implode and the separate ADT. The grammar defines a single type, just like now, but values of the type have three literal representations:

  1. Concrete syntax in backquotes.
  2. Tree notation with appls and prods
  3. Prefix notation like in add(lhs, rhs) (let's call it "Grammar AST" or GAST notation)

The 3rd item is new: it means you can construct "abstract" concrete syntax trees. As a result, the grammar is enough to allow you to program with ASTs. Concrete and GAST representations can be arbitrarily mixed and nested.

This design has one big consequence: whenever a parse tree is serialized, and a GAST sub-tree is encountered, we need to explode the tree. IOW we might have to invent layout and/or separators. Currently there is no restriction on either Layout or separators, so it's not always possible to do that. Consequently, either you'll get a runtime error ("can't invent layout/sep"), or we impose the restriction statically when using a GAST where no layout/sep can be inferred. Note: this is not about inferring intended layout/seps, just syntactically valid layout/seps. If you need to tweak layout, either you don't use GASTs, or you have to write a pretty printer like all AST people do.

Some observations:

  • Concrete trees are equal to concrete trees if they are _really_equal, i.e. including layout. (Like now)
  • GASTs are never equal to any concrete tree because they don't have layout.
  • Matching works across all three kinds.
  • If there is no production label, there is no GAST representation, but you have to use concrete syntax or appls.
  • There are now values of the same type, which have different in-memory representations.
  • For pattern matching this proposal is already implemented, more or less (?).

Cons:

  • GASTs can't be more flexible than concrete syntax trees. The mapping is really one-to-one.
  • We lose a little wysiwyg: "ASTs" that print out as source code.

Possible issues:

  • We need grammar information for explode (i.e. productions): where to get them?
  • Explode could occur at unparse-time, or at non-pattern GAST expression evaluation time (i.e. create appl-prod nodes immediately).
  • How to deal with optionals? You could use list notation (like implode currently assumes), but this would break matching I think. Another option is using Maybe, but this suggests Maybe shouldn't be in utilanymore.

Feedback welcome!

P.S. Isn't it ironic that one might get rid of implode by introducing explode?

— Reply to this email directly or view it on GitHubhttps://github.com/cwi-swat/rascal/issues/470 .

Jurgen Vinju

swatbot commented 10 years ago

BTW, the full grammar for every non-terminal is always available using #nonterminal, so that point is tackled.

On Fri, Jan 3, 2014 at 11:36 AM, Jurgen J. Vinju notifications@github.comwrote:

I think it would be a good idea to do this and a challenging one at the same time.

  • It's an extension of the abstract pattern matching to include its mirror image in construction. We could generate a sentence of length 1 for every layout non-terminal as a default, if such a sentence exists, or otherwise default to the shortest possible sentence for the layout non-terminal. Same same for separators in lists. We may also apply the heuristics from pandora's pretty printer instead which recognizes typical programming language constructs such as comma separated lists, brackets and binary operators to infer when a newline is better than a space.
  • abstract pattern matching on concrete trees is also still in its infancy in terms of documentation, parametrization and error messaging...
  • Interesting challenges are in the dealings with lists and string literals i.e. in call("hello",[a,b]) with syntax Exp = call: Id "(" {Exp ","}* ")", we need to know that [a,b] is an abstract notation for a concrete list and this is only provided by the context so has to be inferred by the type checker. Also, "hello" should preferable be parsed as Id, which again adds more inference capabilities to the type checker and more chance of funny error messages...

On Mon, Dec 30, 2013 at 11:28 PM, Tijs van der Storm < notifications@github.com> wrote:

Here's an idea to get rid of implode, with only minor penalty (I think).

Implode transforms parse trees (PTs) to abstract syntax trees (ASTs). To do this, it requires an ADT describing the AST structure. Implode is often complained about and for good reason. The following problems can be observed:

  • Most of the information in that ADT, is also in the grammar.
  • Implode often (mostly) implements a one-to-one mapping, which feels wasteful and superfluous
  • The ADT should use type and constructor names corresponding to the non-terminals and production labels in the grammar, which is brittle.
  • Module sophistry is required to prevent non-terminal types from clashing with ADT types.

To solve these problems, I suggest the following idea. Get rid of implode and the separate ADT. The grammar defines a single type, just like now, but values of the type have three literal representations:

  1. Concrete syntax in backquotes.
  2. Tree notation with appls and prods
  3. Prefix notation like in add(lhs, rhs) (let's call it "Grammar AST" or GAST notation)

The 3rd item is new: it means you can construct "abstract" concrete syntax trees. As a result, the grammar is enough to allow you to program with ASTs. Concrete and GAST representations can be arbitrarily mixed and nested.

This design has one big consequence: whenever a parse tree is serialized, and a GAST sub-tree is encountered, we need to explode the tree. IOW we might have to invent layout and/or separators. Currently there is no restriction on either Layout or separators, so it's not always possible to do that. Consequently, either you'll get a runtime error ("can't invent layout/sep"), or we impose the restriction statically when using a GAST where no layout/sep can be inferred. Note: this is not about inferring intended layout/seps, just syntactically valid layout/seps. If you need to tweak layout, either you don't use GASTs, or you have to write a pretty printer like all AST people do.

Some observations:

  • Concrete trees are equal to concrete trees if they are _really_equal, i.e. including layout. (Like now)
  • GASTs are never equal to any concrete tree because they don't have layout.
  • Matching works across all three kinds.
  • If there is no production label, there is no GAST representation, but you have to use concrete syntax or appls.
  • There are now values of the same type, which have different in-memory representations.
  • For pattern matching this proposal is already implemented, more or less (?).

Cons:

  • GASTs can't be more flexible than concrete syntax trees. The mapping is really one-to-one.
  • We lose a little wysiwyg: "ASTs" that print out as source code.

Possible issues:

  • We need grammar information for explode (i.e. productions): where to get them?
  • Explode could occur at unparse-time, or at non-pattern GAST expression evaluation time (i.e. create appl-prod nodes immediately).
  • How to deal with optionals? You could use list notation (like implode currently assumes), but this would break matching I think. Another option is using Maybe, but this suggests Maybe shouldn't be in utilanymore.

Feedback welcome!

P.S. Isn't it ironic that one might get rid of implode by introducing explode?

— Reply to this email directly or view it on GitHub< https://github.com/cwi-swat/rascal/issues/470> .

Jurgen Vinju

  • Centrum Wiskunde & Informatica - SWAT
  • INRIA Lille - ATEAMS
  • Universiteit van Amsterdam

www: http://jurgen.vinju.org, http://www.rascal-mpl.org, http://twitter.com/jurgenvinju skype: jurgen.vinju

Reply to this email directly or view it on GitHubhttps://github.com/cwi-swat/rascal/issues/470#issuecomment-31515370 .

Jurgen Vinju