How to reconcile concrete and abstract syntax trees (aka getting rid of implode)

usethesource / rascal

The implementation of the Rascal meta-programming language (including interpreter, type checker, parser generator, compiler and JVM based run-time system)

http://www.rascal-mpl.org

Other

402 stars 77 forks source link

How to reconcile concrete and abstract syntax trees (aka getting rid of implode) #762

Closed PaulKlint closed 8 years ago

PaulKlint commented 9 years ago

After repeated discussions (latest today with @tvdstorm) we seem to agree that we have to reconsider how we handle the relation between a concrete and an abstract representation of syntax trees. At the moment we have two options:

Do all processing at the concrete representation. This works fine, the type checker and compiler take this approach and no abstract representation is needed.
Introduce a data type for your AST and map the concrete tree to this data type. One can do this manually (work!) or automatically using implode (error prone! since it may be impossible to map the concrete syntax to your AST data type).

A third alternative is to always work on the concrete syntax tree but give the illusion of working on an abstract version. We have discussed this before and called this feature "overlay" or "view". The main point of this issue is that we are nearly there. This approach requires support for abstract patterns that match concrete subjects. For example, given a syntax rule with constructor name mul (see example below) the following will match:

mul(Exp e1, Exp e2) := [Exp] "3*4"`.

Here is a complete example (runnable using the compiler, not yet fully supported by the interpreter):

module Experiment
import ParseTree;
import IO;
import String;

// Exp grammar
lexical LAYOUT = [\t-\n\r\ ];                    
layout LAYOUTLIST = LAYOUT*  !>> [\t-\n\r\ ] ;     
lexical IntegerLiteral = [0-9]+;           

start syntax Exp =                         
    con: IntegerLiteral 
  | bracket "(" Exp ")"     
  > left mul: Exp "*" Exp 
  > left add: Exp "+" Exp
  ;

// Evaluation rules that use abstract patterns to match concrete trees
public int eval(con(IntegerLiteral il)) = toInt("<il>");
public int eval((Exp) `( <Exp e> )`) = eval(e);
public int eval(mul(Exp e1, Exp e2)) = eval(e1) * eval(e2);
public int eval(add(Exp e1, Exp e2)) = eval(e1) + eval(e2);

value main(list[value] args) = eval([Exp] "(3 + 4)*5");

Apart from the evaluation rule for bracket, all other rules have a clear abstract flavor. Points for discussion:

Are there any disadvantages to this approach? [Performance might be a point of concern.]
Are there better alternatives?

jurgenvinju commented 9 years ago

The advantages are clear, but there are clear disadvantages which makes this not always the best solution:

performance is certainly not optimal; although we are working to use less space for concrete trees on a branch;
it is unclear what the mapping between concrete and abstract lists should be and how the pattern matcher should work. This is a big TODO. Consider eval([Stat head, *Stat tail]) = eval(head) o eval(tail), should it match with any Stat* non-terminal or with list[Stat] or with both? This breaks our type system/pattern matching consistency (pattern matching is a conservative extension of the sub-type relation now, but not if we add that kind of coercion where two different kinds of containers suddenly become comparable.
matching is reasonably clear, but we still need a good answer for constructing concrete trees using abstract syntax. TODO: where does the layout come from? TODO: which disambiguations are executed? TODO: who inserts brackets?
but most of all this only works well (i.e. elegantly) if the grammar is in a perfect shape, namely where it has a minimal amount of non-terminals not many nested symbols and clear constructor names, etc. If the grammar is not in a good shape, you get the worst of both worlds (abstract and concrete), namely lots of accidental complexity and lots of brackets.

I think we should have this, but it does not replace the need for implode or data type usage for ASTs.

jurgenvinju commented 9 years ago

We can also write a static implode checker which lines up a concrete grammar with an abstract grammar and checks if at least at every level there would be an alternative abstract constructor for every concrete constructor.

jurgenvinju commented 9 years ago

Another note: pattern matching should ignore/step-over brackets rules, so the bracket rule should not have to be written in principle. There are some dark corners there which still have to be considered, namely regarding the identity function and not letting brackets magically disappear.

PaulKlint commented 9 years ago

Many points to comment on:

I am not that worried about performance, it works out very well in the compiler and other optimizations are under way.
It is indeed a question whether we should want to mix concrete and abstract lists given enough support for concrete lists. The above example can be easily rewritten as: eval({Stat ";"}+ stats) = eval(head(stats)) o eval(tail(stats)). But I agree: further exploration needed.
I explicitly omitted the mapping from abstract to concrete from this discussion in order not to confuse too many issues but if we only use concrete trees this mapping is redundant. Of course we still need a pretty printer that maps concrete -> concrete.
Indeed a well-formed grammar is essential. Unfortunately, when this is not the case, implode does not offer a solution either :-(
Yes, we can, in principle, statically check whether implode will succeed.
Agreed: brackets and priorities may need further exploration.

jurgenvinju commented 9 years ago

it's not the speed as much as the memory usage compared to ASTs that I am worried about. @DavyLandman's work on CC could not have been done on parse trees because at that scale it just doesn't fit in memory. ASTs need to be a story because of these kinds of applications.
yes more exploration. It would be weird if we create corner cases where pattern matching is not expressive enough or completely implemented on a data-type while we advertise it as the main mechanism to basically do everything.
I don't understand; there is the mapping from the abstract expression to constructing the parse tree in memory which needs a semantics: which parse tree precisely does add(num(1),num(2)) match/produce? and on top of that we need a concrete->concrete formatter too.
I disagree. An implode can do very well on imperfect grammars. That is what it is for, it simplifies the representation. Either a manual implode, or Tijs' implode which flattens superfluous nesting automatically does the job.
yep.
yep.

My feeling is that the world is just not simple. We have different usage scenarios for dealing with syntax trees and we need to satisfy them with different features:

Concrete syntax needs to improve to become a viable alternative for abstract syntax (fewer type annotations, better nesting with other pattern syntax, like deep match)
Abstract syntax needs to be completed to be able to match on concrete trees and produce concrete trees (lists and other regular expressions)
Implode needs to be rewritten to become extensible and more robust (debuggable);

PaulKlint commented 9 years ago

(3) I am talking about a model where there is no abstract representation. Everything uses a concrete tree and there is therefore no standard way to go from abstract to concrete. The only purpose of the "abstract" pattern mul(Exp e1, Exp2) is to match concrete trees. As I mentioned we still do need a pretty printer for a concrete -> concrete mapping. (4) Of course a manual implode can do everything and we do need nothing extra for that. If I see how the current heuristics of implode can bite, I doubt whether more heuristics will work.

I prefer to design a world view that is as simple as possible but is still usable ;-)

Of course we have to see what we need, but I think that deep match is not a problem and can be simply used in the prosed abstract patterns.
I don't understand why concrete patterns should produce concrete trees; the trees are already concrete in the proposed model.
I am becoming more and more pessimistic about the reimplementation of implode in the compiler context. Also a reimplementation in Rascal is problematic and would require extending reified types with overloaded constructors. (But this is yet another discussion, let's keep that separate).

jurgenvinju commented 9 years ago

Well, if we take this one thing at a time, let's finish the abstract matching on concrete trees and that's one down.

I would like to propose that top-level list patterns only match abstract lists, and that nested lists under an abstract constructor pattern will have the concrete list matching behavior (namely skipping layout). For example, eval(if(Exp _, [*Stat _, Stat last])) , would actually work on the concrete parse tree and get the last statement out of the body of an if statement, and eval([*Stat _, Stat last]) would not match with just tree of type Stat* because that is an Tree.appl and not a list[Stat].

Let's talk offline about number 2.

tvdstorm commented 9 years ago

Just for the records: the latest discussion I had with @PaulKlint was triggered by having to deal with overloaded constructors for normalization. This seems to be the only real problem. It's probably even implemented wrong in the interpreter now. In the compiler it's a can of worms.

Normalizing constructor overloads are not lexically scoped: implode needs to know about functions in the caller's environment, not the lexical environment where implode (or any intermediate function) is defined. Even if we keep implode as function (be it in Java, or Rascal), it will need special treatment.

Another solution: build it in completely and make implode a keyword.

tvdstorm commented 9 years ago

Another few cents:

I think we shouldn't mix X* values and lists. Keeping them completely distinct resolves the pattern matching ambiguity.
We can already match using prefix notation, but we cannot build. We could auto-generate explode functions that create concrete syntax trees from trees build in prefix notation. (Of course these can be "overridden" by the user to get custom behavior). These explode functions know about brackets and priorities.
You can always build your custom implode if you need the performance. In the future this pain could be alleviated when we have "folding" visit (i.e. visit that is not type preserving).
Btw: the parser already does "overloaded constructor" invocation in the filtering phase. What would happen if you give an additional reified ADT argument to parse? Could it implode directly?

jurgenvinju commented 9 years ago

with your few cents @tvdstorm:

agreed
@PaulKlint and I have found a solution for this which is elegant. more to come.
certainly, a true parametrized algebraic visit would be great to help implementing implode and related functionality such as outline, unparse, translate;
not right now. the parser builds parse trees, but this is parameterized by an interface. It can build anything in principle and any function can be folded into the map from parse graph to constructor tree