Abstract Syntax Tree - Githubissues

SWP-Comp-Deathk commented 11 years ago

Comments and Discussion for Abstract Syntax Tree See Wiki page: Abstract Syntax Tree

Tkrauss commented 11 years ago

At first, thanks for the image-version of the diagram:). It's almost what we had in mind. Just some remarks:

Do we really need the nested blocks / conditional blocks? Since a block is just a stmt, it should be enough to have the block class referencing statements and declarations.
imho, the role of the token should be replaced by a Class "Node" with an attribute Token. In this case, it' s a less coupled system concerning the lexer and parser modules.
Using this, we need also a Token class. I propose a simple class with a Position ( Line in Code, byte num in Code etc), a enum attribute ( the enum should be like {"num","real","id","add","sub","neg"...}) and a string wich is the string at the pos in the source code ( with reference to the example enum: "23","2.3","MyInt","+","-","-" --- a NonTerminal).
The AST is a result of the semantic analysis, isn't it? If we want to modularize the parser/semantic analysis, we should also introduce parseTree-classes as a direct mapping from the Grammar Symbols to Classes.

fub-frank commented 11 years ago

I am a member of the intermediate code generator group. Anyway i will try to comment on your points.

Do we really need the nested blocks / conditional blocks? Since a block is just a stmt, it should be enough to have the block class referencing statements and declarations.

It is nice to distinguish between the kinds of block. Anyway i agree that blocks are statements. I would therefore introduce it as this: Block extends Statement, NestedBlock extends Block, ConditionalBlock extends NestedBlock. For IR Generation the only thing we need is a reference to the current symbol table in every block of any kind. See Symbol Table for this.

imho, the role of the token should be replaced by a Class "Node" with an attribute Token. In this case, it' s a less coupled system concerning the lexer and parser modules.

Yes i agree, but that is only a renaming. Structure of the graph stays the same here.

Using this, we need also a Token class. I propose a simple class with a Position ( Line in Code, byte num in Code etc), a enum attribute ( the enum should be like {"num","real","id","add","sub","neg"...}) and a string wich is the string at the pos in the source code ( with reference to the example enum: "23","2.3","MyInt","+","-","-" --- a NonTerminal).

Yes we need a token class then. I think we should implement a Token class anyway. The Lexer could then return a Sequence of Token-Object instead of simple Strings. Instead of a simple enum I would suggest a base class Token and child classes for every token that extend the Token class. E.g. NumToken extends Token, RealToken extends Token etc.

The AST is a result of the semantic analysis, isn't it? If we want to modularize the parser/semantic analysis, we should also introduce parseTree-classes as a direct mapping from the Grammar Symbols to Classes.

The AST is the result of the parser. Semantic analysis can happen within the parser, between the parser and the ir generator (a seperate module) or the ir generator. A seperate module seems best for this. Anyway the semantic checker already accepts the AST. It can either return the checked AST or true / false on errors. E.g. public void CheckSemantic(AST ast) throws SemanticErrorException; or public boolean isASTSemanticallyCorrect(AST ast); Therefore we do not need extra classes here.

If we want to have an extra step for semantic analysis (which attributes the syntax tree) then i would suggest using the same AST structure. The parser builds the tree but leaves the attributes empty, the semantic analyzer can then fill in the attributes.

Tkrauss commented 11 years ago

I am a member of the intermediate code generator group. Anyway i will try to comment on your points.

Feedback of every group is welcome;)

It is nice to distinguish between the kinds of block. Anyway i agree that blocks are statements. I would therefore introduce it as this: Block extends Statement, NestedBlock extends Block, ConditionalBlock extends NestedBlock. For IR Generation the only thing we need is a reference to the current symbol table in every block of any kind. See Symbol Table for this.

To be honest, i don't see the need for multiple block-classes. But since you are one of the guys working with the AST, i trust you and i agree on this design;). .

Yes i agree, but that is only a renaming. Structure of the graph stays the same here.

It is, it was just some kind of confusing...

Yes we need a token class then. I think we should implement a Token class anyway. The Lexer could then return a Sequence of Token-Object instead of simple Strings. Instead of a simple enum I would suggest a base class Token and child classes for every token that extend the Token class. E.g. NumToken extends Token, RealToken extends Token etc.

I thought more about how the implementation will look like. In the enum-case, it could be a bunch of switch/case-statements, in the inherited design it would be if-else-instanceof stuff. Maybe we can compromise here with both. I'll talk to my Lexer-group members to keep their opinion in mind. Anyway, it's more an aesthetic issue... .

If we want to have an extra step for semantic analysis (which attributes the syntax tree) then i would suggest using the same AST structure. The parser builds the tree but leaves the attributes empty, the semantic analyzer can then fill in the attributes.

Sounds reasonable. I'm just afraid of the block-class-issue in this case. The outcome of the parser ( the parser tree ) is context-free. But it's not possible to distinguish between a conditional block and a nested block in a context-free way. Or am i the confused one here?:)

fub-frank commented 11 years ago

To be honest, i don't see the need for multiple block-classes. But since you are one of the guys working with the AST, i trust you and i agree on this design;). .

Sounds reasonable. I'm just afraid of the block-class-issue in this case. The outcome of the parser ( the parser tree ) is context-free. But it's not possible to distinguish between a conditional block and a nested block in a context-free way. Or am i the confused one here?:)

Yes this is a problem. If we want to distinguish between the block types then there are two possibilities

The parser has to decide (which it can't)
The semantic analysis has to replace nodes in the AST (which is not nice at all).

These are major disadvantages in the "different-types-of-blocks"-model. You have convinced me of the advantages of the "only-one-type-of-block"-model. Intermediate code generator can work with this as well.

AST should then be fitted with a "branch"-node. The branch node has 2 or three child nodes. first child node: condition, second child node: block if condition is true, third child node: block if condition is false. There is an example of this over at wikipedia

fb13 commented 11 years ago

Darf ich fragen, warum ihr alle auf Englisch schreibt? Soweit ich das gesehen habe, kann jeder von uns Deutsch.

In meiner Gruppe haben wir Enums als Präferenz um Token-Typen (NUM, OP, WHITESPACE, ...) zu verwalten.

Aber wir würden keine Tokens für ADD, SUB, etc. nehmen, sondern stattdessen einfach einen Token-Typ OP definieren. Dieser steht für Operation und hat außerdem noch den Operator, also z.B.:

<OP, "+">

Eventuell können wir OP auch BINOP (Binärer Operator) nennen. Token-design-issue Moved to Lexer -> Goto Issue

Tkrauss commented 11 years ago

I agree to the proposal for the Design @SWP-Comp-Ch3ck3r mentioned ( referenced wiki resp. ). If i get you right, the major task of the semantic analysis is -- aside of assigning and computing the attributes -- removing unnecessary syntactic sugar ( The Expression->Factor->Number... Production sequence for example). In this case we would get a isomorphic AST like we can see at wikipedia. If we choose the design, the Parser Tree design luckily determines also the design of the AST, since it has to be a subset.

fub-frank commented 11 years ago

Then we need to decide what is easier:

Using the same tree for both parser and semantic analyse: This means, semantic analyse has to reconfigure the tree (adding nodes, removing nodes, moving nodes, replacing nodes) which can be quite difficult
Build two different versions of the tree:
1. Semantic analyse uses a subset of the elements already used in syntax tree (subset mentioned by Tkrauss)
2. Semantic analyse has its own tree elements that are disjoint from the syntax tree

I think variant 1 is too difficult. Variant 2.2 is better for clean code separation and uses single purpose objects. Variant 2.1 is implemented faster (less things to do) but not as clean as 2.2.

Looking at the discussion and the arguments, i am now in favor of 2.2 actually.

I have published a component model on wiki home by the way. Wiki Home Page

swpcom-heydu commented 11 years ago

Long thread, correct me if I got something wrong ;-).

The ConditionalBlock is just an abstract class with an expression (condition) and a block as „child“, it would be one of its sub-classes which determine the semantic. I don't see why this could not be parsed context-free. If the parser sees „while (COND) {BLOCK}“ it would create a While instance with those two attributes as children in the AST. It's the same idea as in the image from Wikipedia, just with different names. Or not?

Block vs. NestedBlock, yes, that is not so elegant. With Expression we already have a recursive definition, so NestedBlock could be replaced by Block.

Should Equality be merged with Relation?

Why can't the parser produce the AST. Why is there a need for a parse tree?

flofreud commented 11 years ago

Should Equality be merged with Relation?

They can have different valid operands: num == num and num <= num works but not bool <= bool. Therefore it will be easier for symantic analysis if they arn't merged, i think.

EsGeh commented 11 years ago

Hey there,

Please correct me, if I got something wrong, but it seems to me that the proposed design (given by the UML Diag) is quiet complex, and also quiet fixed to our specific grammar. Maybe it is possible to find an easier design, with less complicated class hierarchy, by taking a more general approach. (By the way I would propose to make less detailed diagram at first, which shows the basic idea of the design, rather than putting to many details in. This would also simplify the discussion, I think)

Why not having something like this (rough draft, in pseudocode):

[PSEUDOCODE] type SyntaxTree = Node type AST = SyntaxTree

class Node < RestrictionOnSubNodeCount > { Tag tag; Container < Annotations > annotations Container < RestrictionOnSubNodeCount, SubNodes > subNodes; }

type Tag = Either<Variable,Terminal>

enum Variable { Program, Declaration, Statement, ... } enum Terminal { + , - , ..., ifelse, ... }

class/interface Annotation {...}

class TypeAnnot deriving Annotation {...}; class BlaAnnotation deriving Annotation {...}; ... [/PSEUDOCODE]

(a tag is either a Variable or Terminal, to make it possible using the same data structure as general SyntaxTree (uses Variables for inner nodes), as for representing a AST(also uses Terminals for inner nodes). This design allows also for things in-between, which is not nice (=> find a solution by discussion)).

Now the question is, what kind of Annotations are needed, and what is common between them (design the "Annotation" class/interface).

First of all my point is, to use a more general design, by first looking on the similarities of classes in behaviour, and secondly on their differences (always asking, if derivation is needed, or if the same thing can be achieved by using a data field to distinguish their behaviour).

swpcom-heydu commented 11 years ago

It would allow bad things like a while-loop with multiple bodies. Why should this be allowed?
What would be the benefit of your idea? The complexity has to be somewhere. Either in the design (implicit, such as only one body for while) or in the code (explicit, check the number of those bodies in the code).
What are “those things in-between”?

Tkrauss commented 11 years ago

I think this wouldn't be good idea, too. This design isn't really complex, since it is more or less a translation of the cfg ( and of course it's fixed to our grammar, because our task is to use this very grammar). If we use the approach of @EsGeh , you'll probably get many runtime errors in case of a erroneous implementation. The most errors can be covered statically ( the first child of a Block MUST be a Declarations-object, not just some Element), which i would prefer. _To summarize the discussion till now_

We'll delete the Token-class since it is the matter of the lexer . But we need some kind of a reference in a AST-node( or leaf resp. ) for debugging purpose.
Also, a AST node has to carry some information like: type (bool,real, number, record), environment attributes ( symboltable for each block ). I think it's the best idea to realize these properties as regular object attributes in each class which extends the AST-node class ( so the block has a symboltable, a id has a type etc.).
The Output of the Parser is a AST with correct semantic.
How to get to the AST is the matter of the parser-implementation. I think the first group tends to use the parser tree as a direct deviation of the grammar and builds a AST upon this while the second ( LLVM ) group may translate it directly. But this kind of granularity of the implementation is no more a topic in this common discussion. The only point that is important is that the code generation group(s) agree to the design of the AST, since this is what they get.

EsGeh commented 11 years ago

I've been thinking about some of these problems while going for a walk. Of course the obvious problem of my proposal is that it has nearly no type restrictions. I think the necessary type restrictions could be achieved by using Java Generics. It still makes it possible to seperate Grammar specific things from the general aspects of the model, which I consider more clear.

fub-frank commented 11 years ago

Somebody should start uploading interfaces. That would make discussion easier, at least it would show some progress.

swpcom-heydu commented 11 years ago

I've pushed a first draft. Feel free to edit as you like.

ghost commented 11 years ago

I've pushed a first draft. Feel free to edit as you like.

Thank you for this. I have reworked the ast using your provided structure as a template. Please find the new ast in the common.ast package. I will complete the javadoc documentation later.

Here is a big image documenting the class hierarchy click here for original size

structure

It should be complete.

One thing is still missing: The interface IdentifierNode. This Node needs to be able to represent access to an identifier of any type and any depth (array, struct, array in array, struct in struct etc). I am not clear about how to do this yet.

If this tree does not fit for the Syntax Tree the parser needs, syntax tree and abstract tree need to be two different definitions.

This design should combine two advantages

it is a nice class hierarchy describing all the nodes in the tree
you can still use a switch() statement by referring to ASTNodeType enum.

Tkrauss commented 11 years ago

Thank you so much, i think with this design we're on a right way. @Identifier problem: How about using the composite pattern? We should get a Identifier class and some extending it ( StructIdNode, BasicIdNode, ArrayIdNode). Each of the subclasses reference other Identifier classes if necessary ( the BasicIdNode doesn't need to, the ArrayIdNode should reference 1 other Id class, the StructNode should hava a tupel-list/map with attribute names and the corresponding classes).

EsGeh commented 11 years ago

I have made an UML-Diag to illustrate another design strategy. Of course the Diagram Ch3ck3r proposed is very elaborate, and seems to be quiet adequate. Maybe it can be merged with the idea I proposed? SyntaxTreeModelDraft

fub-frank commented 11 years ago

Sorry i used my wrong username in the post i just made:)

@EsGeh Yes i will try to merge these. Should be possible easy enough by adding another hierarchy into the ast design.

@Tkrauss that is a good idea, i did not think of extending identifiernode any further. I will try to implement this.

EsGeh commented 11 years ago

cool!

fub-frank commented 11 years ago

I have updated the interface to contain a correct definition of IdentifierNode, thanks to Tkrauss!

@EsGeh While trying to merge the two designs i encountered two problems:

There are some nodes that are none of LeafNode, UnaryNode, BinaryNode, etc. E.g. the StatementNode has subclasses of all node types. This would end up in something like StatementNode extends ASTNode and then BranchNode extends StatementNode, TernaryNode<T1, T2, T3>.
The method names in your model are quite generic (e.g. getFirstNode). In the other model the method names are more concrete (e.g. getCondition). I like the concrete names better. If we keep both, the interfaces would have two different methods for the same action e.g. BranchNode would have getCondition() and getFirstNode() returning the same thing.

Anyway your model should be perfect for implementing the interfaces in means of abstract classes.

Maybe someone has some thoughts on this.

Tkrauss commented 11 years ago

To be honest, i don't see the need for a more generalized design at the moment. The big plus of a more generalized design will be that we are able to fit the software to another grammar. Anyway, that's not our task but the task of the students next year:). The big minus would be that debugging will be awful, since we have no clue about the concrete Elements of the grammar, and the relation to the grammar resp..

EsGeh commented 11 years ago

In fact, I am thinking about whether it is correct that so many kinds of nodes are deriving "StatementNode". There is nothing, that it's children have in common. What we would like to model is that a statement can be ONE OF some kinds of classes. I think a union/variant pattern would fit better. Something like: class Statement { enum Type { BRANCH, RETURN, ...}; Type getType();

BranchNode getAsBranchNode() throws NotABranchException();
ReturnNode getAsReturnNode() throws NotAReturnNodeException();
...

}; This has a subtle different semantic than a normal derivation. I think it fits better, for our case.

fub-frank commented 11 years ago

Doing it your way would double the size of the tree, as almost every node (currently extending the statement) would end up as two nodes (e.g. StatementNode and BranchNode instead of BranchNode).

The current model depicts the grammar itself. A branch is a special kind of statement. That is what extension is meant for: it specializes the parent class.

SWP-Comp-Deathk commented 11 years ago

Hi,

there are good reasons java does not support a union datatype. It was excluded from java because of several issues it caused in c and c++. Java's way of doing this is to use a class hierarchy and marker interfaces. You can have a look at this here: http://lambda-the-ultimate.org/node/2694

We should not start implementing a union type into java when it was excluded from java by purpose.

swpcom-heydu commented 11 years ago

@SWP-Comp-Ch3ck3r wow, you changed a lot. That is in many ways not the design in the model. Those changes truly need a discussion.

I've made (abstract) classes because there is a lot of boilerplate code (such as getter/setter).
Why is every class also represented in the ASTNodeType enumeration? For me this is just a nicer way of asking for the classes name. Needs discussion, because both (enum and class name) have the same design problem.
Why are If, While and DoWhile totally different classes? They have the same properties, just different semantics. This is what ConditionalNode was meant for.
Why do you use the boxed types in getNumberOf*?
What is the idea of grouping the nodes by number of operands instead by their meaning? This is, at least to me, very confusing. The framework is meant for students and should be is easy to use as possible.
What is the idea of ArrayIdentifierNode? Why is this not just an attribute of Identifier?
Although I asked those many questions, there are also things I like very much :-). The tree-handling (getChilds) and the Iterators are a nice idea. But also boilerplate code. The merge of all those binary operations like add, or, ... to one class with enum is also a nice idea. Makes it a lot easier to handle.

fub-frank commented 11 years ago

Implementation should be the work of the groups themselves.
This is a nice to have feature to use a switch statement instead of if instanceof X else if instance of Y else if instance of Z etc. This design was chosen in other datastructures already (like Type). I just copied this.
They are not. Please note the difference between classes and interfaces. You can use the same class implementing several interfrace. E.G. class "Loop implements DoWhileLoop, WhileLoop, etc." Anyway there needs to be a way of differentiating between them as they produce different TAC.
I don't know if it is useful to be able to get the corresponding node numbers. Better to have it than to miss it.
I think you are refering to UnaryExpression UnaryArithmeticExpression etc.?! Again: These are interfaces. Implementation can group them in another way. Also extending a common interface is not grouping but using the same functionality. Functionality changes between binary and unary types, not between logical and arithmetic etc.
Arrays can be deeper than one level. Using an attribute would only allow access to one level. arr[1][2][3][4] is not representable by one attribute
Thanks - and why is this discussion taking place so late. This discussion is open for days, it's quite late now as this definition has to be defined until this evening.

swpcom-heydu commented 11 years ago

Why is this discussion taking place so late ... well, that may be my fault :-P. I'm fine with every “interface”, just like to ask questions :-).

Why should both implement the same boilerplate?
This in combination with not using classes is error prone. Every student needs to implement this.
The TAC guys can differentiate also when using classes. Sorry, but that is not an argument.
The question was: Why using Integer instead of int? What is the benefit? I'm not that into Java.
Functionality changes between both the number of operands and the kind. Both groupings are valid. For me this is confusing. If I'm the only one that should not be a problem.
Good point, missed that.

fub-frank commented 11 years ago

You have almost answered this one yourself. Interfaces can be implemented differently between groups. Use your example of grouping operations. The groups can decide how they want to group their implementaiton. Getters and Setters can be easily created by source code generators like eclipse's one. Anyway there is always the possiblity to implement this once for both groups and push it to the common repository as long as the interfaces are defined.
Yes it is. Switch-able enumerations aren't my favourites either. This request actually came from other students (see #8 Types for example). If its going after me we could completely remove the enumeration as there is the instanceof operator. Though we have seen some students unaware of instanceof operator already in these issues.
Yes they can when using classes. I meant they can not differentiate when you use one class for all kinds of loops and branches. At least not by using instanceof. If it is only one class then there is the need of a getter like "getKindOfLoop". This would be a multipurpose object (so called god object) then. This is against OOP design.
Integer has more methods that are available for your use. Check out java.lang.Integer or have a look over here http://mindprod.com/jgloss/intvsinteger.html. Automatic unboxing from Integer to int (since java 1.5) is always at your finger tips. Automatic boxing isn't available always.
Again, as these are interface you can implement them as you like. In the concrete AST you will not find the nodes BinaryExpression or UnaryExpression as you will only find the leaf nodes like BinaryArithmeticExpression. I think it is more important to distinguish between binary and unary operations in a syntax tree (because this decides the look of the tree in the matter of children) than to differentiate between arithmetic and logical operations. Therefore i think this grouping should occur on the higher level. But this is a matter of personal preferences.

Tkrauss commented 11 years ago

In my opinion, the most questions has already been discussed here or in similar threads ( like enums versus inheritance). I prefer using the interfaces as they are. I think we can be pretty happy if the only problem which occurs will be using switch statements instead of if else or the other way around. Since the design is gross enough, every group can decide itself wether to use another design approach inside the modules and can simply adapt the interfaces on which we should agree in the next days.

swpcom-heydu commented 11 years ago

Maybe you are right and I'm a little to strict on this. So we consider the interface as accepted for M1?

fub-frank commented 11 years ago

I would say so. If we run across major issues when working with this there is always the possibility to adapt the interfaces or design by refactoring or other means.

Tkrauss commented 11 years ago

I would prefer this, too. I think we have a nice approach atm and afaik, every person involved agreed or had just little objections. As long as there are no worries, I'm going to translate the current UML to code in the next minutes

EdwarDDay commented 11 years ago

We just met as semantic analyse group and think, that we need a unary Operator for casts. This way we can add a cast to the AST. Additionally we need a key-value-store (i.e. string-object-map) in the ASTNodes for the SDD.

fub-frank commented 11 years ago

We just met as semantic analyse group and think, that we need a unary Operator for casts.

if we want to add casts in the AST already that is correct. This addition is not hard though. Anyway i am not sure whether casts should occur in the ast or whether they should be generated by IR only.

Additionally we need a key-value-store (i.e. string-object-map) in the ASTNodes for the SDD.

The SDD is not part of the AST. It is used to generate attributes in the AST. Those attributes should already be defined in the AST (leftValue, rightValue, Condition, Identifier etc). Therefore it should not be part of the AST itself. Please clearify why this needs to be in the ASTNodes, and which values would it hold then.

fub-frank commented 11 years ago

@Tkrauss I saw you checked in AST interfaces here https://github.com/swp-uebersetzerbau-ss13/common/tree/master/interfaces/src/swp_compiler_ss13/common/parser Did you know they are already checkedin here https://github.com/swp-uebersetzerbau-ss13/common/tree/master/interfaces/src/swp_compiler_ss13/common/ast ??

Because AST is nothing parser special but is instead used by three components (Parser, Semantic, IR) it should not be buried into parser specific packages.

Tkrauss commented 11 years ago

@SWP-Comp-Ch3ck3r Yes you are right. didn't take a look into the other packages as i found the AST class in the parser package. I'll remove my redundant code.

fub-frank commented 11 years ago

Although the ast package is not documented yet either. Feel free to document it. I will take a look at it in a few hours and document the code if it is not done until then.

EsGeh commented 11 years ago

Discussion has gotten complicately. Let us stick to uml diagrams at first. It's easier to use them for discussion, and It is easy to generate java interfaces from them. Proposal:

let this issue be the issue for comparing different proposals.
start another wiki and issue for the design we agreed of so far. I have tried to make a diagram, which is supposed to sum up the discussion so far. Does everyone agree, to use it as a base for the new thread? Let us then try to take step by step carefully, so we don't have to discuss 50 things at once: (common/doc/alternateDesignForSyntaxTree/syntaxTreeModel.dia):

SyntaxTreeModelDraft

Of course, this is incomplete by far (just chapter 1, and 2 of the grammar have been taken into account), but if everyone is ok, with what is there yet, we can go on discussion what are the next things to decide.

fub-frank commented 11 years ago

I dont get this. As far as we have come together in this thread discussion is over and design is decided already. the only question left is by EdwarDDay concerning SDD.

Time is up anyway. Deadline was Monday 23:59:59 and presentation are being prepared now.

EsGeh commented 11 years ago

Whatever. I think the current AST-Definition is just horrible. I think we should take the time we need, since the AST-Definition is decisive for our whole project. In fact I think the current version is a step backwards. Whatever.

fub-frank commented 11 years ago

I think the current AST-Definition is just horrible

And you have only stated 2 reasons for that

you would like to use union types instead of inheritance
you would like to use generic nodes.

Those issues have been discussed already

java have no union types for a good reason. See java documentation for reasons, there was one linked already.
This does not interfere with the current design as you are free to * implement * those interface the way you like.

I think we should take the time we need

You have had 14 days to do this. It's not my problem (and not the one of others) if you start discussion the last day or two. We can not just use as much time as we want. If we would do so, this project would not make progress at all.

P.S.: Some groups have started implementing already (i don't know if ast groups are invovled). Changing all the interfaces now would make their progress vanish.

swp-uebersetzerbau-ss13 / common

Abstract Syntax Tree #7