flofreud closed this issue 11 years ago
As mentioned in another issue, lexer group 2 prefers the following token types: NUM, REAL, OP, WSPACE. Each token gets two parameters, for example:
<NUM, "3", "3">
or
<WSPACE, "\t", "5">
The first parameter specifies the token itself. The second parameter specifies the LOC (line of code) of the found token.
In the group talk yesterday we came to the conclusion that it would be nice to give not only the LOC but also the position within the line. I cite Tkrauss / #7:
"Using this, we also need a Token class. I propose a simple class with a position (line in code, byte number in code etc.), an enum attribute (the enum should be like {"num","real","id","add","sub","neg"...}) and a string which is the string at that position in the source code (with reference to the example enum: "23", "2.3", "MyInt", "+", "-", "-" --- a NonTerminal)."
Because the lexer already looks at every character, it would be nice if it generated different token types for at least every grammatically distinguishable token. So it would emit different tokens for + and *, which have different priorities in the grammar. I think this will make the parser code more readable and will not be much more effort for the lexer.
Token with three parameters:
<AOP, "-", 3, 4>
An addition operation with "-" in line 3, character 4?
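Such a token could be sketched as a simple value class (a sketch only; the type and field names are illustrative, not an agreed interface):

```java
// Illustrative token carrying its type, lexeme, line and column.
// The TokenType constants are examples from this discussion, not a final set.
enum TokenType { NUM, REAL, OP, WSPACE, AOP }

class Token {
    final TokenType type;
    final String lexeme;
    final int line;   // LOC of the token
    final int column; // position within the line

    Token(TokenType type, String lexeme, int line, int column) {
        this.type = type;
        this.lexeme = lexeme;
        this.line = line;
        this.column = column;
    }

    @Override
    public String toString() {
        return "<" + type + ", \"" + lexeme + "\", " + line + ", " + column + ">";
    }
}
```

With this, `new Token(TokenType.AOP, "-", 3, 4)` prints as `<AOP, "-", 3, 4>`, matching the notation above.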
I looked up how the Drachenbuch solves the problem with tokens and line numbers. There, the token does not keep the line number, but the symbol table does. A token contains an identifier and an optional attribute. And if an identifier is found, the optional attribute contains a pointer to the symbol table, where the line number is kept. This can also be done with constants, such as numbers. Instead of keeping the number as a string in the token, we can again point to the symbol table, where the value of the number and the line in which it was found are kept.
Thus, a Token class has two variables: the identifier and a reference to a symbol table entry, which carries the other attributes, especially the line number.
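A minimal sketch of this dragon-book-style variant (all class and method names here are hypothetical, just to make the idea concrete):

```java
import java.util.HashMap;
import java.util.Map;

// The token holds only an identifier and an optional reference into the
// symbol table; the line number lives in the table entry, not in the token.
class SymbolTableEntry {
    final String lexeme;
    final int line; // line in which the lexeme was first found

    SymbolTableEntry(String lexeme, int line) {
        this.lexeme = lexeme;
        this.line = line;
    }
}

class Token {
    final String identifier;          // e.g. "id" or "num"
    final SymbolTableEntry attribute; // optional, may be null

    Token(String identifier, SymbolTableEntry attribute) {
        this.identifier = identifier;
        this.attribute = attribute;
    }
}

class SymbolTable {
    private final Map<String, SymbolTableEntry> entries = new HashMap<>();

    // Returns the existing entry for a lexeme, or creates one on first sight.
    SymbolTableEntry intern(String lexeme, int line) {
        return entries.computeIfAbsent(lexeme, l -> new SymbolTableEntry(l, line));
    }
}
```

Note that this sketch keeps only the first line number per lexeme, which is exactly the limitation discussed below for repeated identifiers.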
Any agreements? Other ideas?
The advantage of associating every token with a line number would be that we can use it later if (or rather when) we are able to visualize errors. Imagine the program "}{}". Since there is no identifier, no position information at all is forwarded to the parser. The parser will detect an error when consuming the <CLOSING_BRACE,'}'> token. Imho it would be nice if we had the possibility to throw an error referencing exactly this position in the code. But I like the idea of referencing the symbol table when an ID is found. Do you have an idea how to reference it with the current symbol table design?
I understand the idea that the line number is important for errors discovered during the parsing process. But I hardly understand why we should place it in a token. A token consists of two elements, the name and an optional attribute. We should not change this. Instead, we place the line number in the symbol table. I am working on our lexer design now. Is it important that the lexer keeps track of the scope in which an identifier or constant was found? I thought the lexer just collects tokens and pastes information into the symbol table, namely where and what was found, not in which scope. Therefore, I thought the lexer should build its own symbol table without the scope management, a simple hash table. And the parser is responsible for extracting the needed information. What about that?
If I get you right, there are two questions:
But I hardly understand why we should place it in a token. A token consists of two elements, the name and an optional attribute. We should not change this. Instead, we place the line number in the symbol table.
If we do this, we can only trace identifier-related problems back to the source code. This decision will restrict the scope for development concerning traceability "from source code to byte code".
Therefore, I thought the lexer should build an own symbol table without the scope management, a simple hash table
I did not think about that problem... It's closely related to issue #5 (grammar). Regardless of which option we choose, we will have a problem: an identifier declared in a block must no longer exist in the parent block once the block in which it was introduced ends. To recognize such a thing, we need to know the current scope... but the earliest point where this kind of information is known is in the parse tree / AST.
I thought that scope management is not the task of the lexer. The lexer just reads each lexeme, and if it finds the same lexeme again (e.g. the identifier "a", which was read in line 5), it returns a new token with the new line number (id, 13, reftosymboltable?).
I know what you mean... Now consider this example:
{
int y
y=1
}
int y
When the last "int y" is tokenized, y must not be found in the symbol table anymore. But this is not possible without considering the scope...
Why should y not be tokenized twice? Its occurrence depends on the line, and the lexer doesn't care about scopes and nested structures. Or am I missing something?
Then, at which step should the lexer return the symbol table, or is just one table used across all instances/phases of the compiler? We defined that the lexer is called by the parser with a method "getNextToken" (I will complement the interface on the wiki page as soon as possible!) and returns a token. Where would the symbol table be returned?
The latest discussion state is that the lexer will return the complete token list instead of token by token, if I read issue #9 correctly.
Absolutely right... I just worry about the information encapsulated in the token. Recap the example:
{
int y
y=1
}
int y
The symbol "y" is already in the symbol table while tokenizing the last line. If the lexer returns the reference to the entry in the symbol table along with the id, which entry will be referenced? The existing one for y? A newly instantiated one for a new symbol y? Either way, how reliable is that information? The parser would have to build a new symbol table with the correct information considering the different block levels. Or am I wrong? Correct me if I am; it's late, and as far as I know Prof. Fehr didn't teach us how to use symbol tables with different blocks. I'm afraid of using just one symbol table for different nested blocks, but I don't know how to solve that problem...
@flofreud: Although we didn't agree on this design, it's okay for us if I may represent my group (@dildik please correct me if I'm wrong). It's just a detail.
I read the comments of jvf, who said that it should be done as in the dragon book, which provides a getNextToken method. Thus, I understood his comments the other way :)
@flofreud It's okay if you represent your lexer group. I currently represent my lexer group. We should decide on a design. In our group we have written an interface (which I am uploading now), where we used the getNextSymbol design. But if you think there are serious disadvantages to this decision, just tell!
I'm kind of confused atm. Just to make it clear: @flofreud , @dildik and myself are in the same group/project;)
...no, we have two lexer groups. I am in the javabite lexer group, together with @fb13 and @akrillo89, and each lexer group should have one person representing it. These two persons should decide together about the interface.
@dildik I'm in your group and try to cross-reference related issues. I overlooked the remark about the lexer and thought they proposed a different design, but they don't.
@dildik I'm also in the Javabite group. The interface has to be defined with respect to the concerns of the parser and the overall architecture, but this is no issue, because everyone seems to be on the same page regarding the calling convention.
Ok. Is the getNextToken design decided, then? I am a bit confused, too :P The last problem is the management of the symbol table. But I think the interface doesn't give any information about the table, so the interface can be decided independently of the symbol table management...?!
@dildik I don't know whether it's decided... Maybe it's smart if you open a new issue, assign it to @EsGeh (he's the representative of group two for lexer/parser), and you both make a decision? We don't need the whole groups to be involved in this, in my opinion...
@dildik again, did you understand my worries about the symbol table at the lexer? Maybe it makes sense to remove the interaction between the lexer and the symbol table completely, so that the parser takes care of things like symbol referencing in the correct block context.
The problem is that the interfaces should be decided by tomorrow. Therefore, I cannot wait. I uploaded our interface proposals in the src directory under lexer. I already defined a proposed Token class.
@dildik What is the whitespace token needed for? I think the grammar needs whitespace only to separate tokens (and the lexer can ignore multiple).
The fuc group, when we discussed this last Thursday, was indeed in favor of a getNextToken() method, to make the design as usable as possible for the lecture.
@jvf Right. We should stick as much as possible to the lecture book. And as I remember the getNextToken() method is the one Fehr used in her lecture.
:+1: to the getNextToken() method. About the definition of signed numbers there seems to be a common sense; I don't support unlimited... The grammar contains the unary "-" operator. Combined with the alternative of signing an int, we can interpret "-2" as two tokens (- and 2) or as one token (-2), depending on whether we see it as a signed int or a negated unsigned int. Thus we have an ambiguity we have to remove asap imho. One solution would be to allow only unsigned reals/ints/etc.
A few thoughts:
3 . The lexer needs to be able to make entries into the symbol table hierarchy, right? So it needs to hold a reference to it. It also needs a method "setSymbolTable(SymbolTable table)" to set the currently "active" symbol table (see the Symbol Table wiki). Again, if you agree, I can make a new design which includes it.
(shit, github recognizes enumerations, and starts them from 1. :-P)
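The lexer interface sketched in point 3 might look like this (a hypothetical sketch; Token and SymbolTable are placeholders for the actual common classes, not an agreed design):

```java
// Placeholder types standing in for the project's real common classes.
class Token {
    final String name;
    Token(String name) { this.name = name; }
}

class SymbolTable { }

// Hypothetical lexer interface combining the two decisions under discussion:
// token-by-token delivery plus a settable "active" symbol table.
interface Lexer {
    // called by the parser; returns the next recognized token
    Token getNextToken();

    // sets the currently active symbol table the lexer writes entries into
    void setSymbolTable(SymbolTable table);
}
```

This keeps the parser in control of the token stream while still letting the caller decide which symbol table is active at any point.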
My opinion:
2 . I think it would be nice if the lexer could do that. If you recall the lecture, there was a step called "Erkenne Konstanten" (recognize constants). I am not insisting on this, but I still think it would be nice to let the lexer do that job, and let the parser concentrate on the aspects of the grammar only. Further opinions?
If we want to do so, we have to distinguish between the different classes. Keeping just one "Token" class won't be appropriate.
My idea: use generics. There is one interface "Token", which is parametrized by the generic parameter V. Since a token always has a string representation, a method "String getValue()" is specified. In addition, the method "V getExactValue()" returns an object of the generic type.
So there would be a class "StringToken implements Token<String>".
It should be no problem to let the visualization use another getNextToken lexer instance and pull all tokens from it. The concerns of visualization should only change the lexer if really necessary, and I do not see why it cannot loop over the tokens from a lexer instance. If visualization is used, the lexer is most likely used separately from the other modules.
@dildik and I just made an uml proposal for the lexer.
See https://github.com/swp-uebersetzerbau-ss13/common/wiki/Lexer.
We think it isn't necessary to have such "RealToken", "StringToken", etc. classes. We just have the Token class, which has methods to identify the type of the token. We return a generic value, and the parser can identify the token type with a method getTokenType().
@fb13 and @dildik: Please look into #5, where we concluded that the - before a number should be a token itself.
More of a detail, but in the grammar discussion the E-notation was added for writing bigger numbers, and we should probably only use the dot, not dot or comma, as decimal separator. Also, the tokens for string and id can consist of more than one character, and an id shouldn't start with a number.
[...] it is most definitely part of the lexer to recognize numeric tokens and calculate their numeric value during lexing. If you look at the examples from the dragon book, the tokens num and real have integer and float values as lexemes, not strings, and the numeric values are calculated during lexing (see Chapter 2 and Appendix A of the dragon book).
As stated above, I don't think String is sufficient as a value for tokens!
As stated above, I don't think String is sufficient as a value for tokens!
You mean that we need special kinds of tokens? In the appendix of the dragon book, I see that string tokens and num tokens are separated. But why can't we just have one token class and a generic type as value?
But why can't we just have one token class and a generic type as value?
Generics are not helpful in this case. Generic types in Java are a compile-time feature; the type information is not accessible at runtime. We need the type information at runtime, and the static type of the field would be Object for a generic without a type bound. The compiler cannot infer the generic type information for our case, so we would need to cast explicitly. We could also use an Object field and cast it by token type to the correct inner type, instead of casting the Token object to the correct token type.
Imho casting tokens is the cleaner design compared to casting the Object field. I would like a token hierarchy for num, real, bool and string (which covers all the rest).
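A rough sketch of such a token hierarchy (class names are only a suggestion, not the project's agreed design):

```java
// Base token plus specialized subclasses; the parser casts the Token
// object to the concrete subtype instead of casting an Object value field.
class Token {
    final String lexeme;
    Token(String lexeme) { this.lexeme = lexeme; }
}

class NumToken extends Token {
    final long value;
    NumToken(String lexeme, long value) { super(lexeme); this.value = value; }
}

class RealToken extends Token {
    final double value;
    RealToken(String lexeme, double value) { super(lexeme); this.value = value; }
}

class BoolToken extends Token {
    final boolean value;
    BoolToken(String lexeme, boolean value) { super(lexeme); this.value = value; }
}
```

The parser then branches on the concrete type, e.g. `if (t instanceof NumToken) { long v = ((NumToken) t).value; }`.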
My idea: use generics. There is one interface "Token", which is parametrized by the generic parameter V. Since a token always has a string representation, a method "String getValue()" is specified. In addition, the method "V getExactValue()" returns an object of the generic type. So there would be a class "StringToken implements Token<String>", which contains tokens like "do", "while", "{", ... Another class would be "RealToken", implementing Token with a numeric type parameter. In that case we have all the information we need, and the parser can "peek" at the information it needs.
Just repeating my previous idea... Imho this design will do the job.
I think it's better to use only one special Token class instead of an interface and generics. We can declare the return type of the getValue method as Object, so we can put String, Real, Num, Int objects into it. And the type of the value is stored as a TokenType, so we know which cast is necessary.
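A minimal sketch of this single-class variant (enum constants and names are illustrative only):

```java
// One Token class: the value is an Object, and the TokenType tells the
// parser which cast is safe. No subclasses, no generics.
enum TokenType { NUM, REAL, STRING, BOOL }

class Token {
    final TokenType type;
    final Object value;

    Token(TokenType type, Object value) {
        this.type = type;
        this.value = value;
    }

    Object getValue() { return value; }
}
```

Usage would look like `Token t = new Token(TokenType.REAL, 2.3); if (t.type == TokenType.REAL) { double d = (Double) t.getValue(); }` — the cast moves from the token to the value field, as discussed above.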
Quite frankly, I don't get the fuss. Why not just a (small) token hierarchy, with a base token class and num, real etc. classes extending it? Why use an OO language if not using it?
Instead of creating separate enum constants for "+" and "-", we could use ARITHMOP. For example:
-2
would result in
<ARITHMOP, "-", loc, coc><NUM, "2", loc, coc>
What do you think?
@kensan83 In fact I had the same idea, but in the "Dragonbook" (:D) they distinguish between token types, so we should do this as well...
@fuc-thomas Thanks, but this is exactly the UML that @dildik and I created and uploaded. So what? It isn't correct, because we are missing tokens for num, real, string and boolean, as this discussion shows. Also it isn't clear whether we should use an ARITHMOP or just create a SIGNOP for cases like "+3" or "-5".
@jvf, @fb Well, then let us use a token hierarchy.
added a proposal in the repo, under common/doc/lexer.
It is not yet complete.
@EsGeh Why do we need a class LocationInSource, when we could just place that information in the token itself? As the discussion showed, most people here would prefer NUM, STRING, REAL and BOOL tokens; why do you propose NUM, ID and SIMPLE?
@dildik just uploaded a new version of our uml:
Well, Real and String are supposed to be added; anyone feel free to do so. As far as I know, from the view of the parser the tokens should represent the terminal symbols of the grammar. In the grammar there is no terminal "bool", but two terminals called "true" and "false". Since both do not carry additional information, SimpleToken can be used to represent them.
By the way, I don't think the terminal definitions (by which I mean the regular expressions/definitions) should be part of the interface, since other components don't need to know them.
We removed the "," separator and added num and real types for exponential notation.
What do you think about that?
For strings you should define a begin and end symbol ("?), and between these all (ASCII?) characters are allowed; you could easily get the expressiveness of non-printable characters through Java. But I don't think that matching "..." is possible with a regex if \" should be allowed. If we want a regex lexer to be possible, we could match with "[^"]*", but probably my assumption is wrong.
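For reference, escaped quotes can in fact be handled with a plain regular expression; a common pattern (a sketch, not our agreed token definition) is `"([^"\\]|\\.)*"`:

```java
import java.util.regex.Pattern;

// "([^"\\]|\\.)*" : between the double quotes, allow any run of
// non-quote, non-backslash characters or backslash-escaped characters.
class StringLiteralRegex {
    static final Pattern STRING = Pattern.compile("\"([^\"\\\\]|\\\\.)*\"");

    static boolean isStringLiteral(String s) {
        return STRING.matcher(s).matches();
    }
}
```

This accepts `"a\"b"` while rejecting unterminated strings, so `\"` inside string literals does not by itself force a non-regex lexer.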
You still have to remove the (+|-) from the tokens. The unary minus needs to be handled as an extra token (as mentioned by the parser/semantics group).
We removed the "," separator and added num and real types for exponential notation. [image: lexer] https://f.cloud.github.com/assets/2676770/414687/c01efc88-ac1f-11e2-864a-40bc41ba1018.jpeg
Well, what do you think about the decision to remove "-" from UNARYOP? This solves the overlap between UNARYOP and all regexes containing "-".
We can't change the grammar. Why should the minus be in the regex for the numbers? It's unnecessary.
Comments and discussion for the lexer. See wiki page: Lexer