swp-uebersetzerbau-ss13 / common

Shared files between teams.

Lexer Specification #3

Closed flofreud closed 11 years ago

flofreud commented 11 years ago

Comments and Discussion for Lexer See Wiki page: Lexer

fb13 commented 11 years ago

As mentioned in another issue, lexer group 2 prefers the following token types: NUM, REAL, OP, WSPACE. Each token gets two parameters, for example:

<NUM, "3", "3"> 

or

<WSPACE, "\t", "5">

The first parameter specifies the token itself. The second parameter specifies the LOC (line of code) of the found token.

flofreud commented 11 years ago

In yesterday's group talk we came to the conclusion that it would be nice to give not only the LOC but also the position within the line. I cite Tkrauss from #7:

"Using this, we also need a Token class. I propose a simple class with a Position (line in code, byte number in code etc.), an enum attribute (the enum should be like {"num","real","id","add","sub","neg"...}) and a string which is the string at the position in the source code (with reference to the example enum: "23", "2.3", "MyInt", "+", "-", "-" --- a NonTerminal)."

Because the lexer already looks at every character, it would be nice if it generated different token types for at least every grammatically distinguishable token. You would then emit different tokens for + and *, which have different priorities in the grammar. I think this will make the parser code more readable and will not be much more effort for the lexer.

A token with three parameters:

<AOP, "-", 3, 4>

An addition-operation token with "-" in line 3, character 4?
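A minimal sketch of such a token class (all names here are illustrative, not an agreed-upon interface):

```java
// Sketch of a token carrying its position in the source.
enum TokenType { NUM, REAL, ID, AOP, WSPACE }

class Token {
    private final TokenType type;
    private final String value; // the lexeme as found in the source
    private final int line;     // line of code (LOC)
    private final int column;   // character position within the line

    Token(TokenType type, String value, int line, int column) {
        this.type = type;
        this.value = value;
        this.line = line;
        this.column = column;
    }

    TokenType getType() { return type; }
    String getValue() { return value; }
    int getLine() { return line; }
    int getColumn() { return column; }

    @Override
    public String toString() {
        return "<" + type + ", \"" + value + "\", " + line + ", " + column + ">";
    }
}
```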

damlabeyaz commented 11 years ago

I looked up how the Drachenbuch (dragon book) solves the problem of tokens and line numbers. There, the token does not keep the line number; the symbol table does. A token contains an identifier and an optional attribute. If an identifier is found, the optional attribute contains a pointer to the symbol table entry, where the line number is kept. The same can be done with constants such as numbers: instead of keeping the number as a string in the token, we again point to the symbol table, where the value of the number and the line in which it was found are kept.

Thus, a Token class has two variables: the identifier and a reference to a symbol table entry, which holds the other attributes, especially a line number.
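A rough sketch of this dragon-book style design; the class and method names here are made up for illustration, not part of any agreed interface:

```java
import java.util.HashMap;
import java.util.Map;

// The token holds only an optional reference into the symbol table;
// the table entry stores the lexeme and the line number.
class SymbolEntry {
    final String lexeme;
    final int line; // line where the symbol was first found
    SymbolEntry(String lexeme, int line) { this.lexeme = lexeme; this.line = line; }
}

class SymbolTable {
    private final Map<String, SymbolEntry> entries = new HashMap<>();

    // Return the existing entry for a lexeme, or create one recording
    // the line of the first occurrence.
    SymbolEntry intern(String lexeme, int line) {
        return entries.computeIfAbsent(lexeme, l -> new SymbolEntry(l, line));
    }
}

class IdToken {
    final SymbolEntry entry; // optional attribute: pointer into the symbol table
    IdToken(SymbolEntry entry) { this.entry = entry; }
}
```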

Any agreements? Other ideas?

Tkrauss commented 11 years ago

The advantage of associating every token with a line number would be that we can use it later if (or rather when) we are able to visualize errors. Imagine the program "}{}". Since there is no identifier, no position information at all is forwarded to the parser. The parser will detect an error when consuming the <CLOSING_BRACE,'}'> token. Imho it would be nice if we had the possibility to throw an error referencing exactly this position in the code. But I like the idea of referencing the symbol table when an ID is found. Do you have an idea how to reference it with the current symbol table design?

damlabeyaz commented 11 years ago

I understand that the line number is important for errors discovered during the parsing process. But I hardly understand why we should place it in the token. A token consists of two elements, the name and an optional attribute; we should not change this. Instead, we place the line number in the symbol table. I am working on our lexer design now. Is it important for the lexer to keep track of the scope in which an identifier or constant was found? I thought the lexer just collects tokens and writes into the symbol table what was found and where, not in which scope. Therefore, I thought the lexer should build its own symbol table without scope management, a simple hash table, and the parser is responsible for extracting the needed information. What about that?

Tkrauss commented 11 years ago

If I get you right, there are two questions:

But I hardly understand why we should place it in the token. A token consists of two elements, the name and an optional attribute. We should not change this. Instead, we place the line number in the symbol table.

If we do this, we can only debug identifier-related problems by referencing the source code. This decision will restrict the scope for development concerning the traceability "from source code to byte code".

Therefore, I thought the lexer should build an own symbol table without the scope management, a simple hash table

I did not think about that problem... It's closely related to issue #5 (grammar). Regardless of which option we choose, we will have a problem: an identifier declared in a block must no longer be visible in the parent block once the block in which it was introduced is closed. To recognize such a thing we need to know the current scope... but the earliest point where this kind of information is available is the parse tree / AST.

damlabeyaz commented 11 years ago

I thought that scope management is not the task of the lexer. The lexer just reads each lexeme, and if it finds the same lexeme again (e.g. the identifier "a", which was read in line 5), it returns a new token with the new line number (id, 13, reftosymboltable?).

Tkrauss commented 11 years ago

I know what you mean... Now consider this example:

{
 int y
 y=1
}
int y

When this last "int y" is tokenized, y must not be found in the symbol table. But this is not possible without considering the scope...

damlabeyaz commented 11 years ago

Why should y not be tokenized twice? Its occurrence depends on the line, and the lexer doesn't care about scopes and nested structures. Or am I seeing something wrong?

Then, at which step should the lexer return the symbol table, or is just one table used across all instances/phases of the compiler? We defined that the lexer is called by the parser with a method getNextToken (I will complete the interface on the wiki page as soon as possible!) and returns a token. Where is the symbol table returned?

flofreud commented 11 years ago

The latest state of the discussion is that the lexer will return the complete token list instead of going token by token, if I read issue #9 correctly.


Tkrauss commented 11 years ago

Absolutely right... I just worry about the information encapsulated in the token. Recall the example:

{
 int y
 y=1
}
int y

The symbol "y" is already in the symbol table while the last line is being tokenized. If the lexer returns a reference to the symbol table entry along with the id, which entry will be referenced? The existing one for y? A newly instantiated one for a new symbol y? Either way, how reliable is that information; the parser would have to build a new symbol table with the correct information for the different block levels. Or am I wrong? Correct me if I am; it's late and, as far as I know, Prof. Fehr didn't teach us how to use symbol tables in the case of different blocks. I'm afraid of using just one symbol table for different nested blocks, but I don't know how to solve that problem...
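One classic answer to the nesting problem (a sketch only, not part of this thread's agreed design) is a chain of symbol tables, one per block, each pointing to its enclosing scope; whether the lexer or the parser maintains that chain is exactly the open question here:

```java
import java.util.HashMap;
import java.util.Map;

// One symbol table per block, linked to the enclosing scope.
class ScopedSymbolTable {
    private final ScopedSymbolTable parent;            // enclosing scope, or null
    private final Map<String, String> symbols = new HashMap<>(); // name -> type

    ScopedSymbolTable(ScopedSymbolTable parent) { this.parent = parent; }

    void declare(String name, String type) { symbols.put(name, type); }

    // Look up a name in this scope first, then walk outward.
    String lookup(String name) {
        if (symbols.containsKey(name)) return symbols.get(name);
        return parent != null ? parent.lookup(name) : null;
    }
}
```

With the "{ int y ... } int y" example, the inner block gets its own table; once the block is closed and lookups go against the outer table, its y is simply not there.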

@flofreud: Although we didn't agree on this design, it's okay for us, if I may represent my group (@dildik please correct me if I'm wrong). It's just a detail.

damlabeyaz commented 11 years ago

I read the comments of jvf, who said that it should be done as in the dragon book, which provides a getNextToken method. Thus, I understood his comments the other way :)

@flofreud It's okay if you represent your lexer group; I currently represent mine. We should decide on a design. In our group we have written an interface (which I am uploading now) that uses the getNextSymbol design. But if you think there are serious disadvantages to this decision, just tell!

Tkrauss commented 11 years ago

I'm kind of confused atm. Just to make it clear: @flofreud, @dildik and myself are in the same group/project ;)

damlabeyaz commented 11 years ago

...no, we have two lexer groups. I am in the javabite lexer group together with @fb13 and @akrillo89. Each lexer group should have one person representing it, and these two persons should decide together about the interface.

flofreud commented 11 years ago

@dildik I'm in your group and try to cross-reference related issues. I overlooked the remark for the lexer and thought they proposed a different design, but they don't.

flofreud commented 11 years ago

@dildik I'm also in the Javabite group. The interface has to be defined with respect to the concerns of the parser and the overall architecture, but this is no issue, because everyone seems to be on the same page regarding the calling convention.

damlabeyaz commented 11 years ago

Ok. So the getNextToken design is decided? I am a bit confused, too :P The last problem is the management of the symbol table. But I think the interface doesn't expose any information about the table, so the interface can be decided independently of the symbol table management...?!

Tkrauss commented 11 years ago

@dildik I don't know whether it's decided... Maybe it's smart if you open a new issue, assign it to @EsGeh (he's the representative of group two for lexer/parser) and you both make a decision? We don't need the whole groups to be involved in this, in my opinion...

@dildik again, did you understand my worries about the symbol table at the lexer? Maybe it makes sense to remove the interaction between the lexer and the symbol table completely, so that the parser takes care of things like resolving symbols in the correct block context.

damlabeyaz commented 11 years ago

The problem is that the interfaces should be decided by tomorrow; therefore I cannot wait. I uploaded our interface proposals in the src directory under lexer. I have already defined a proposed Token class.

flofreud commented 11 years ago

@dildik What is the whitespace token needed for? I think the grammar needs whitespace only to separate tokens (and the lexer can ignore repeated whitespace).

jvf commented 11 years ago

The fuc group, when we discussed this last Thursday, was indeed in favor of a getNextToken() method, to make the design as usable as possible for the lecture.

fb13 commented 11 years ago

@jvf Right. We should stick as closely as possible to the lecture book. And as I remember, the getNextToken() method is the one Fehr used in her lecture.

Tkrauss commented 11 years ago

:+1: for the getNextToken() method. There seems to be a consensus about the definition of signed numbers, which I don't support without reservation... The grammar contains the unary "-" operator. Combined with the alternative of signing an int, we can interpret "-2" as two tokens (- and 2) or as one token (-2), depending on whether we see it as a signed int or a negated unsigned int. Thus we have an ambiguity that we should remove asap imho. One solution would be to allow only unsigned reals/ints/etc.

EsGeh commented 11 years ago

A few thoughts:

  1. Thoughts about getNextToken vs. the whole token stream at once:
     a) getNextToken: We should also consider that there will be a view class which visualizes a token stream. If we choose getNextToken(), I am not sure how this can be achieved; it would add an amount of heterogeneity to our design. We could not use the same solution as for the other modules in the toolchain, right? If we choose this method, I propose to open a new wiki page and a corresponding issue to work out how the communication between lexer, parser and token-stream visualizer is supposed to work.
     b) Whole token stream at once: One pro of the design where we return the whole token list at once is simplicity. We could use the same visualisation solution as for the other modules in the toolchain.
  2. Shouldn't the lexer be able to parse a numeric constant like "23" into an int? If so, the current design is inappropriate, because a token cannot carry an integer (23, in the example), only strings. That is why the lecture introduced a "Klassenhierarchie für Token" (a class hierarchy for tokens). Flofreud already mentioned this point, but no one seems to care. I think it's an important one. (If we agree on using the token hierarchy, I can make a new draft which uses one.)
EsGeh commented 11 years ago

3. The lexer needs to be able to make entries into the symbol table hierarchy, right? So it needs to hold a reference to it. It also needs a method setSymbolTable(SymbolTable table) to set the currently "active" symbol table (see the Symbol Table wiki). Again, if you agree, I can make a new design which includes it.

(shit, github recognizes enumerations, and starts them from 1. :-P)

Tkrauss commented 11 years ago

My opinion:

  1. It's pretty simple to convert between the two methods. Maybe the best solution is to support both: the lexer then offers getNextToken() and getTokenList(). I don't think it will be difficult to implement both, given an adequate implementation; this would meet both requirements with just a little overhead.
  2. The task of the lexer is just to tokenize the source code. Even if there are finely granulated definitions of numbers etc., it's none of the lexer's business to know what the string "123" means. As you already said, the lexer is not able to PARSE a numeric constant; that's the job of the PARSEr ;) Also: since the source language supports multiple scopes with multiple definitions, this may be something the parser should care about as well. If you have two blocks "{int a}{bool a}", the symbol a in the second block has no meaning before it is declared as a boolean variable. Thus the symbol table entries entered by the lexer can be wrong or misleading. I think it's best to return just the token stream (or single tokens, as I proposed in 1.) and leave the symbol table to the parser/semantic analysis.
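Point 1 can be sketched in a few lines: getTokenList() is just a loop over getNextToken(). The interface name, the String token type and the null-at-end-of-input convention here are assumptions for illustration, not the agreed design:

```java
import java.util.ArrayList;
import java.util.List;

interface Lexer {
    String getNextToken(); // returns null once the input is exhausted

    // The list view is derived from the streaming view for free.
    default List<String> getTokenList() {
        List<String> tokens = new ArrayList<>();
        for (String t = getNextToken(); t != null; t = getNextToken()) {
            tokens.add(t);
        }
        return tokens;
    }
}

// Minimal stand-in lexer, for demonstration purposes only.
class DummyLexer implements Lexer {
    private final String[] toks = {"int", "y"};
    private int i = 0;
    public String getNextToken() { return i < toks.length ? toks[i++] : null; }
}
```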
EsGeh commented 11 years ago

2. I think it would be nice if the lexer could do that. If you recall the lecture, there was a step saying "Erkenne Konstanten" (recognize constants). I am not insisting on this, but I still think it would be nice to let the lexer do that job and let the parser concentrate only on the aspects of the grammar. Further opinions?

Tkrauss commented 11 years ago

If we want to do so, we have to distinguish between the different classes; keeping just the one Token class won't be appropriate. My idea: use generics. There is one interface Token, parametrized by the generic parameter V. Since a token always has a string representation, a method String getValue() is specified; in addition, a method V getExactValue() returns an object of the generic type. So there would be a class StringToken implements Token which contains tokens like do, while, {, ... Another class would be RealToken implements Token. In that case we have all the information we need, and the parser can "peek" at the information it needs.
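The generics idea above, sketched in Java (class names as proposed; the constructor shape and parse-in-constructor choice are my own assumptions):

```java
// Token always has a string representation plus a typed value.
interface Token<V> {
    String getValue();  // the lexeme as it appears in the source
    V getExactValue();  // the value parsed into the exact type V
}

// Covers keyword/punctuation tokens like "do", "while", "{", ...
class StringToken implements Token<String> {
    private final String lexeme;
    StringToken(String lexeme) { this.lexeme = lexeme; }
    public String getValue() { return lexeme; }
    public String getExactValue() { return lexeme; }
}

class RealToken implements Token<Double> {
    private final String lexeme;
    RealToken(String lexeme) { this.lexeme = lexeme; }
    public String getValue() { return lexeme; }
    public Double getExactValue() { return Double.parseDouble(lexeme); }
}
```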

jvf commented 11 years ago
  1. I agree with @EsGeh, it is most definitely part of the lexer to recognize numeric tokens and calculate their numeric value during lexing. If you look at the examples from the dragon book, the tokens num and real have integer and float values as lexemes, not strings, and the numeric values are calculated during lexing (see Chapter 2 and Appendix A of the dragon book).
  2. In the dragon book, the unary minus is handled as a token with the name (tag) MINUS. That would imply, in my opinion, that the lexer only produces positive (i.e. unsigned) values (while the source language supports signed numeric constants), and it is the job of a later stage (probably IR code generation) to convert the tokens MINUS + numeric constant into a signed numeric constant.
  3. Concerning getNextToken and visualization: I agree, it would be easier for the visualization to have a complete token stream to visualize. But I think it is more important to stick to the design from the lecture, as our compiler is supposed to be used in the lecture, so the visualization has to work around this problem.
flofreud commented 11 years ago

It should be no problem to let the visualization use a separate getNextToken lexer instance and pull all tokens from it. The concerns of the visualization should only change the lexer if really necessary, and I do not see why it could not loop over the tokens from a lexer instance. If the visualization is used, the lexer is most likely used separately from the other modules anyway.

fb13 commented 11 years ago

@dildik and I just made an uml proposal for the lexer.

See https://github.com/swp-uebersetzerbau-ss13/common/wiki/Lexer.

We think it isn't necessary to have such RealToken, StringToken, etc. classes. We just have the Token class, which has methods to identify the type of the token. We return a generic value, and the parser can identify the token type with a method getTokenType().

flofreud commented 11 years ago

@fb13 and @dildik: Please look into #5, where we concluded that the - before a number should be a token of its own.

More of a detail, but in the grammar discussion the E-notation was added for writing bigger numbers, and we should probably allow only the dot, not dot or comma, as decimal separator. Also, the tokens for string and id can consist of more than one character, and an id shouldn't start with a number.

jvf commented 11 years ago

[...] it is most definitely part of the lexer to recognize numeric tokens and calculate their numeric value during lexing. If you look at the examples from the dragon book, the tokens num and real have integer and float values as lexemes, not strings, and the numeric values are calculated during lexing (see Chapter 2 and Appendix A of the dragon book).

As stated above, I don't think String is sufficient as the value type for tokens!

damlabeyaz commented 11 years ago

As stated above, I don't think String is sufficient as the value type for tokens!

You mean that we need special kinds of tokens? In the appendix of the dragon book, I see that string tokens and num tokens are separate. But why can't we just have one token class with a generic type as value?

flofreud commented 11 years ago

But why can't we just have one token class with a generic type as value?

Generics are not helpful in this case. Generic types in Java are a compile-time feature; due to type erasure, the type information is not accessible at runtime. We need the type information at runtime, and the static type of the field would be Object for a generic parameter without a type bound. The compiler cannot infer the generic type information in our case, so we need to cast explicitly. Alternatively, we could use an Object field and cast it, based on the token type, to the correct inner type, instead of casting the Token object to the correct token subtype.

Imho casting tokens is the cleaner design compared to casting the Object field. I would like a token hierarchy for num, real, bool and string (which covers all the rest).
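The hierarchy variant could look roughly like this; the parser checks the token type and downcasts the Token object itself. Class and enum names are illustrative, and parsing the value in the constructor is an assumption:

```java
enum TokenType { NUM, REAL, BOOL, STRING }

abstract class Token {
    final TokenType type;
    final String lexeme; // the raw text from the source
    Token(TokenType type, String lexeme) { this.type = type; this.lexeme = lexeme; }
}

class NumToken extends Token {
    final long value; // numeric value computed during lexing
    NumToken(String lexeme) { super(TokenType.NUM, lexeme); this.value = Long.parseLong(lexeme); }
}

class RealToken extends Token {
    final double value;
    RealToken(String lexeme) { super(TokenType.REAL, lexeme); this.value = Double.parseDouble(lexeme); }
}
```

On the parser side: check `t.type`, then cast `(NumToken) t` to reach the typed value.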

Tkrauss commented 11 years ago

My idea: use generics. There is one interface Token, parametrized by the generic parameter V. Since a token always has a string representation, a method String getValue() is specified; in addition, a method V getExactValue() returns an object of the generic type. So there would be a class StringToken implements Token which contains tokens like do, while, {, ... Another class would be RealToken implements Token. In that case we have all the information we need, and the parser can "peek" at the information it needs.

Just repeating my earlier idea... Imho this design will do the job.

bgrad commented 11 years ago

I think it's better to use only one special Token class instead of an interface and generics. We can declare the return type of the getValue method as Object, so we can put String, Real, Num, Int objects into it. The type of the value is stored as a TokenType, so we know which cast is necessary.
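This single-class alternative in a few lines (names illustrative; the constructor shape is an assumption):

```java
enum TokenType { NUM, REAL, STRING }

// One class for all tokens: the value is an Object, and the
// TokenType tells the parser which cast is safe.
class Token {
    private final TokenType type;
    private final Object value;

    Token(TokenType type, Object value) { this.type = type; this.value = value; }

    TokenType getTokenType() { return type; }
    Object getValue() { return value; }
}
```

On the parser side: `if (t.getTokenType() == TokenType.NUM) { long v = (Long) t.getValue(); }` — the cast of the Object field that flofreud's comment above argues against.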

jvf commented 11 years ago

Quite frankly, I don't get the fuss. Why not just a (small) token hierarchy, with a base token class and num, real etc. classes extending it? Why use an OO language if not for this?

fb13 commented 11 years ago

Instead of creating a separate enum type for "+" and "-" we could use ARITHMOP. For example:

-2

would result in

<ARITHMOP, "-", loc, coc><NUM, "2", loc, coc>

What do you think?

@kensan83 In fact I had the same idea, but in the "Dragonbook" (:D) they distinguish between token types, so we should do this as well...

fuc-thomas commented 11 years ago

@fb13 take a look: https://github.com/swp-uebersetzerbau-ss13/common/blob/master/interfaces/src/swp_compiler_ss13/common/lexer/lexer.jpeg

fb13 commented 11 years ago

@fuc-thomas Thanks, but this is exactly the UML that @dildik and I created and uploaded. So what? It isn't correct, because we are missing tokens for num, real, string and boolean, as this discussion shows. Also, it isn't clear whether we should use an ARITHMOP or create a SIGNOP for cases like "+3" or "-5".

bgrad commented 11 years ago

@jvf, @fb Well, then let us use a token hierarchy.

EsGeh commented 11 years ago

I added a proposal in the repo, under common/doc/lexer: lexerInterface

It is not yet complete.

fb13 commented 11 years ago

@EsGeh Why do we need a class LocationInSource when we could just place that information in the token itself? As the discussion showed, most people here would prefer NUM, STRING, REAL and BOOL tokens; why do you propose NUM, ID and SIMPLE?

@dildik just uploaded a new version of our UML:

New UML model

EsGeh commented 11 years ago

Well, Real and String are supposed to be added; anyone feel free to do so. As far as I know, from the view of the parser the tokens should represent the terminal symbols of the grammar. In the grammar there is no terminal "bool" but two terminals called "true" and "false". Since both do not carry additional information, SimpleToken can be used to represent them.

EsGeh commented 11 years ago

By the way, I don't think the terminal definitions (by which I mean the regular expressions/definitions) should be part of the interface, since other components don't need to know them.

akrillo89 commented 11 years ago

We removed the "," separator and added a num and a real type for exponential notation. lexer

What do you think about that?
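For concreteness, possible regular expressions for num and real with E-notation and a dot as the only decimal separator. These are a guess at the intended definitions, not the ones from the uploaded image:

```java
import java.util.regex.Pattern;

class NumberPatterns {
    // Unsigned integer: one or more digits.
    static final Pattern NUM = Pattern.compile("[0-9]+");

    // Unsigned real: digits, dot, digits, optional exponent part.
    static final Pattern REAL = Pattern.compile("[0-9]+\\.[0-9]+([eE][-+]?[0-9]+)?");
}
```

Note that, per the discussion in #5, neither pattern carries a leading sign; the unary minus stays a token of its own.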

flofreud commented 11 years ago

For strings you should define a begin and an end symbol ("?), with all (ASCII?) characters allowed between them; you could easily get the expressiveness of non-printable characters through Java. But I don't think that matching "..." is possible with a regex if \" is supposed to be allowed inside. If we want a regex lexer to remain possible, we could match with "[^"]*", but perhaps my assumption is wrong.

You still have to remove the (+|-) from the tokens. The unary minus needs to be handled as an extra token (as mentioned by the parser/semantic group).
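In fact a string literal with escaped quotes can still be matched by a regex: each inner element is either a character that is neither quote nor backslash, or a backslash followed by any character. A sketch, assuming backslash escapes are wanted (the simpler "[^"]*" suffices if \" is disallowed):

```java
import java.util.regex.Pattern;

class StringPattern {
    // "..." where the body is ([^"\\]|\\.)* : plain characters or escape pairs.
    static final Pattern STRING = Pattern.compile("\"([^\"\\\\]|\\\\.)*\"");
}
```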


akrillo89 commented 11 years ago

Well, what do you think about the decision to remove "-" from UNARYOP? This solves the overlap between UNARYOP and all regexps containing "-".

flofreud commented 11 years ago

We can't change the grammar. Why should the minus be in the regex for the numbers? It's unnecessary.
