Lexer Specification - Githubissues

flofreud commented 11 years ago

Comments and discussion for the lexer. See wiki page: Lexer

EsGeh commented 11 years ago

I have a few questions about the interface design proposed by flofreud & Co.

What do the methods "Token.getValue()" and "Token.getExactValue()" do? What is the return type "T" of "getExactValue()"? If this is meant to define an interface, it would be nice to explain that.
What is the point of deriving from the Token class if the subclasses do not provide any specialisation?
What do the regular expressions for the terminals have to do with the lexer INTERFACE?

Tkrauss commented 11 years ago

@flofreud correct me if I'm wrong:

The Token class has a generic parameter T.
The subclasses specialise it by fixing T when they extend Token (RealToken with Double, NumToken with Int, etc.).
I don't get your point. There is no regexp in the lexer interface; there is just a regexp representation for each token, because a token is defined by a regexp.

akrillo89 commented 11 years ago

a few updates, thanks to @flofreud lexer

flofreud commented 11 years ago

I think the regex, if present, should be a comment, not part of the enumeration. I'm not sure the lexer can be regex-based for this language if we define strings the same way Java does. The token definition is part of the terminal definition for the grammar, where some input is still missing (see https://github.com/swp-uebersetzerbau-ss13/common/wiki/Grammar)

(Email reply to Tkrauss's comment above, dated Tue, Apr 23, 2013, 5:24 PM.)

EsGeh commented 11 years ago

Thank you for the explanation. I understand the idea of the template parameter. This makes "Token" a class template, which has to be specialised for a concrete Token.

But how can I have a Token of the type "ARITHMOP" then? The class template "Token" can't be used directly. Do I have to specialise "Token" for every TokenType?
What would "Token.getExactValue()" return, in case of an "ARITHMOP", for example?

Have you considered taking into account the solution we proposed? I think the idea of using a class template could be merged with it.

Tkrauss commented 11 years ago

I think the idea is, that for every "boring" token like "+", "if" etc. the StringToken is used. We are able to distinguish them reading the TokenType. "getExactValue()" would return "+"... that's not of interest, but it's compatible with the idea that every Token represents a subsequence of the source.
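A minimal Java sketch of the design described so far may help: a generic Token whose subclasses fix T, with a StringToken covering the "boring" tokens. The constructor shapes, the TokenType constants and the parsing inside getExactValue() are assumptions made for illustration, as is the reading that getValue() returns the raw lexeme; the thread itself only names the classes and the two getters.

```java
// Hypothetical sketch: only the class and getter names come from the thread.
enum TokenType { NUM, REAL, ARITHMOP, IF }

abstract class Token<T> {
    private final TokenType type;
    private final String lexeme;

    Token(TokenType type, String lexeme) {
        this.type = type;
        this.lexeme = lexeme;
    }

    TokenType getType() { return type; }

    // Assumed meaning: the subsequence of the source this token was read from.
    String getValue() { return lexeme; }

    // The lexeme converted to the value type chosen by the subclass.
    abstract T getExactValue();
}

class NumToken extends Token<Integer> {
    NumToken(String lexeme) { super(TokenType.NUM, lexeme); }
    @Override Integer getExactValue() { return Integer.parseInt(getValue()); }
}

class RealToken extends Token<Double> {
    RealToken(String lexeme) { super(TokenType.REAL, lexeme); }
    @Override Double getExactValue() { return Double.parseDouble(getValue()); }
}

// Used for every token without a special value type, such as "+" or "if".
class StringToken extends Token<String> {
    StringToken(TokenType type, String lexeme) { super(type, lexeme); }
    @Override String getExactValue() { return getValue(); }
}

class GenericTokenDemo {
    public static void main(String[] args) {
        Token<String> plus = new StringToken(TokenType.ARITHMOP, "+");
        Token<Integer> seven = new NumToken("7");
        System.out.println(plus.getType() + " " + plus.getExactValue());
        System.out.println(seven.getType() + " " + seven.getExactValue());
    }
}
```

In this reading, tokens such as "+" are told apart by their TokenType and lexeme, while only NUM and REAL tokens carry a value whose type differs from String.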

EsGeh commented 11 years ago

Is the proposal in the repo? Since we both seem to use dia, I could make a version merging both designs

EsGeh commented 11 years ago

This is my proposal: common/doc/lexer/* lexerInterface

It uses Java enum magic to merge the idea of having a Token hierarchy (needed so that the lexer can recognise numerical constants and ids) with the possibility of a switch statement over tokens. To decide which downcast is possible, "isNumToken()", "is...Token()", ... can be used. Class derivation is used only where it is needed.
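One way to read this proposal in Java is sketched below: the enum carries both the token's name and flags saying which downcast is valid, so a parser can switch over the type and cast only where the flag allows it. The constant names, the boolean fields, the PlainToken and SimpleNumToken classes and the describe() helper are all hypothetical; the thread only names isNumToken(), getType() and getName().

```java
// Hypothetical sketch of the enum-based proposal; names are assumptions.
enum TokenType {
    PLUS("+", false, false),
    IF("if", false, false),
    ID("id", false, false),
    NUM("num", true, false),
    REAL("real", false, true);

    private final String label;
    private final boolean numToken;
    private final boolean realToken;

    TokenType(String label, boolean numToken, boolean realToken) {
        this.label = label;
        this.numToken = numToken;
        this.realToken = realToken;
    }

    String getName() { return label; }
    boolean isNumToken() { return numToken; }   // downcast to NumToken is valid
    boolean isRealToken() { return realToken; } // downcast to a real-valued token is valid
}

interface Token {
    TokenType getType();
}

// Derivation is used only where a token carries an associated value.
interface NumToken extends Token {
    long getValue();
}

class PlainToken implements Token {
    private final TokenType type;
    PlainToken(TokenType type) { this.type = type; }
    @Override public TokenType getType() { return type; }
}

class SimpleNumToken implements NumToken {
    private final long value;
    SimpleNumToken(long value) { this.value = value; }
    @Override public TokenType getType() { return TokenType.NUM; }
    @Override public long getValue() { return value; }
}

class EnumTokenDemo {
    static String describe(Token token) {
        TokenType type = token.getType();
        if (type.isNumToken()) {
            // The flag on the enum says this cast is allowed.
            return "num " + ((NumToken) token).getValue();
        }
        switch (type) {
            case PLUS: return "plus";
            case IF:   return "keyword if";
            default:   return type.getName();
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(new PlainToken(TokenType.PLUS)));
        System.out.println(describe(new SimpleNumToken(7)));
    }
}
```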

flofreud commented 11 years ago

I cleaned the proposal a bit: lexerInterface

I removed the IdToken and StringToken interfaces, because the idea was to provide a way to get certain tokens parsed into the correct type at the lexer level. If the parser wants to, it can always get the string representation of any token via the getValue() method.

What do the regular expressions for the terminals have to do with the lexer INTERFACE?

I removed them, because the definition should be stated in the grammar and is not part of the interface (it is an implementation detail).

Tkrauss commented 11 years ago

Do we really need a method like "getAsNumToken()"? I've never seen such a design until now, where a regular cast is done by a convert-method of the super class...

flofreud commented 11 years ago

No, we don't need this, but I didn't want to discuss this over and over with the deadline two days away.

(Email reply to Tkrauss's comment above, dated Wed, Apr 24, 2013, 9:59 AM.)

Tkrauss commented 11 years ago

Hence I thought we'd stay with the last update from @akrillo89, which seems to be the best, imho. Anyway, I agree to the changed one if it helps to fix the design :)

flofreud commented 11 years ago

@Tkrauss: Me too (without StringToken (unnecessary)). It contains all important facts.

+1 for last update of @akrillo89

akrillo89 commented 11 years ago

I removed StringToken because I think you are right. In addition, I removed the regexp too (it's a part of the grammar). lexer

EsGeh commented 11 years ago

There are several design problems in akrillo89's version that I have already pointed out, and they are still there in the "cleaned up" version of the merged design I proposed. For some tokens "Token.getTypedValue()" just makes no sense, e.g. the token <+>. What is "Token.getTypedValue()" supposed to return for this token? The same as Token.getName()? (which would return "+" as far as I understood). This would result in this situation:

< + >.getValue() == "+"
< + >.getTypedValue() == "+"

but also

< num, 7 >.getValue() == "num"
< num, 7 >.getTypedValue() == 7

Also, the method "Token.getValue()" is not necessary, because it just duplicates the method "TokenType.getName()". With the design I proposed it is like this:

< + >.getType().getName() == "+"

and for tokens with an associated value (like num):

< num, 7 >.getType().getName() == "num"
< num, 7 >.getValue() == 7

flofreud commented 11 years ago

getValue() returns the lexeme that was read:

< num, 7 >.getValue() == "7"
< num, 7 >.getTypedValue() == 7

For all types without a specialised interface, both methods return the string representation:

< string, 'foo'>.getValue() == "foo"
< string, 'foo'>

For tokens like ARITHMOP this would be:

< ARITHMOP, '+'>.getValue() == "+"
< ARITHMOP, '+'>.getTypedValue() == "+"

akrillo89 commented 11 years ago

@EsGeh can you explain why you prefer TokenType.getName() to Token.getValue()?

flofreud commented 11 years ago

These would be two different things. TokenType.getName() on the STRING TokenType would be 'string', and getValue() would return the string that was read.

I think there was a misunderstanding about the meaning of the methods, because we have no javadoc, only these "cool" diagrams to discuss the design.

EsGeh commented 11 years ago

How can I declare a list of Tokens? "Token", if defined as a template class, cannot play the role of a base for all tokens; as far as I know, list < Token > is not possible. Next try: list < Token < ? > >. Hmm, I am not sure whether this would be legal in Java. But it is too general anyway: the list could contain a Token < File >, which makes no sense at all.

One can use a generic class when IMPLEMENTING the interfaces, but for the interfaces themselves I think it is better to avoid generics. You use interfaces to specify the behavior of classes. Generic interfaces tend to impose too few restrictions, because you cannot know from the interface which type the type parameter has to be.
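On the Java question raised here: a list typed with an unbounded wildcard, List<Token<?>>, is legal Java. The sketch below, using a hypothetical stand-in Token class and a made-up sampleTokens() helper, shows both that it compiles and that the "too general" objection holds, since nothing stops a Token<File> from being added and a reader of the list only sees Object.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the generic Token discussed in the thread.
class Token<T> {
    private final String lexeme;
    private final T typedValue;

    Token(String lexeme, T typedValue) {
        this.lexeme = lexeme;
        this.typedValue = typedValue;
    }

    String getValue() { return lexeme; }
    T getTypedValue() { return typedValue; }
}

class WildcardListDemo {
    static List<Token<?>> sampleTokens() {
        List<Token<?>> tokens = new ArrayList<>();
        tokens.add(new Token<Long>("7", 7L));
        tokens.add(new Token<String>("+", "+"));
        // Also accepted by the compiler, although it is meaningless for a lexer.
        tokens.add(new Token<File>("x", new File("x")));
        return tokens;
    }

    public static void main(String[] args) {
        for (Token<?> token : sampleTokens()) {
            // Through the wildcard, the typed value is only known to be an Object.
            Object typed = token.getTypedValue();
            System.out.println(token.getValue() + " -> " + typed.getClass().getSimpleName());
        }
    }
}
```

A bounded type parameter on the class (for example restricting T to a small set of value types) would narrow what can be stored, but that is a design choice the thread does not settle.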

EsGeh commented 11 years ago

I think it would be nicer to have different enum values for "+", "-", "*", and "/". While parsing, one may want to find out whether a Token is a <+>. It would be nice to be able to write: if(token.getType() == PLUS) ... The enum type could then tell you everything about which kind of token it is. There should still be a way to find out which downcast is available; for that I proposed the enumeration methods "TokenType.isNumToken()", ....

akrillo89 commented 11 years ago

I think the difference is

if(token.getType() == PLUS)

or

if(token.getType() == ARITHMOP && token.getValue().equals("+"))

There is not much of a gain. It would be very confusing if we created a TokenType for every subtype.

EsGeh commented 11 years ago

@flofreud, the version in the repo looks quite good to me. I think it still lacks a way to decide which downcast is available for a specific Token. Two options:

It could be made part of the enum type (enums in Java can act nearly like classes). That is sensible, because it can be derived from the enum value.
It could be made part of the Token class.

I think UML is quite good for discussing interface designs.

Tkrauss commented 11 years ago

Well, I don't get your point, so I have to ask whether you know the instanceof operator. It gives you exactly the information you are trying to put into the enum class, namely whether a cast is possible.

flofreud commented 11 years ago

I think Java's instanceof is well suited for this. instanceof NumToken is only useful for TokenType NUM. I see your point about making this information explicit, but I can't imagine where the parser group would need it, because the interface convention is to ask for the type only if needed. To provide the information in TokenType, we would have to define three booleans in the constructor for every TokenKind, without a real need for them.

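switch (token.getType()) {
case ID:
    ...
    token.getValue();
    ...
    break;
case NUM:
    ...
    // check before casting, then read the typed value from the cast reference
    if (token instanceof NumToken) {
        NumToken nt = (NumToken) token;
        nt.getLongValue();
    }
    ...
    break;
}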

You could also assume the lexer implements correctly and cast directly.

akrillo89 commented 11 years ago

lexer

swp-uebersetzerbau-ss13 / common

Lexer Specification #3