inspirer / textmapper

Lexer and Parser generator
http://textmapper.org
MIT License
110 stars 25 forks source link

tm-go: avoid collision of token names for character literals; error: '|' and 'or' get the same ID in generated code #31

Open mewmew opened 5 years ago

mewmew commented 5 years ago

~I tried running the latest version of Textmapper today (the Go version at rev 9261aa298dfe573c6985cc2b92eed8427f202f61).~

Doing so, I received the following error for the grammar ll.tm:

u@x1 ~/D/g/s/g/l/ll2> ~/goget/bin/textmapper generate ll.tm 
ll.tm:576:1: '|' and 'or' get the same ID in generated code

This is because the LLVM IR grammar contains both | and or as two distinct keywords/tokens.

Edit: My suggestion would be to use token.Pipe for |, however, I also realize that a more general solution may be required.

Edit2: @inspirer, any idea what the right approach would be to resolve this?

I update to Textmapper rev cd73b6bdd3ea947ee86f68ae4dc8b371660e20a3 today and the issue is still present.

$ textmapper generate ll.tm
ll.tm:582:1: '|' and 'or' get the same ID in generated code

Would it make sense to add a prefix to these generated token IDs to avoid collision? E.g. | -> Character_Or (or Character_Pipe)?

inspirer commented 5 years ago

Commented on the pull request https://github.com/inspirer/textmapper/pull/35#issuecomment-557939771

mewmew commented 4 years ago

For reference to others coming across this issue, @inspirer wrote a detailed response to this issue in PR #35, which is included below:

Sorry for not getting to this earlier. I agree that we should find a solution to #31 but I find adding prefixes to all non-identifier tokens too harsh. I was thinking more in the direction of letting the user to specify the generated identifier inside the grammar but let me explain:

In my experience there are pretty much 6 types of tokens in all languages: keywords, punctuation, literals, identifiers, whitespaces, and comments. The first two types look best in a grammar if they are named after their matching text in single quotes. The rest use normal identifiers which can easily be renamed. All of them need to be unambiguously converted into uppercase identifiers in generated code. The biggest conflict typically takes place between keywords and punctuation. You propose adding a prefix to all punctuation tokens (char or tok), while a pretty much symmetrical solution could be adding a prefix to all the keywords (say kw_).

If we go from use cases, a pretty prominent one is building a recursive-descent parser on top of a Textmapper-generated lexer. There are some languages such as Scala and Kotlin, which were created without a reference LALR(1) grammar, and it is almost impossible to express them in lalr (they are greedy in some places and too chaotic/ad-hoc in other places). If you are coding such a parser, you reference tokens a lot. It will be a pain to write TOK_SEMICOLON instead of just SEMICOLON every single time.

So, my requirements/design priorities for the Go version are:

1. One should be able to guess a generated name just by looking at the original name (most of the time).

2. Explicit is better than implicit.

3. Textmapper should not attempt to resolve conflicts for the user but rather flag them at compile time (the Java version used to silently add numbers at the end of an identifier).

4. During the rewrite/transition period, identifiers generated by Go should 99% match identifiers generated by Java. We can deviate and have a flag controlling this behavior, and can flip it for new grammars + once the rewrite is over.

I see several solutions for this problem (and would probably implement a combination of 1 and 3):

1. Let the user specify the generated identifier for each terminal symbol. There are two flavors of this:

a) Specify it right in the token rule: 'or' (KW_OR) : /or/ '|' (OP_OR) : /|/

b) Make it part of the options: customIDs = ['or' => "KW_OR", '|' => "OP_OR"]

The disadvantage of a) is that you will have to repeat it in each rule if there are more than one but since we are mostly talking about punctuation vs keywords - those typically have exactly one rule per token, so I think I prefer a).

1. Always prefix quoted names with something (kw_ for identifier, tok_ for the rest, maybe even op_ for <>*/-+etc)

2. Disallow conflicts between quoted and normal names at compile time. Solve keyword vs auto-generated punctuation names by adding prefixes. Don't add prefixes for unambiguous identifiers.

'': // # gets MULT 'or': /or/ # gets KW_OR '|': /|/ # gets TOK_OR

Thoughts?

(I'll include my response below too, to have the conversation in one place before closing #35).

Hi Evgeny!

Thanks the taking this issue seriously and giving it potential solutions to the problem thorough thoughts.

I agree with you in large, and would not mind at all if there was an option for grammar writers to be explicit about the token name to handle disambiguation. I also think it makes sense to follow the Go convention of being explicit about things.

As such, I think it would be a good solution to have tm report an error for disambiguous tokens, as it already does, and then let the user disambiguate them. I think it would make sense to update the error message to help the user along, either with a direct link to docs, or a short sentence along the lines of:

ll.tm:576:1: '|' and 'or' get the same ID in generated code;
   to disambiguate tokens, assign an explicit ID (see https://textmapper.org/documentation.html#lexer-explicit-token-id)

So, this would correspond to your proposed 1a solution.

a) Specify it right in the token rule:
'or' (KW_OR) : /or/
'|' (OP_OR) : /|/

I would prefer 1a over 1b, as it lets you see all your mappings and token definitions in one place, but if it makes more sense to use 1b (i.e. no need to extend the grammar of tm) then I would understand.

Excited to see tm making good progress.

Do you have any place tracking the rough set of items left to complete for the Go parser generator?

I'd love to take part of such a roadmap.

Cheerful regards, Robin