inspirer / textmapper

Lexer and Parser generator
http://textmapper.org
MIT License
110 stars 25 forks source link

Always prefix tokens containing non-ASCII characters with "char". #35

Closed mewmew closed 4 years ago

mewmew commented 5 years ago

This helps avoid name collisions in the generated Go source code.

Given a grammar with the two tokens 'or' and '|', the textmapper now produces the following Go source code.

OR      // or
CHAR_OR // |

Note, this same method is used also to distinguish multi-character tokens containing non-ASCII characters, e.g. a '...' token is distinguished from a 'dotdotdot' token as per:

DOTDOTDOT      // dotdotdot
CHAR_DOTDOTDOT // ...

Fixes #31.

mewmew commented 5 years ago

After some thought, I think this PR needs more work before being merged.

In particular, it should prefix all non-identifiers with say "char". Currently, it only prefixes the set of identifiers containing any of the non-letter ASCII characters of the charName map.

Edit: rev https://github.com/inspirer/textmapper/pull/35/commits/9322fedb6f6e1e5cb9df1d3f1cc43777c61cea9a fixes the above-mentioned issue. Now, all tokens which are not valid Go identifiers are prefixed with "char".

Whether the char prefix should be used, or to select another one is to be determined. Perhaps tok makes more sense now, as it also handles multi-character tokens?

inspirer commented 5 years ago

Sorry for not getting to this earlier. I agree that we should find a solution to #31 but I find adding prefixes to all non-identifier tokens too harsh. I was thinking more in the direction of letting the user to specify the generated identifier inside the grammar but let me explain:

In my experience there are pretty much 6 types of tokens in all languages: keywords, punctuation, literals, identifiers, whitespaces, and comments. The first two types look best in a grammar if they are named after their matching text in single quotes. The rest use normal identifiers which can easily be renamed. All of them need to be unambiguously converted into uppercase identifiers in generated code. The biggest conflict typically takes place between keywords and punctuation. You propose adding a prefix to all punctuation tokens (char or tok), while a pretty much symmetrical solution could be adding a prefix to all the keywords (say kw_).

If we go from use cases, a pretty prominent one is building a recursive-descent parser on top of a Textmapper-generated lexer. There are some languages such as Scala and Kotlin, which were created without a reference LALR(1) grammar, and it is almost impossible to express them in lalr (they are greedy in some places and too chaotic/ad-hoc in other places). If you are coding such a parser, you reference tokens a lot. It will be a pain to write TOK_SEMICOLON instead of just SEMICOLON every single time.

So, my requirements/design priorities for the Go version are:

  1. One should be able to guess a generated name just by looking at the original name (most of the time).
  2. Explicit is better than implicit.
  3. Textmapper should not attempt to resolve conflicts for the user but rather flag them at compile time (the Java version used to silently add numbers at the end of an identifier).
  4. During the rewrite/transition period, identifiers generated by Go should 99% match identifiers generated by Java. We can deviate and have a flag controlling this behavior, and can flip it for new grammars + once the rewrite is over.

I see several solutions for this problem (and would probably implement a combination of 1 and 3):

  1. Let the user specify the generated identifier for each terminal symbol. There are two flavors of this:

    a) Specify it right in the token rule: 'or' (KW_OR) : /or/ '|' (OP_OR) : /|/

    b) Make it part of the options: customIDs = ['or' => "KW_OR", '|' => "OP_OR"]

The disadvantage of a) is that you will have to repeat it in each rule if there are more than one but since we are mostly talking about punctuation vs keywords - those typically have exactly one rule per token, so I think I prefer a).

  1. Always prefix quoted names with something (kw for identifier, tok for the rest, maybe even op_ for <>*/-+etc)

  2. Disallow conflicts between quoted and normal names at compile time. Solve keyword vs auto-generated punctuation names by adding prefixes. Don't add prefixes for unambiguous identifiers.

'': /\/ # gets MULT 'or': /or/ # gets KW_OR '|': /|/ # gets TOK_OR

Thoughts?

mewmew commented 5 years ago

Hi Evgeny!

Thanks the taking this issue seriously and giving it potential solutions to the problem thorough thoughts.

I agree with you in large, and would not mind at all if there was an option for grammar writers to be explicit about the token name to handle disambiguation. I also think it makes sense to follow the Go convention of being explicit about things.

As such, I think it would be a good solution to have tm report an error for disambiguous tokens, as it already does, and then let the user disambiguate them. I think it would make sense to update the error message to help the user along, either with a direct link to docs, or a short sentence along the lines of:

ll.tm:576:1: '|' and 'or' get the same ID in generated code;
   to disambiguate tokens, assign an explicit ID (see https://textmapper.org/documentation.html#lexer-explicit-token-id)

So, this would correspond to your proposed 1a solution.

a) Specify it right in the token rule:
'or' (KW_OR) : /or/
'|' (OP_OR) : /|/

I would prefer 1a over 1b, as it lets you see all your mappings and token definitions in one place, but if it makes more sense to use 1b (i.e. no need to extend the grammar of tm) then I would understand.

Excited to see tm making good progress.

Do you have any place tracking the rough set of items left to complete for the Go parser generator?

I'd love to take part of such a roadmap.

Cheerful regards, Robin

mewmew commented 4 years ago

The proposed solution in 1a is a better way to solve this issue than the one implemented in this PR. Thus, closing this PR in favour of 1a.