distinguish `identifier` from word token

brandonspark commented 1 year ago

What

This PR simply isolates identifier into two different nonterminals, identifier and word_identifier. Both have the same content, so this does not change the meaning of the grammar.

Why

I am a program analysis engineer at Semgrep, and one thing we do quite often is manipulate tree-sitter grammars so that we can parse slightly modified grammars of programming languages, usually by extending certain constructs.

A perennial problem that has been happening for the Julia language is that Semgrep metavariables cannot always be put in the same place as identifiers, even though this is desirable (and intuitive) behavior for Semgrep users. Our usual method for ensuring this behavior is to overload identifier to be able to match a Semgrep metavariable, but this presents an issue for several reasons: 1) identifiers can appear wherever an interpolation expression can, but interpolation expressions cannot appear anywhere an identifier does. This means that overloading interpolation_expression (which we are currently doing) is not general enough. We could change the grammar, but this would require hard-coded anywhere an interpolation_expression appears to allow an identifier, which would be many places. It's better to go to the identifier nonterminal itself. 2) trying to make identifier possibly match a Semgrep metavariable regex can produce a partial match in the middle of such an identifier, for instance in the interpolation expression $Pack. This is an interpolation expression, but augmenting identifier in this way will cause it to parse as a metavariable $P, followed by an identifier ack. This persists because tree-sitter does not support regex assertions, including word boundary assertions. 3) Another approach is to make identifier possibly a choice of an external token, which would allow us to take care of the distinction in the scanner. This would ordinarily be possible, with the exception of the fact that the identifier symbol is actually also what is known as the word token in tree-sitter ( see https://tree-sitter.github.io/tree-sitter/creating-parsers#keyword-extraction ). This is coded in the logic of tree-sitter to be unable to be a nonterminal, meaning that it is impossible to augment identifier with another nonterminal.

How:

Of these solutions, it is easiest to fix 3. This PR simply separates out identifier into word_identifier, a symbol that is meant to be used as the word token, and identifier, which is set to be the same as it. Notably, however, this means that Semgrep can "hack" the identifier symbol and change it, without changing the word token, thus allowing identifier to be a nonterminal of an external symbol.

That's what this PR does.

Test plan:

tree-sitter test, and the grammar should be the same.

aryx commented 1 year ago

argh, I can't even review the PR, github can now show the diff ... even if I select just the diff for grammar.js ... grrrr

brandonspark commented 1 year ago

@aryx here's a screenshot that should help:

aryx commented 1 year ago

it works now! I could see the diff.

tree-sitter / tree-sitter-julia