This PR simply isolates identifier into two different nonterminals, identifier and word_identifier. Both have the same content, so this does not change the meaning of the grammar.
Why
I am a program analysis engineer at Semgrep, and one thing we do quite often is manipulate tree-sitter grammars so that we can parse slightly modified grammars of programming languages, usually by extending certain constructs.
A perennial problem that has been happening for the Julia language is that Semgrep metavariables cannot always be put in the same place as identifiers, even though this is desirable (and intuitive) behavior for Semgrep users. Our usual method for ensuring this behavior is to overload identifier to be able to match a Semgrep metavariable, but this presents an issue for several reasons:
1) identifiers can appear wherever an interpolation expression can, but interpolation expressions cannot appear anywhere an identifier does. This means that overloading interpolation_expression (which we are currently doing) is not general enough. We could change the grammar, but this would require hard-coded anywhere an interpolation_expression appears to allow an identifier, which would be many places. It's better to go to the identifier nonterminal itself.
2) trying to make identifier possibly match a Semgrep metavariable regex can produce a partial match in the middle of such an identifier, for instance in the interpolation expression $Pack. This is an interpolation expression, but augmenting identifier in this way will cause it to parse as a metavariable $P, followed by an identifier ack. This persists because tree-sitter does not support regex assertions, including word boundary assertions.
3) Another approach is to make identifier possibly a choice of an external token, which would allow us to take care of the distinction in the scanner. This would ordinarily be possible, with the exception of the fact that the identifier symbol is actually also what is known as the word token in tree-sitter ( see https://tree-sitter.github.io/tree-sitter/creating-parsers#keyword-extraction ). This is coded in the logic of tree-sitter to be unable to be a nonterminal, meaning that it is impossible to augment identifier with another nonterminal.
How:
Of these solutions, it is easiest to fix 3. This PR simply separates out identifier into word_identifier, a symbol that is meant to be used as the word token, and identifier, which is set to be the same as it. Notably, however, this means that Semgrep can "hack" the identifier symbol and change it, without changing the word token, thus allowing identifier to be a nonterminal of an external symbol.
That's what this PR does.
Test plan:
tree-sitter test, and the grammar should be the same.
What
This PR simply isolates
identifier
into two different nonterminals,identifier
andword_identifier
. Both have the same content, so this does not change the meaning of the grammar.Why
I am a program analysis engineer at Semgrep, and one thing we do quite often is manipulate tree-sitter grammars so that we can parse slightly modified grammars of programming languages, usually by extending certain constructs.
A perennial problem that has been happening for the Julia language is that Semgrep metavariables cannot always be put in the same place as identifiers, even though this is desirable (and intuitive) behavior for Semgrep users. Our usual method for ensuring this behavior is to overload
identifier
to be able to match a Semgrep metavariable, but this presents an issue for several reasons: 1) identifiers can appear wherever an interpolation expression can, but interpolation expressions cannot appear anywhere an identifier does. This means that overloadinginterpolation_expression
(which we are currently doing) is not general enough. We could change the grammar, but this would require hard-coded anywhere aninterpolation_expression
appears to allow anidentifier
, which would be many places. It's better to go to theidentifier
nonterminal itself. 2) trying to makeidentifier
possibly match a Semgrep metavariable regex can produce a partial match in the middle of such an identifier, for instance in the interpolation expression$Pack
. This is an interpolation expression, but augmentingidentifier
in this way will cause it to parse as a metavariable$P
, followed by an identifierack
. This persists becausetree-sitter
does not support regex assertions, including word boundary assertions. 3) Another approach is to makeidentifier
possibly achoice
of an external token, which would allow us to take care of the distinction in the scanner. This would ordinarily be possible, with the exception of the fact that theidentifier
symbol is actually also what is known as the word token intree-sitter
( see https://tree-sitter.github.io/tree-sitter/creating-parsers#keyword-extraction ). This is coded in the logic oftree-sitter
to be unable to be a nonterminal, meaning that it is impossible to augmentidentifier
with another nonterminal.How:
Of these solutions, it is easiest to fix 3. This PR simply separates out
identifier
intoword_identifier
, a symbol that is meant to be used as the word token, andidentifier
, which is set to be the same as it. Notably, however, this means that Semgrep can "hack" theidentifier
symbol and change it, without changing the word token, thus allowingidentifier
to be a nonterminal of an external symbol.That's what this PR does.
Test plan:
tree-sitter test
, and the grammar should be the same.