slackhq / tree-sitter-hack

Hack grammar for tree-sitter
MIT License

Support embedded expressions/braces in double quoted strings #26

Open Nicholas-Lin opened 3 years ago

Nicholas-Lin commented 3 years ago

Summary

This PR adds support for embedded expressions and embedded braces in double-quoted strings. It addresses a similar issue to PR #25, but it also supports embedded expressions, and the implementation is done entirely in grammar.json (not scanner.cc).

Here are some examples of the constructs that are now supported:

"$var";
"$var[subscript]";
"$var->member";
"{$var->prop}";
"{$var->prop["key"]}";

I also added support for escape character sequences so the following examples should parse correctly:

"\$notavar";  <-- string literal
"\\\\\$notavar"; <-- string literal
"\\\{$embedexp}";  <-- should be identified as an embedded expression since the brace is escaped
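
A minimal sketch of the rule these escape examples rely on, assuming the usual convention that a character is escaped when an odd number of backslashes immediately precedes it. This is plain JavaScript for illustration, not the grammar's actual implementation; the helper name `isEscaped` and the third sample (`\\$var`) are hypothetical and do not come from the PR.

```javascript
// Returns true when the character at `index` is preceded by an odd number of
// consecutive backslashes, meaning the last backslash escapes it.
function isEscaped(content, index) {
  let backslashes = 0;
  for (let i = index - 1; i >= 0 && content[i] === '\\'; i--) {
    backslashes++;
  }
  return backslashes % 2 === 1;
}

// String.raw keeps the backslashes exactly as they would appear in Hack source.
const one = String.raw`\$notavar`;       // 1 backslash before $
const five = String.raw`\\\\\$notavar`;  // 5 backslashes before $
const two = String.raw`\\$var`;          // 2 backslashes before $ (hypothetical case)

console.log(isEscaped(one, one.indexOf('$')));   // true: $ is literal text
console.log(isEscaped(five, five.indexOf('$'))); // true: $ is literal text
console.log(isEscaped(two, two.indexOf('$')));   // false: $var would be a variable
```

Counting backslashes backwards from the `$` is the simplest way to express the rule in ordinary code; a grammar would instead consume each escape sequence as its own token so that the remaining `$` is never preceded by an unpaired backslash.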

Initially there were some issues with the parser incorrectly interpreting #, //, or /* inside a string as the start of a comment, but this should not be a problem anymore!


CLAassistant commented 3 years ago

CLA assistant check
All committers have signed the CLA.

Nicholas-Lin commented 3 years ago

I noticed that embedded braces do not support scoped identifiers. For example, the following test case will fail:

"{$var::get()}";

Not sure whether this should be addressed in this PR or in a separate one, since this PR already has quite a few changes.

frankeld commented 3 years ago

This is good progress, but there are some extended test cases that seem to break it. For example, "Test $var->tester- Hello"; produces an error, but the parser should read $var->tester as an embedded member selection expression.

Also, we have some inconsistency with the way we nest expression items. Consider the following test case/output:

"{$var->fun->yum}";

$var->fun->yum;
(selection_expression [5, 1] - [5, 15]
  (variable [5, 1] - [5, 5])
  (selection_expression [5, 7] - [5, 15]
    (qualified_identifier [5, 7] - [5, 10]
      (identifier [5, 7] - [5, 10]))
    (qualified_identifier [5, 12] - [5, 15]
      (identifier [5, 12] - [5, 15]))))

(selection_expression [7, 0] - [7, 14]
  (selection_expression [7, 0] - [7, 9]
    (variable [7, 0] - [7, 4])
    (qualified_identifier [7, 6] - [7, 9]
      (identifier [7, 6] - [7, 9])))
  (qualified_identifier [7, 11] - [7, 14]
    (identifier [7, 11] - [7, 14])))

In the double-quoted string case, the variable sits one level up, as a sibling of a nested selection expression that groups fun->yum. This is incorrect, as the variable is not being selected against the value of fun->yum. The non-embedded version is parsed correctly: the leading variable is at the deepest level of the nested selection. The same inconsistency shows up with heredoc variable substitution, which may be where we are inheriting it from.
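
The expected shape can be sketched as a left fold over the member chain, so that each new selection wraps everything parsed so far. This is only an illustration in plain JavaScript under that assumption; `nestSelections` and `toSexp` are hypothetical helpers, not part of the grammar, and positions are omitted from the output.

```javascript
// Fold the member chain left to right: each step wraps the expression built
// so far, which leaves the variable in the innermost selection_expression.
function nestSelections(variable, members) {
  return members.reduce(
    (object, member) => ({ type: 'selection_expression', object, member }),
    { type: 'variable', name: variable }
  );
}

// Print a node in the same S-expression style as the parser output.
function toSexp(node) {
  if (node.type === 'variable') return '(variable)';
  return `(selection_expression ${toSexp(node.object)} (qualified_identifier (identifier)))`;
}

console.log(toSexp(nestSelections('$var', ['fun', 'yum'])));
// (selection_expression (selection_expression (variable) (qualified_identifier (identifier))) (qualified_identifier (identifier)))
```

The printed tree matches the structure of the non-embedded output: the outer selection picks yum from the result of $var->fun, rather than pairing $var with a separate fun->yum node.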

cfroystad commented 3 years ago

In case it could be helpful, I've implemented string parsing for PHP in the tree-sitter-php repository. Please use whatever is useful to you: https://github.com/tree-sitter/tree-sitter-php/pull/72

aosq commented 3 years ago

> Also, we have some inconsistency with the way we nest expression items

Started looking into this and you're right that the inconsistency comes from $.heredoc. I originally wrote the custom embedded braced expression rules (instead of, say, reusing $.call_expression, $.subscript_expression, $.selection_expression) for two reasons: embedded braced expressions are restricted to expressions that start with a $.variable, and a valid embedded braced expression can't have a space between { and $.

Reusing existing call/subscript/selection definitions

Previously, I thought reusing the existing call/subscript/selection rules would allow invalid scenarios. Thinking on this a little more, I realized that's not the case: https://github.com/slackhq/tree-sitter-hack/pull/29

https://github.com/slackhq/tree-sitter-hack/blob/8ac0c52d6b5747b99f512fb9847eb9fb6eaa9946/grammar.js#L154-L164

Replacing $.embedded_brace_expression with the already defined call/subscript/selection rules fixes the issue you described for heredocs, but I think this only works because heredocs use a scanner. Don't think we could apply the same fix to $.string without a scanner.

Scanner hack

One way to make the simplified version of $.embedded_brace_expression work for both heredoc and string, without resorting to a scanner for string content, is to create a scanner node just for the { character of the embedded braced expression. This would allow us to use a simplified $.embedded_brace_expression but still restrict the internal expressions to start with $.variable, like we do today for heredocs.
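
A rough sketch of the decision such a scanner node would have to make, assuming it only needs one character of lookahead after the brace. A real tree-sitter external scanner is written in C or C++; the plain JavaScript below only mirrors the logic, the name `isEmbeddedBraceStart` is hypothetical, and escape handling is deliberately left out.

```javascript
// Decide whether the `{` at `pos` opens an embedded braced expression:
// it must be followed immediately by `$`, with no whitespace in between.
function isEmbeddedBraceStart(content, pos) {
  return content[pos] === '{' && content[pos + 1] === '$';
}

console.log(isEmbeddedBraceStart('{$var->prop}', 0)); // true
console.log(isEmbeddedBraceStart('{ $var}', 0));      // false: space after {
console.log(isEmbeddedBraceStart('{literal}', 0));    // false: no $ after {
```

With the brace recognized this way, the expression inside could reuse the existing call/subscript/selection rules, since the lookahead already guarantees that it starts with a variable.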

Fixing custom call/subscript/selection definitions

I don't see a way to do this (yet) that doesn't require some gnarly copy-pasting of existing definitions and modifying them further to restrict them to the embedded braced expression case.