tree-sitter / tree-sitter-javascript

Javascript grammar for tree-sitter
MIT License
323 stars 109 forks source link

Operators don’t have productions #7

Closed robrix closed 6 years ago

robrix commented 8 years ago

rel_op &c. don’t have productions for the actual operators, which means that you have to inspect the source to figure out which operation the node implies.

maxbrunsfeld commented 8 years ago

I think I did it this way as something of a premature optimization, to make the parse table smaller. It seems reasonable to just have a production for every single operator though. It will make the parse table a bit bigger, but I'm not sure how significant the change will be.

maxbrunsfeld commented 8 years ago

I just thought about this some more. I think the current parse tree should give you all of the information you need for semantic analysis, because there are already distinct node types for each operator. To access the operators themselves, you'd need to use the ts_node_child API, as opposed to ts_node_named_child.

For example, if you were building some AST structure using the JS API, you could do this:

function visitRelationalExpression(node) {
  assert(node.namedChildren.length == 2);
  assert(node.children.length == 3);

  // In C this would be `ts_node_type(ts_node_child(node, 1), document)`
  switch (node.children[1].type) {
    case '!==':
      return NotEquals(
        visitExpression(node.namedChildren[0]),
        visitExpression(node.namedChildren[1])
      );

    case '===':
      return EqualsExpression(
        visitExpression(node.namedChildren[0]),
        visitExpression(node.namedChildren[1])
      );

    // ...
  }
}

Right now, there are some expression types like math_op that can be unary or binary, so you would need to check their child count first, but that doesn't seem too inconvenient.

Given that there is already a distinct node type for each operator, I don't really think it makes sense to have another parallel set of node types for the expressions containing each operator. @robrix thoughts?

robrix commented 7 years ago

If we don’t have named productions for the operators, we end up having to special case a bunch of stuff to get the tokens—“if the language is JS and the production name is x then look at the second token, if it’s JS & the production name is y, then look at the first”, etc. 😕

I’d much rather the grammar just give us a named production for it so we can pull the operators in without coupling any more logic to the grammar’s structure.

The alternative on our end is to pull in every child, named or otherwise, regardless of where it occurs in the grammar, but we’re concerned about the performance implications of adding that many more nodes to the tree. Worse, there’s the semantic concerns of which nodes actually have any real significance to our analyses, and that’s much harder for us to tell in a language-agnostic manner. So this would really just move the special-casing around, while also slowing us down.

The rule of thumb I’d like to propose is that every token of semantic significance should be named.

maxbrunsfeld commented 6 years ago

It seems like the current paradigm (keeping the parent nodes generic and identifying the operators by looking at the unnamed nodes) is working ok for now, and it allows the parsers to have a much smaller static memory footprint.

We've made it easier now by replacing all the *_operator rules with just binary_expression and unary_expression so that it's always clear which child is the operator. @robrix You ok w/ closing this one out?