tree-sitter / tree-sitter-cli

CLI tool for creating and testing tree-sitter parsers
MIT License

Writing external rules #41

Closed: Aerijo closed this issue 5 years ago

Aerijo commented 6 years ago

I've been avoiding this for a while, but it seems like I need it now.

The question is: how do I write external rules? I tried looking at the JS and CLI test scanners, but learning anything from them is slow and difficult. It seems like several functions are required, and that the connection to grammar.js comes from stripping outer underscores and capitalising the rule name. I would like to know whether a tool was used to generate a template for these files, and/or get some guidance on what a minimal definition looks like and what I can do with the lexer.

For what it's worth, my issue is trying to get this working:

A(x) # a function
A x . P(x) # a forall statement

Using A to represent ∀ is required, and I would like to avoid blacklisting function names. My idea was to use an external scanner to detect whether the . is the next token after the variable. The relevant definitions are as follows:

function: $ => prec.right(7, seq($.function_name, "(", $._term, ")")),

function_name: $ => /[A-Z]\w*/,

forall: $ => prec.right(2, seq($._forall_operator, $.variable, $._universal_sep, $._term)),

_forall_operator: $ => choice("A", "∀"),
_universal_sep: $ => "."

And this is the result of parsing A(x):

(block [0, 0] - [1, 0]
  (ERROR [0, 0] - [0, 4]
    (ERROR [0, 1] - [0, 2])
    (variable [0, 2] - [0, 3])))

It works as expected if function is defined explicitly as prec.right(7, seq("A", "(", $._term, ")")). The issue appears when the function name is allowed to be any word starting with a capital letter.

Aerijo commented 6 years ago

OK, it looks like replacing "A" with /A/ in _forall_operator: $ => choice("A", "∀") 'solves' the motivating issue. Coincidentally related. I'd still like to learn about externals though :)

maxbrunsfeld commented 6 years ago

Yeah, sorry that external scanners are not documented at all yet.

the connection to grammar.js...

The interface between the grammar and the scanner is based on the order of the entries in the ‘externals’ array matching the order of the entries in the C enum. There’s no other magic to it, so we just write them by hand.
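To make the ordering concrete, here is a minimal sketch of the grammar side. The language name `logic`, the choice of `_forall_operator` and `_universal_sep` as external tokens, and the simplified `block` and `variable` rules are assumptions made up for illustration, not something taken from a real grammar. The fallbacks at the top exist only so the file also loads in plain Node; under the CLI, `grammar` and `seq` are globals.

```javascript
// Sketch only: hypothetical grammar.js showing how `externals` is declared.
// Under the tree-sitter CLI these DSL functions are globals; the fallbacks
// below just let the file be loaded and inspected with plain Node.
const grammar = globalThis.grammar ?? ((definition) => definition);
const seq = globalThis.seq ?? ((...members) => ({ type: "SEQ", members }));

const logicGrammar = grammar({
  name: "logic", // hypothetical language name

  // The ORDER of this array is the whole interface to the scanner.
  // The hand-written C scanner declares an enum in the same order:
  //   enum TokenType { FORALL_OPERATOR, UNIVERSAL_SEP };
  // Inside tree_sitter_logic_external_scanner_scan(payload, lexer, valid_symbols),
  // valid_symbols[FORALL_OPERATOR] says whether that token is acceptable in
  // the current parse state, and the scanner reports a match by setting
  // lexer->result_symbol = FORALL_OPERATOR and returning true.
  externals: $ => [
    $._forall_operator, // index 0 <-> FORALL_OPERATOR
    $._universal_sep,   // index 1 <-> UNIVERSAL_SEP
  ],

  rules: {
    // Simplified stand-ins for the real rules, for illustration only.
    block: $ => $.forall,
    forall: $ => seq($._forall_operator, $.variable, $._universal_sep, $.variable),
    variable: $ => /[a-z]\w*/,
  },
});

module.exports = logicGrammar;
```

The enum names on the C side are arbitrary; only their position has to line up with the array. Besides the scan function, the scanner file is also expected to define create, destroy, serialize and deserialize functions with the same `tree_sitter_logic_external_scanner_` prefix.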

As for your case, this problem seems somewhat similar to a parsing problem in the JavaScript grammar:

In a class or object literal, certain words like ‘async’, ‘get’, and ‘set’ can be used as keywords, but they can also be used as normal identifiers (property or method names).

We have a rule in the JavaScript grammar called ‘_reserved_identifier’ that helps deal with this. Essentially, we have to explicitly allow those strings as property names (as alternative choices in place of the ‘identifier’) because they will be tokenized as their own keywords, not identifiers. Take a look at the JS grammar and see if it helps; we can talk it through synchronously on Tuesday if that would be useful.

Aerijo commented 6 years ago

@maxbrunsfeld Tuesday?

maxbrunsfeld commented 6 years ago

Just because Monday is a holiday, so I won’t be online.

maxbrunsfeld commented 6 years ago

I would think that switching the forall rule to a regex would just reverse the problem: previously an ‘A’ would always tokenize as the forall symbol, and now it will always tokenize as a function name. I could be missing something though.

Aerijo commented 6 years ago

@maxbrunsfeld Yup. Running into that now.

Aerijo commented 6 years ago

FYI, the _reserved_function_name trick seems to be working. I've put in a few tests now, and they all behave as desired.
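For anyone landing here later, this is a sketch of how the trick might be applied to the rules from the first comment, modelled on the ‘_reserved_identifier’ approach described above. The exact rule that was used is not shown in this thread, so treat the shape of `_reserved_function_name` as an assumption; the `block`, `_term` and `variable` rules and the name `logic` are also placeholders, since the originals were not posted. The fallbacks at the top only exist so the file loads in plain Node.

```javascript
// Sketch only: a possible `_reserved_function_name` setup, not the exact
// grammar from this thread. The DSL functions are globals under the
// tree-sitter CLI; the fallbacks let plain Node load the file.
const node = (type) => (...members) => ({ type, members });
const grammar = globalThis.grammar ?? ((definition) => definition);
const seq = globalThis.seq ?? node("SEQ");
const choice = globalThis.choice ?? node("CHOICE");
const alias = globalThis.alias ?? ((content, value) => ({ type: "ALIAS", content, value }));
const prec = globalThis.prec ?? {
  right: (value, content) => ({ type: "PREC_RIGHT", value, content }),
};

const logicGrammar = grammar({
  name: "logic", // placeholder name

  rules: {
    // Placeholder top-level rules; the real ones were not posted.
    block: $ => $._term,
    _term: $ => choice($.function, $.forall, $.variable),
    variable: $ => /[a-z]\w*/,

    // "A" is tokenized as its own string token, so it has to be allowed
    // explicitly wherever a function name may appear, then aliased so it
    // still shows up as a function_name node in the tree.
    function: $ => prec.right(7, seq(
      choice($.function_name, alias($._reserved_function_name, $.function_name)),
      "(", $._term, ")"
    )),
    function_name: $ => /[A-Z]\w*/,
    _reserved_function_name: $ => "A",

    forall: $ => prec.right(2, seq($._forall_operator, $.variable, $._universal_sep, $._term)),
    _forall_operator: $ => choice("A", "∀"),
    _universal_sep: $ => ".",
  },
});

module.exports = logicGrammar;
```

With this shape, the token after the A is what separates the two readings: an opening parenthesis leads to a function, and a variable leads to a forall. No external scanner is involved.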