kwshi / tree-sitter-hy

Tree-sitter grammar for Hy, a Lisp-ification of Python.

General Suggestions #1

Open alexmozaidze opened 9 months ago

alexmozaidze commented 9 months ago

Warning: tree-sitter-fennel was reworked some time ago, and these suggestions do not take that into account. They work well for lightweight parsers but not as well for full-blown ones. See the update below.

Henlo! I made tree-sitter-fennel and I have some suggestions regarding tree-sitter-hy.

Explicit whitespace control

In Lisp, whitespace can only occur in the root program or in list-likes (arrays, tables, etc.). Controlling whitespace manually (AKA emptying extras) has the benefit of making reader macros easy to implement. You define the list-likes once and never touch them again.

tree-sitter-clojure and tree-sitter-fennel manage whitespace manually, but if you want inspiration from a grammar that doesn't, take a look at tree-sitter-janet-simple.
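A rough sketch of what emptying extras could look like in grammar.js. The grammar name, the rule names (`_form`, `_ws`) and the symbol regex are placeholders made up for illustration, not taken from any of the grammars mentioned above:

```javascript
// Sketch only: rule names and regexes are placeholders, not from an existing grammar.
const whitespaceSketch = {
  name: "hy_whitespace_sketch",

  // Empty extras: Tree-sitter no longer skips whitespace implicitly.
  extras: $ => [],

  rules: {
    // Whitespace is allowed at the top level...
    program: $ => repeat(choice($._form, $._ws)),

    _form: $ => choice($.list, $.symbol),

    // ...and inside list-likes, and nowhere else.
    list: $ => seq("(", repeat(choice($._form, $._ws)), ")"),

    symbol: $ => /[^\s()]+/,
    _ws: $ => /\s+/,
  },
};

// The Tree-sitter CLI provides `grammar` as a global when it loads grammar.js.
if (typeof grammar === "function") {
  module.exports = grammar(whitespaceSketch);
}
```

Because whitespace is only mentioned in `program` and `list`, any other rule (a reader macro prefix, for example) is matched with no gap allowed between its parts.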

Override symbols

Lisps often have characters that aren't allowed to appear in symbols, but with exceptions! Those exceptions can be expressed as a separate token that is aliased to symbol:

_special_override_symbols: $ => alias(choice(
    "#",
    // ...
), $.symbol),

In some cases, the override symbols may match where they shouldn't (for example inside a keyword such as :#). In that case, you can either apply dynamic precedence together with prec.right (for additional lookahead), or put the special override symbols in an array and reuse that array in the token where the problem occurs. The second option is more hacky, but avoids dynamic precedence.

For the second approach, take a look at tree-sitter-fennel.
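One possible reading of the array approach, as a minimal sketch. The names `OVERRIDE_SYMBOLS`, `SYMBOL_CHAR`, `escapeRegex` and `KEYWORD` are hypothetical, the character class is a placeholder, and this is not necessarily how tree-sitter-fennel does it:

```javascript
// Hypothetical list of characters that are normally excluded from symbols
// but count as a symbol when they stand alone.
const OVERRIDE_SYMBOLS = ["#"];

// Placeholder for "ordinary" symbol characters; a real grammar would differ.
const SYMBOL_CHAR = /[^\s()\[\]{}"'`;#:]/;

// Escape a literal string so it can be embedded in a regular expression.
const escapeRegex = s => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Keyword = ":" followed by either a run of ordinary symbol characters
// or exactly one override symbol, so ":#" is a single token.
const KEYWORD = new RegExp(
  ":(?:" +
    [SYMBOL_CHAR.source + "+", ...OVERRIDE_SYMBOLS.map(escapeRegex)].join("|") +
    ")"
);

// Inside grammar.js the same array could feed both rules, for example:
//   keyword: $ => token(KEYWORD),
//   _special_override_symbols: $ => alias(choice(...OVERRIDE_SYMBOLS), $.symbol),
```

The point of the shared array is that adding a new override symbol updates the keyword token and the aliased symbol rule together.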

Node naming

Symbols used inside dotted symbols should be aliased to (symbol_fragment) or something similar, so that standalone symbols and multi-symbols are distinguishable at the node level. This fixes a bug where the expression (foo.bar) is colored as a variable instead of a function call, because a query like (symbol) @variable also matches the symbols inside dotted symbols.
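A minimal sketch of how that aliasing might look as a fragment of the rules object in grammar.js. `_symbol_token` and `multi_symbol` are made-up names, the regex is a placeholder, and whitespace handling around the dots is left out:

```javascript
// Sketch only: meant to be merged into the `rules` object of a grammar.js.
const namingRules = {
  // Hidden token shared by both public node kinds (placeholder regex).
  _symbol_token: $ => /[^\s().]+/,

  // A standalone symbol such as `foo` shows up as (symbol).
  symbol: $ => $._symbol_token,

  // A dotted symbol such as `foo.bar` shows up as
  // (multi_symbol (symbol_fragment) (symbol_fragment)).
  multi_symbol: $ => seq(
    alias($._symbol_token, $.symbol_fragment),
    repeat1(seq(".", alias($._symbol_token, $.symbol_fragment))),
  ),
};
```

With a shape like this, a query such as (symbol) @variable would no longer capture the parts of foo.bar, since those are (symbol_fragment) nodes.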

Fields

Use fields in as many places as you can. This may not seem like much, but fields are the only children whose absence you can match in a query. See the "Negated Fields" section at https://tree-sitter.github.io/tree-sitter/using-parsers#query-syntax

Queries, queries, queries...

These are just a few suggestions on querying. I use Neovim, so everything I suggest here is aimed at writing highlight queries for Neovim.

Key/value matching in list-likes

If you ever have a case of a macro/form using key/value pairs inside continuous structures, you can take a look at the following query template: https://github.com/tree-sitter/tree-sitter/discussions/3043

Keyword highlight query ordering

Capture the keywords themselves in one query, and expand on the individual elements of each form in a separate one.

((symbol) @keyword
  (#any-of? @keyword "fn" "defun"))

;; ...

(expression
  .
  (symbol) @keyword
  (#any-of? @keyword "fn" "defun") ; there will be no highlight overlap
  .
  ;; ...
  )

This will ensure that there is no redefinition of highlights and the ordering of said highlights stays consistent.


Fennel and Hy seem to have a lot of similarities, so I reckon a lot of stuff from tree-sitter-fennel can be reused in tree-sitter-hy.

kwshi commented 9 months ago

Whoa, thanks so much for taking the time to write out these suggestions! I'm swamped with a bunch of schoolwork this week so I won't be able to work on this immediately, but when I get a chance I'll definitely sit down and go through these.

alexmozaidze commented 5 months ago

I've reworked tree-sitter-fennel to use an external scanner instead of manual whitespace, and have some things to add now that I'm more experienced:

Manual whitespace is good, but may be tedious. If you are planning to make a full-blown parser with support for all the built-in constructs in the language (let, for, etc.), then it's best to do so with whitespace managed by Tree-sitter.

You can easily implement reader macros with a simple external scanner. In a nutshell, it just checks whether there is whitespace right after the reader macro character (#reader-macro vs # not-reader-macro).

There may be some edge cases like :this-is-a-keyword#this-is-a-reader-macro, but they are not that difficult to deal with once you understand the basic concept. Experiment a bit and find what works through trial and error. You could also look at tree-sitter-fennel's scanner.c to see how I solved these edge cases.
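The basic check can be modeled in a few lines. This is only a JavaScript model of the logic, with a hypothetical function name; a real external scanner is C code driven by Tree-sitter's lexer, and the edge cases mentioned above are ignored here:

```javascript
// Model of the check only, not an actual Tree-sitter external scanner.
// Returns true when the "#" at `index` is directly followed by a
// non-whitespace character, i.e. it looks like a reader macro prefix.
function isReaderMacroAt(source, index) {
  if (source[index] !== "#") return false;
  const next = source[index + 1];
  // End of input or whitespace right after "#" means it is a plain "#".
  return next !== undefined && !/\s/.test(next);
}

console.log(isReaderMacroAt("#reader-macro", 0));      // true
console.log(isReaderMacroAt("# not-reader-macro", 0)); // false
```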

Also, I wrote a document detailing some reasons why it may be preferable to bake forms into the grammar instead of relying on queries.