Declarative treesitter-based indentation rules

lionel- commented 5 months ago

As the VS Code team is considering integrating tree-sitter in the UI thread (https://github.com/microsoft/vscode/pull/161479 and more recently https://github.com/microsoft/vscode/issues/207416) to potentially improve tokenization (https://github.com/microsoft/vscode/issues/77140) and syntax highlighting (https://github.com/microsoft/vscode/issues/50140) I'd like to suggest that indentation is another area that could be drastically improved by tree-sitter.

Currently language extensions have these possibilities when it comes to defining indentation, all with their drawbacks:

Define indentationRules and onEnterRules regexps. Because of the limitations of regexps to match programming language constructs this produces surprising indentation in many edge cases that are often not possible to fix in full generality. The python extension for instance has struggled with these quite a bit: https://github.com/microsoft/vscode-python/issues/481
Override the Enter key to bypass the regexp-based indentation, e.g.: https://marketplace.visualstudio.com/items?itemName=KevinRose.vsc-python-indent

The extension is then able to maintain knowledge of the code structure and implement an ad hoc engine.

This works but only when indenting after Enter is pressed, the indentation command is not called when moving lines up and down or pasting with autoindent. Furthermore, if the extension host is swamped there might be a noticeable delay between pressing enter and inserting the newline and indent in the editor.
Correct the indentation with a format-on-type provider for \n. This is the approach taken for python, with pylance now providing indentation corrections via format-on-type: https://github.com/microsoft/pylance-release/issues/1613.

I call these "corrections" because the indentation rules of the renderer thread are applied first and are immediately visible. The format-on-type edits are applied second and noticeable to the user if the extension host is slow to respond. That's not too bad but it's also possible for the format request to be cancelled/ignored if the user types too fast and invalidates the document version sent to the provider.

Indenting in both the renderer process and in edit corrections via format-on-type is a good compromise overall but it requires non-trivial analysis and behaviour implemented in extension code. Allowing extensions to provide smarter indentation rules aware of the tree structure of code (and with provisions for vertical alignment, see https://github.com/microsoft/vscode/issues/66235) would make the initial indentation in the UI thread much more accurate in general.

That's where tree-sitter comes in. Both Emacs and nvim provide an indentation engine specified by declarative tree-sitter queries:

Similarly, a tree-sitter declarative language for indentation in VS Code could solve the issue of complex tree-based indentation running entirely in the main process without running extension code.

alexr00 commented 5 months ago

@aiday-mar were you just working on something like this?

aiday-mar commented 5 months ago

Hi @lionel- @alexr00, thank you for posting this issue. Indeed, we are currently discussing using tree-sitter to perform the indentation. For the moment, we are trying to improve the indentation regex rules and explore the limits of these. As you mentioned, there are cases where regex indentation rules perform poorly mostly because the indentation code is not aware of the context. This is why tree-sitter could be a good alternative. One issue with using tree-sitter is that it would be computationally more expensive to do the indentation, and the indentation would be computed asynchronously (whereas currently the computation is synchronous). This implies a delay in the indentation just like with format-on-type. We will continue in-team discussion about this feature request.

lionel- commented 5 months ago

One issue with using tree-sitter is that it would be computationally more expensive to do the indentation, and the indentation would be computed asynchronously (whereas currently the computation is synchronous). This implies a delay in the indentation just like with format-on-type. We will continue in-team discussion about this feature request.

Interesting. I was hoping the incremental reparsing of TS would be fast enough to do everything synchronously. IIUC the asynchrony would be contained to the renderer process and would not be impacted by a slowed down extension host? Hopefully the delay will not be noticeable most of the time.

we are trying to improve the indentation regex rules and explore the limits of these. As you mentioned, there are cases where regex indentation rules perform poorly mostly because the indentation code is not aware of the context. This is why tree-sitter could be a good alternative.

I'd like to mention here an important use case for us (the same discussed in https://github.com/microsoft/vscode/pull/136593#issuecomment-2034162095): chains of operator-separated expressions like a && b && c. In the language R these chains (often called pipelines) are very common:

# ggplot2 pipeline
mtcars |>
  ggplot(aes(
    disp,
    drat
  )) +
  geom_point(
    aes(colour = cyl)
  )

# dplyr pipeline
starwars |>
  group_by(species) |>
  summarise(
    n = n(),
    mass = mean(mass, na.rm = TRUE)
  ) |>
  filter(
    n > 1,
    mass > 50
  )

We are currently using OnEnter rules designed to avoid a staircase indentation effect like:

# Unwanted staircase
foo() +
  bar() +
    baz()

# Expected indentation
foo() +
  bar() +
  baz()

However these rules can only really handle one-liners, they start to fail when the pipeline elements are laid out on multiple lines:

foo() +
  bar(
    x
  ) +
    baz(
      y
    )

Even if we could match across more lines, I don't think we could really express what we need in full generality with regular expressions.

With nvim tree-sitter queries, we can express this alignment rule simply with:

[
  (binary_operator)
] @indent.begin

This should work with homogeneous left-associative operators because in that case, since we indent from the right-hand side, the parent of the current node will be the top most node in the chain of operators.

For more complex cases such as righ-associative operators, we can use the Emacs approach of expressing indentation with a matcher and an anchor. First here is how the same nvim rule as above is expressed:

((parent-is "binary_operator") parent-bol r-ts-mode-indent-offset)

This matches a child of a binary operator and indents from the parent node. To make sure that we always indent from the topmost node of a chain of binary operations, we can modify as follows:

((parent-is "binary_operator") (ancestor-while "binary_operator") r-ts-mode-indent-offset)

This uses this anchor which climbs a tree as long as the type matches and selects the start position of the last node that matched:

(defun ancestor-while (type)
  (lambda (node parent _bol &rest _)
    (while (and parent (string-match-p type (treesit-node-type parent)))
      (setq node parent)
      (setq parent (treesit-node-parent parent)))
    (treesit-node-start node)))

This example illustrates how TS rules can express indentation behaviour that approaches based on regular expressions struggle with. We will need a sufficiently flexible API to express more complex indent rules but it should be easy to extend the API with additional matchers and anchors as needed.

aiday-mar commented 5 months ago

Hi @lionel- thank you for sharing these interesting thoughts. Unfortunately, as of now I am not aware if/when we will consider changing the indentation API. I will let you know if/when this happens, and take your ideas into account.

As concerning the asynchronicity, I believe the tree parsing would be done in a separate worker thread. It could be the delay is small and therefore not noticeable as you mentioned - we would have to test to check this.

microsoft / vscode

Declarative treesitter-based indentation rules #208985