tweag / topiary

https://topiary.tweag.io/
MIT License
575 stars 29 forks source link

Handling comments correctly #430

Open Xophmeister opened 1 year ago

Xophmeister commented 1 year ago

:bulb: This is not really a bug or a feature, but a consequence that we should probably address.

e.g., #425 is a manifestation of this.

Is your feature request related to a problem? Please describe. Comments and other nodes that can appear (almost) anywhere in the syntax -- so called "extra nodes" in Tree-sitter -- break simple queries that target syntactic structures for formatting. For example, if you have a formatting query that looks like this:

(if_statement
  "then" @append_softline
  .
  (clause)
)

Then this won't match and therefore won't have formatting applied:

if something then # here's a comment
  do_something

:bulb: This example is illustrative, to demonstrate the problem. In this case, the query could be trivially refactored to avoid the problem. However, this is not always possible.

Describe alternatives you've considered There are a few possible solutions to this problem. Including, but not necessarily limited to:

  1. gofmt splits the input into two streams -- comments and code -- formats them independently, then stitches them back together.
  2. The query engine in Tree-sitter could be enhanced to, optionally, ignore extra nodes in matches. (e.g., The above example would match, despite the interposed comment.)
  3. Queries can be rewritten with (comment)? (or equivalent) in appropriate places. This can get messy.
Xophmeister commented 1 year ago

Note: Option 3 does not work in general and requires more subtlety. For example, if you have a node that you want followed by a softline, you can't just add a (comment)? node:

(
  (my_node) . (comment)? @append_empty_softline
)

This will match whether the comment is present or not, but the @append_empty_softline capture name will only apply when the comment exists.

Applying the capture name to the group should solve this problem, but -- anecdotally and for reasons I've yet to understand -- it doesn't always work:

(
  (my_node)
  .
  (comment)?
) @append_empty_softline

Presumably you wouldn't want this behaviour, anyway, because the (comment) will have its own formatting. As such, a @do_nothing directive would probably be the correct approach:

(
  (my_node) @append_empty_softline
  .
  (comment)? @do_nothing
)

This comes at the potential expense of factoring out other related capture names that you don't want to have cancelled. (e.g., If you have indentation directives in this query, you'd have to move them elsewhere, etc.)

torhovland commented 1 year ago

Yes, but given that we don't have a more advanced solution available to us for the time being, the (comment)? @do_nothing pattern is the way to go.

Xophmeister commented 1 year ago

Indeed. I'm just documenting for maintainership's sake.

torhovland commented 1 year ago

I guess I was triggered by "Option 3 does not work in general". Perhaps it's better to rewrite option 3 slightly, then 🙂

aspiwack commented 1 year ago

Does it mean that we have to use the idiom

(
  (my_node) @append_empty_softline
  .
  (comment)? @do_nothing
)

everywhere where a newline is inserted?

torhovland commented 1 year ago

Pretty much, yes.

Although that doesn't have to be quite so excessive as it sounds, because we can have a few patterns that take care of a lot of those, such as:

(
  [
    (attribute_item)
    (enum_item)
    (extern_crate_declaration)
    (function_item)
    (impl_item)
    (let_declaration)
    (mod_item)
    (struct_item)
    (type_item)
    (use_declaration)
  ] @append_hardline
  .
  [
    (block_comment)
    (line_comment)
  ]* @do_nothing
)
Xophmeister commented 1 year ago

@aspiwack Potentially. However, there are times where a query can be rewritten to avoid it. There may also be times when you want the newline, regardless of a comment.

nbacquey commented 1 year ago

263 is another example of an issue for which the proper solution is very convoluted, because of the potential presence of comments and attributes (which are another type of extra nodes in OCaml).

nbacquey commented 1 year ago

I've experimented a bit to solve this issue. Here are my observations so far:

I would like your input before proceeding further, in particular:

Xophmeister commented 1 year ago

I haven't checked your code, so ignore me if I'm stating the obvious! Can you edit the source backwards -- i.e., starting at the comment with the greatest byte offset -- to avoid having to reparse the tree on every edit?

torhovland commented 1 year ago

I don't think it's possible: the tree-sitter API doesn't allow this kind of modifications on a query.

But in theory, it should be possible to parse the query using the tree-sitter grammar, make modifications to the syntax tree, and generate a new query file, right?

Not saying this is necessarily the approach we want to take, though.

torhovland commented 1 year ago

Perhaps there is a fourth possibility:

Introduce a pre-processing step to the query files to allow for something macro-like before feeding them to tree-sitter. So rather than having to add arcane variations of (comment)? @do_nothing everywhere, we could just add %allow_comment wherever necessary. We may still need to allow for variations of those, unless we can make the pre-processor smart enough to always understand the correct context, so it is unclear how much it would help.

nbacquey commented 1 year ago

I haven't checked your code, so ignore me if I'm stating the obvious! Can you edit the source backwards -- i.e., starting at the comment with the greatest byte offset -- to avoid having to reparse the tree on every edit?

I didn't know if tree.edit() could support multiple edits before reparsing, but apparently it does! I applied your suggestion to apply all edits at once, and I got the benchmark down to reasonable levels.