Xophmeister commented 1 year ago

:bulb: This is not really a bug or a feature, but a consequence that we should probably address.

e.g., #425 is a manifestation of this.

Is your feature request related to a problem? Please describe. Comments and other nodes that can appear (almost) anywhere in the syntax -- so called "extra nodes" in Tree-sitter -- break simple queries that target syntactic structures for formatting. For example, if you have a formatting query that looks like this:

(if_statement
  "then" @append_softline
  .
  (clause)
)

Then this won't match and therefore won't have formatting applied:

if something then # here's a comment
  do_something

:bulb: This example is illustrative, to demonstrate the problem. In this case, the query could be trivially refactored to avoid the problem. However, this is not always possible.

Describe alternatives you've considered There are a few possible solutions to this problem. Including, but not necessarily limited to:

gofmt splits the input into two streams -- comments and code -- formats them independently, then stitches them back together.
The query engine in Tree-sitter could be enhanced to, optionally, ignore extra nodes in matches. (e.g., The above example would match, despite the interposed comment.)
Queries can be rewritten with (comment)? (or equivalent) in appropriate places. This can get messy.

Xophmeister commented 1 year ago

Note: Option 3 does not work in general and requires more subtlety. For example, if you have a node that you want followed by a softline, you can't just add a (comment)? node:

(
  (my_node) . (comment)? @append_empty_softline
)

This will match whether the comment is present or not, but the @append_empty_softline capture name will only apply when the comment exists.

Applying the capture name to the group should solve this problem, but -- anecdotally and for reasons I've yet to understand -- it doesn't always work:

(
  (my_node)
  .
  (comment)?
) @append_empty_softline

Presumably you wouldn't want this behaviour, anyway, because the (comment) will have its own formatting. As such, a @do_nothing directive would probably be the correct approach:

(
  (my_node) @append_empty_softline
  .
  (comment)? @do_nothing
)

This comes at the potential expense of factoring out other related capture names that you don't want to have cancelled. (e.g., If you have indentation directives in this query, you'd have to move them elsewhere, etc.)

torhovland commented 1 year ago

Yes, but given that we don't have a more advanced solution available to us for the time being, the (comment)? @do_nothing pattern is the way to go.

Xophmeister commented 1 year ago

Indeed. I'm just documenting for maintainership's sake.

torhovland commented 1 year ago

I guess I was triggered by "Option 3 does not work in general". Perhaps it's better to rewrite option 3 slightly, then 🙂

aspiwack commented 1 year ago

Does it mean that we have to use the idiom

(
  (my_node) @append_empty_softline
  .
  (comment)? @do_nothing
)

everywhere where a newline is inserted?

torhovland commented 1 year ago

Pretty much, yes.

Although that doesn't have to be quite so excessive as it sounds, because we can have a few patterns that take care of a lot of those, such as:

(
  [
    (attribute_item)
    (enum_item)
    (extern_crate_declaration)
    (function_item)
    (impl_item)
    (let_declaration)
    (mod_item)
    (struct_item)
    (type_item)
    (use_declaration)
  ] @append_hardline
  .
  [
    (block_comment)
    (line_comment)
  ]* @do_nothing
)

Xophmeister commented 1 year ago

@aspiwack Potentially. However, there are times where a query can be rewritten to avoid it. There may also be times when you want the newline, regardless of a comment.

nbacquey commented 1 year ago

263 is another example of an issue for which the proper solution is very convoluted, because of the potential presence of comments and attributes (which are another type of `extra` nodes in OCaml).

nbacquey commented 1 year ago

I've experimented a bit to solve this issue. Here are my observations so far:

I investigated the implementation of Option 3 (rewrite all queries with (comment)? everywhere). I was interested to see if this was doable programmatically, i.e. by parsing the query, then enriching it later with extra (comment)?. I don't think it's possible: the tree-sitter API doesn't allow this kind of modifications on a query.
I experimented with Option 1 (processing comments in a separate stream). This seems doable, but costly. My experiments are here: https://github.com/tweag/topiary/tree/nb/engine/split_comment_stream The only option I've found to edit an existing CST and remove comment nodes is through the tree.edit() and parser.parse(input, old_tree) functions, as shown here. Currently, the code in my branch removes all comments from the input and discards them, by repeating the following until no comment is left:
- Find a comment in the CST by inspecting recursively node.is_extra() and node.kind()
- Edit out the comment by calling tree.edit() and removing the bytes from the input
- Compute a new CST with parser.parse(edited_input, edited_old_tree)
This simple operation of removing all comments from the input already has a non-negligible cost. This is an extract of the output of cargo bench --profile release, on main and on my branch respectively:
```
branch main:
format_ocaml            time:   [627.69 ms 630.84 ms 633.84 ms]
                      change: [-3.9578% -3.4334% -2.9099%] (p = 0.00 < 0.05)
                      Performance has improved.

branch nb/engine/split_comment_stream:
format_ocaml            time:   [821.88 ms 823.24 ms 824.73 ms]
                      change: [+29.836% +30.499% +31.193%] (p = 0.00 < 0.05)
                      Performance has regressed.
```
The 30% performance drop is already noticeable, even without doing anything with the comments I extracted.

I would like your input before proceeding further, in particular:

Have I missed a better way to extract the comments?
Do you think this feature is worth the increase in processing time?

Xophmeister commented 1 year ago

I haven't checked your code, so ignore me if I'm stating the obvious! Can you edit the source backwards -- i.e., starting at the comment with the greatest byte offset -- to avoid having to reparse the tree on every edit?

torhovland commented 1 year ago

I don't think it's possible: the tree-sitter API doesn't allow this kind of modifications on a query.

But in theory, it should be possible to parse the query using the tree-sitter grammar, make modifications to the syntax tree, and generate a new query file, right?

Not saying this is necessarily the approach we want to take, though.

torhovland commented 1 year ago

Perhaps there is a fourth possibility:

Introduce a pre-processing step to the query files to allow for something macro-like before feeding them to tree-sitter. So rather than having to add arcane variations of (comment)? @do_nothing everywhere, we could just add %allow_comment wherever necessary. We may still need to allow for variations of those, unless we can make the pre-processor smart enough to always understand the correct context, so it is unclear how much it would help.

nbacquey commented 1 year ago

I haven't checked your code, so ignore me if I'm stating the obvious! Can you edit the source backwards -- i.e., starting at the comment with the greatest byte offset -- to avoid having to reparse the tree on every edit?

I didn't know if tree.edit() could support multiple edits before reparsing, but apparently it does! I applied your suggestion to apply all edits at once, and I got the benchmark down to reasonable levels.

tweag / topiary

Handling comments correctly #430

263 is another example of an issue for which the proper solution is very convoluted, because of the potential presence of comments and attributes (which are another type of `extra` nodes in OCaml).

tweag / topiary

Handling comments correctly #430

263 is another example of an issue for which the proper solution is very convoluted, because of the potential presence of comments and attributes (which are another type of extra nodes in OCaml).

263 is another example of an issue for which the proper solution is very convoluted, because of the potential presence of comments and attributes (which are another type of `extra` nodes in OCaml).