tree-sitter / tree-sitter-rust

Rust grammar for tree-sitter
MIT License

Designate doc comments #99

Closed resolritter closed 2 years ago

resolritter commented 3 years ago

closes #88

resolritter commented 3 years ago

My first attempted implementation used regular expressions, but unfortunately the examples were not passing (https://travis-ci.org/github/tree-sitter/tree-sitter-rust/builds/753422004#L413). My regular expressions seemingly had trouble capturing the comment prefixes: the third character after the two leading slashes is what decides between a line comment and a doc comment, so I thought it made more sense to use externals for this in order to leverage lexer->lookahead probes.

Therefore, I've reworked the approach to use externals (commented in https://github.com/tree-sitter/tree-sitter-rust/pull/99/commits/be26083df01c81079aa706bf06ffd511825c6011).

resolritter commented 3 years ago

Since doc comments are generally multi-line, I've made contiguous /// lines parse as a single group instead of one node per line. That should make it easier to extract the whole block.

It'd be useful to have a node without the leading ///, leaving only the inner content, so that the Markdown parser could be injected... I don't know if it's possible to do that currently.

Luni-4 commented 3 years ago

@resolritter

Is there any update on this one?

dodomorandi commented 3 years ago

What's the current status on this? Is any help needed?

bestouff commented 2 years ago

Ping ?

pickfire commented 2 years ago

What about # within doc comments' code blocks?

resolritter commented 2 years ago

I do not have merge access to this repository. This is not the only PR in review limbo right now.

aryx commented 2 years ago

@dcreager is there someone maintaining tree-sitter-rust at github right now?

archseer commented 2 years ago

It would be great if the Rust grammar had more maintainers (for example, tree-sitter-haskell has been doing great since tek took the reins). This, #105, and #85 are all PRs I've been looking forward to.

aryx commented 2 years ago

@resolritter can you rebase on the latest and I'll approve the PR.

resolritter commented 2 years ago

FYI https://github.com/tree-sitter/tree-sitter-rust/issues/126

resolritter commented 2 years ago

The current implementation is incomplete because Rust requires exactly three slashes for doc comments; with four or more, it reverts to a normal line comment. This is explained in the reference documentation for comments: https://doc.rust-lang.org/reference/comments.html#examples.

Moving this to Draft while I try to fix that.

Edit: Should be fixed
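
For reference, the three-slash rule described above can be sketched as a small classifier. This is a hypothetical Python illustration of the rule from the Rust reference, not the scanner's actual code:

```python
def classify_comment(line: str) -> str:
    """Classify a Rust line comment per the three-slash rule:
    exactly "///" is an outer doc comment, "//!" is an inner doc
    comment, and anything else (including "////") is a plain
    line comment."""
    stripped = line.lstrip()
    if not stripped.startswith("//"):
        return "not a comment"
    if stripped[2:].startswith("!"):
        return "inner doc comment"
    leading_slashes = len(stripped) - len(stripped.lstrip("/"))
    if leading_slashes == 3:
        return "outer doc comment"
    return "line comment"
```

For example, classify_comment("//// text") yields "line comment" even though it begins with "///".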

aryx commented 2 years ago

Needs a rebase, because I've merged your other PR. Having those parser.c files in the repository is annoying.

resolritter commented 2 years ago

Rebased and also added support for //! as per https://github.com/tree-sitter/tree-sitter-rust/pull/99#discussion_r780672464

maxbrunsfeld commented 2 years ago

This PR seems to have introduced a bug where the scanner can get into an infinite loop. I'm seeing the Tree-sitter test suite hang after updating tree-sitter-rust to the latest master. I guess the test coverage in this repo wasn't sufficient to protect against this. I'm going to revert this for now.

maxbrunsfeld commented 2 years ago

It looks like the infinite loop was happening during some randomized mutation of the Doc comments test. Maybe an unterminated doc comment at the end of the file or something?

@resolritter Feel free to do a new PR if you can get this to avoid an infinite loop.

I also have questions about the need to do this using the external scanner. Couldn't you just do this in the grammar?

grammar({
  // ...

  rules: {
    // ...

    doc_comment: $ => token(choice(
      // exactly three leading slashes
      seq('///', optional(/[^/].*/)),
      seq('//!', /.*/),
    )),

    // any number of leading slashes other than three, which would produce a doc comment.
    line_comment: $ => token(seq(
      '//', optional('//'), /.*/
    )),
  }
});

maxbrunsfeld commented 2 years ago

Of course, this would not join all of the adjacent doc comments into one continuous node - you'd get one node per comment. I think this might be better though: I think it would make it easier to determine what ranges of text contain the documentation itself, because you wouldn't have to deal with leading whitespace. I also just think that it retains more information to provide a node for each comment, and it's somewhat "lossy" to group them all into one node.

I'm still open to the other approach though, if people have strong feelings that it's more useful to get a single node.

maxbrunsfeld commented 2 years ago

Whatever happens, I promise we won't wait a year to merge this time.

resolritter commented 2 years ago

@maxbrunsfeld

I also have questions about the need to do this using the external scanner. Couldn't you just do this in the grammar? Of course, this would not join all of the adjacent doc comments into one continuous node - you'd get one node per comment

Having the whole text in a single node is why it was done this way. What would be the alternative for highlighting the code below?

/// ```
/// use foo::Foo;
/// let bar = Foo::new("foo");
/// ```

I can only infer the following steps:

  1. Collect all the consecutive doc comment nodes in the same nesting level
  2. Join their slices from the source code into a buffer
  3. Remove the leading comment markers
  4. Reparse the text as markdown

Having the text in a single node gets rid of steps 1 and 2. Or do you see a more efficient way to go about that? Or do you think having to traverse the tree in order to collect the text is not a problem?
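
Steps 2 and 3 above might look roughly like this. This is a hypothetical sketch operating on plain strings rather than real tree-sitter nodes:

```python
def doc_text(comment_lines):
    """Join consecutive doc comment lines and strip the leading
    markers ("///" or "//!"), yielding the inner Markdown text
    to reparse (step 4 is left out here)."""
    out = []
    for line in comment_lines:
        body = line.lstrip()
        for marker in ("///", "//!"):
            if body.startswith(marker):
                body = body[len(marker):]
                break
        # A single space conventionally separates the marker from the text.
        out.append(body[1:] if body.startswith(" ") else body)
    return "\n".join(out)
```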

It looks like the infinite loop was happening during some randomized mutation of the Doc comments

How can I try this randomization when testing locally?

resolritter commented 2 years ago

I also just think that it retains more information to provide a node for each comment, and it's somewhat "lossy" to group them all into one node.

This is a good point. It is definitely "lossier" than other nodes since it also includes the leading whitespace for contiguous lines.

What's being gained by this approach is the ease of fetching the whole content directly, at the cost of less precision for the ranges. Feel free to close #128 if you feel like it isn't a good tradeoff.

maxbrunsfeld commented 2 years ago

What would be the alternative for highlighting the code below?

I think there are challenges either way, but it is more straightforward if you have a separate node for each comment.

Copying the doc comments' text into a separate buffer is not an option - we need to parse the code in place so that the positions of the nodes in the nested syntax tree correspond correctly to the original file. So what we need to do is to retrieve a list of ranges from the original file that should be parsed, together, in a nested language (markdown). We can then parse the contents of those ranges using Tree-sitter's set_included_ranges API.
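
Assuming each doc comment is its own node, the range list could be derived roughly like this. This is a sketch using hypothetical (start_byte, end_byte) tuples, not the real node or set_included_ranges API:

```python
def doc_content_ranges(source: bytes, comment_spans):
    """Given byte spans of consecutive "///" doc comment nodes in the
    original file, return the spans of their inner content (after the
    slashes and one conventional space), suitable for parsing in place
    as one nested Markdown document."""
    ranges = []
    for start, end in comment_spans:
        text = source[start:end]
        prefix = len(text) - len(text.lstrip(b"/"))  # skip the leading slashes
        if text[prefix:prefix + 1] == b" ":          # and one conventional space
            prefix += 1
        ranges.append((start + prefix, end))
    return ranges
```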

If we have a separate node for each comment, then we need to

If, on the other hand, we have one giant node, then we need to:

resolritter commented 2 years ago

I was not aware that (query)+ existed. This seems to work:

> ./node_modules/.bin/tree-sitter query <(echo -e "((line_comment)+ @doc)") <(echo -e "// foo\nfoo;\n// foo\n// foo")
/dev/fd/61
  pattern: 0
    capture: 0 - doc, start: (0, 0), end: (0, 6), text: `// foo`
  pattern: 0
    capture: 0 - doc, start: (2, 0), end: (2, 6), text: `// foo`
    capture: 0 - doc, start: (3, 0), end: (3, 6), text: `// foo`

Since this use-case is supported by the query API, I am fine with closing #128.
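
A downstream consumer could then collapse each matched group into a single range. This is a sketch over (start_line, end_line) pairs mirroring the query output above, not real tree-sitter API calls:

```python
def merge_adjacent(captures):
    """Merge captures on consecutive lines into contiguous blocks,
    mirroring what the (line_comment)+ query groups together."""
    blocks = []
    for start, end in captures:
        if blocks and start == blocks[-1][1] + 1:
            # Extend the previous block instead of starting a new one.
            blocks[-1] = (blocks[-1][0], end)
        else:
            blocks.append((start, end))
    return blocks
```

Applied to the capture lines above (0, then 2 and 3), this yields two blocks: one for line 0 and one spanning lines 2-3.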

aryx commented 2 years ago

What are the tree-sitter test suites that caught this regression? Could we run them in the CI of tree-sitter-rust?