r-lib / tree-sitter-r

Highlight ROxygen tags #68

Open wklimowicz opened 8 months ago

wklimowicz commented 8 months ago

Is ROxygen tag highlighting something that can be added? With treesitter disabled, the tags are highlighted by the main Neovim runtime:

https://github.com/neovim/neovim/blob/c651a0f643e7bd34eb740069a7b5b8c9f8759ecc/runtime/syntax/r.vim#L96-L159

| Treesitter Enabled | Treesitter Disabled |
| --- | --- |
| (image) | (image) |

Thanks for your work maintaining this project!

wurli commented 6 months ago

+1 for this!

zkamvar commented 3 months ago

Another +1!

zkamvar commented 3 months ago

I'm not that familiar with treesitter syntax or how it operates, but I wonder if prior art from other languages that have doc comments would help (e.g. Rust seems to have the hang of it).

DavisVaughan commented 2 months ago

If we were to support this, I do think we'd use the Rust approach linked above

The key points are:

I think that is as far as tree-sitter-r would go. We currently don't have any rules in tree-sitter-r that rely on community conventions and external packages, so I am somewhat hesitant to even do this!

I'm not entirely sure, but I think according to https://github.com/tree-sitter/tree-sitter-rust/pull/212 one thing the neovim community could do is create a roxygen2 tree-sitter grammar that is injected in when it sees a doc_content node. Essentially that would set the target language of the content to be roxygen2, which could then parse things like @param and add syntax highlighting for roxygen2 tags like that.

Alternatively, you could probably use an all-match style query to look for a consecutive block of comments that all have a "doc" field. That would basically let you use that field as a marker: you'd extract all the underlying text from that consecutive range of nodes and then post-process it. (I imagine this is already possible today with an all-match based query that looks for #'.)
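As a rough sketch of what such a query might look like without any "doc" field, assuming the R grammar exposes each line comment as a (comment) node: a quantified capture collects a run of adjacent comments, and for quantified captures #match? only succeeds when every captured node matches the pattern. The capture name @roxygen.block is made up for illustration, and how a given editor's query engine treats quantified captures should be checked before relying on this.

```scheme
; Hypothetical capture name; matches a run of sibling comments
; only if every comment in the run starts with #'
((comment)+ @roxygen.block
  (#match? @roxygen.block "^#'"))
```

The captured nodes would still need post-processing outside the query to pull out individual tags.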

If someone who has some neovim developer experience wants to chime in on this plan, that would be helpful! It would be particularly useful to hear whether you can already work around this by using a match based query to find #' lines and post-processing those yourself in some way (I'd prefer that, so the grammar doesn't rely on a community convention), or whether this really would be a massive improvement.

Relevant links:

TymekDev commented 2 months ago

Hey 👋 I have been playing with the match suggestion in Neovim this morning. It seems that sub-node highlighting is not possible by using tree-sitter queries alone. I don't have anything to back that claim other than my experiments, though.

> We currently don't have any rules in tree-sitter-r that rely on community conventions and external packages, so I am somewhat hesitant to even do this!

That's a fair point. I have roxygen2 so ingrained into my brain that I wouldn't even bat an eye!

With that in mind, I don't think that adding a special treatment for #' would make sense for tree-sitter-r. Instead...

> create a roxygen2 tree-sitter grammar that is injected

I think this is the way to go. Once a tree-sitter-r-roxygen2 grammar exists, all that is left is to create queries/injections.scm in tree-sitter-r:

((comment) @injection.content
  (#set! injection.language "r_roxygen2"))

References:

I am interested in giving it a go and creating the grammar for roxygen2. I won't give any ETA, but I will report back if it takes off!

However, my experiments weren't in vain. If you don't want to wait for tree-sitter-r-roxygen2 to come around, I have a Lua-based solution that, at a glance, replicates the syntax/r.vim highlighting.

Lua-based Solution for Neovim

This approach relies on an autocommand. The upside? It works. The downside? Manual setup[^1].

[^1]: If this could be done with a query alone, then we would have an out-of-the-box solution and everyone would benefit.

This approach could be expanded to handle things like @importFrom pkg func1 func2 too. For example, it could differentiate the tag, the package name, and the function names with different colors. You can go as far as your string-wrangling skills allow.

Demo

https://github.com/user-attachments/assets/0cef7a8d-743f-4dbf-80eb-8c66e45a8a1c

Code

[!IMPORTANT] Place this code in after/ftplugin/r.lua. It relies on ftplugin to run only in R files.

local get_root = function(bufnr, lang)
  local parser = vim.treesitter.get_parser(bufnr, lang, {})
  local tree = parser:parse()[1]
  return tree:root()
end

local highlight_roxygen2_tags = function(bufnr)
  local query = [[
((comment) @comment.roxygen2
  (#lua-match? @comment.roxygen2 "^#' (@%a+).*$")
  (#gsub! @comment.roxygen2 "^#' (@%a+).*$" "%1"))
]]
  local root = get_root(bufnr, "r")
  local ts_query = vim.treesitter.query.parse("r", query)
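  local ns = vim.api.nvim_create_namespace("r.comments.roxygen2")
  -- Clear highlights from earlier runs so stale extmarks don't pile up on every text change
  vim.api.nvim_buf_clear_namespace(bufnr, ns, 0, -1)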
  for id, node, metadata in ts_query:iter_captures(root, bufnr) do
    if ts_query.captures[id] == "comment.roxygen2" then
      local start_row, _, end_row, _ = vim.treesitter.get_node_range(node)
      local start_col = 3 -- skip leading "#' "
      local end_col = start_col + #metadata[id].text -- add tag length

      vim.highlight.range(bufnr, ns, "@operator", { start_row, start_col }, { end_row, end_col })
    end
  end
end

vim.api.nvim_create_autocmd({ "BufWinEnter", "TextChanged", "TextChangedI" }, {
  desc = "Highlight roxygen2 tags",
  buffer = 0,
  callback = function(args)
    highlight_roxygen2_tags(args.buf)
  end,
})

DavisVaughan commented 2 months ago

For

((comment) @injection.content
  (#set! injection.language "r_roxygen2"))

it feels wrong to me to assert that every comment now has the language of r_roxygen2. That's what that says, right?

That was why I was suggesting that you'd only do that on (doc_content), even though that does require just a little bit of knowledge about roxygen2 in tree-sitter-r.

TymekDev commented 2 months ago

Right. It could be changed to:

((comment) @injection.content
  (#match? @injection.content "^#' ")
  (#set! injection.language "r_roxygen2"))

I suppose this comes down to roxygen2 grammar design. For the above to work, the grammar would have to handle #' on its own as the entire (comment) would get re-parsed by it. Otherwise, it would depend on (doc_content).

I just did a quick scan through other "doc" grammars. luadoc grammar doesn't include the leading comment string. On the other hand, jsdoc and phpdoc both include it.

Personally, I don't have a strong opinion on which approach a roxygen2 grammar should take, nor do I see any immediate benefit of one approach over the other.

DavisVaughan commented 2 months ago

Personally I think if we can make something like

((comment) @injection.content
  (#match? @injection.content "^#' ")
  (#set! injection.language "r_roxygen2"))

work, then that would be greatly preferable, since it keeps tree-sitter-r agnostic to any R packages. What you have there is pretty close to what I thought was possible with the existing setup. And it's nice to see that there is some prior art like jsdoc that does it pretty much exactly this way. I see that tree-sitter-javascript even has an injection query for jsdoc (I don't think jsdoc has any marker character like #', so it makes sense that they just mark all comments as possibly jsdoc comments): https://github.com/tree-sitter/tree-sitter-javascript/blob/b6f0624c1447bc209830b195999b78a56b10a579/queries/injections.scm#L20-L23

Note that the exact rule for "this is a roxygen2 comment" is more flexible than just ^#'. It technically allows:

DavisVaughan commented 2 months ago

In theory the roxygen2 grammar could also give us the ability to mark contents inside an @examples block as R code

((examples_content) @injection.content
  (#set! injection.language "r"))

i.e.

#' @param x a param
#'
#' @examples # the lines after this one are R code if you strip the leading `#'`
#' 1 + 1 
#' fn(
#'  a,
#'  b
#' )

Which could allow highlighting of R code in @examples block to maybe just work?


Also worth looking at injection.combined as an option here https://tree-sitter.github.io/tree-sitter/syntax-highlighting#language-injection. It seems like it would smash all roxygen2 comment lines together into one nested document, and then parse that whole document once, which seems like it would be nice? IIUC that would allow the multi-line @param here to be parsed as 1 node containing the tag and its full description

#' @param x this is a long multiline
#'   description of this param
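
For illustration, an injections.scm entry using this property might look like the sketch below. It assumes a parser registered under the name r_roxygen2 (the hypothetical name used earlier in this thread) and reuses the ^#' check from above; injection.combined is the property described in the tree-sitter documentation linked above. Since each (comment) node includes its leading #', the nested grammar would presumably still have to deal with that prefix on every line.

```scheme
; Sketch only: r_roxygen2 is a hypothetical parser name.
; injection.combined asks for all matching nodes to be parsed
; together as one nested document instead of one document per comment.
((comment) @injection.content
  (#match? @injection.content "^#' ")
  (#set! injection.language "r_roxygen2")
  (#set! injection.combined))
```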

TymekDev commented 2 months ago

> I don't think jsdoc has any marker character like #'

The comment block has to start with /**. The way I understand it, the jsdoc grammar skips the leading asterisk and any subsequent whitespace using extras, and then looks for a leading /* (making it /**).

It also looks like as soon as it doesn't match /** it falls back to a regular comment.

> Note that the exact rule for "this is a roxygen2 comment" is more flexible than just ^#'.

Today I learned! Thanks for pointing that out. I think extras could be explored for stripping that prefix, similar to how jsdoc removes asterisks and whitespace.

> In theory the roxygen2 grammar could also give us the ability to mark contents inside an @examples block as R code

That's what I also thought :-)

> injection.combined

Yes! I would definitely take a look at how to pass things around efficiently. My intuition is to try passing a contiguous block of comments to the roxygen2 parser and treating it as one document (similar to what jsdoc does).


I suppose the next step would be sketching an outline of how the roxygen2 grammar would be structured... :-)

TymekDev commented 1 month ago

I have been experimenting with how to approach tree-sitter-roxygen2 and ran into an issue. R uses line comments, and tree-sitter cannot combine multiple nodes into a single capture. This means it's not possible to properly inject the roxygen2 parser into the R parser.

For example, given the following comment:

#' @description Lorem ipsum dolor sit amet,
#' mauris elit justo sociosqu, mauris vel at.
#' At quam, amet ultrices at cras et semper.
#' @noRd

The result would be:

(comment
  (tag
    name: (tag_name)                ; (tag_name) is "@description"
    description: (description)))    ; (description) is "Lorem ipsum dolor sit amet,"
(comment)
(comment)
(comment
  (tag
    name: (tag_name)))              ; (tag_name) is "@noRd"

While this is not a showstopper, it means that tree-sitter-roxygen2 won't be as robust as we imagined, because it will work at the line level.

Unless it is possible to make the R grammar join (comment) nodes from adjacent lines into a single node. However, doing that doesn't feel right to me.

DavisVaughan commented 1 month ago

I thought this was what injection.combined was for

https://github.com/tree-sitter/tree-sitter/blob/8500e331ebfd49e66dd935b8a9c7a58aba68af37/docs/section-4-syntax-highlighting.md?plain=1#L370-L371

Does that not combine the sequential comments into 1 "document" that tree-sitter-roxygen2 gets?