Highlight ROxygen tags

wklimowicz commented 8 months ago

Is ROxygen tag highlighting something that can be added? With treesitter disabled, the tags get highlighted by the main neovim runtime:

https://github.com/neovim/neovim/blob/c651a0f643e7bd34eb740069a7b5b8c9f8759ecc/runtime/syntax/r.vim#L96-L159

[Screenshots: Treesitter Enabled vs. Treesitter Disabled]

Thanks for your work maintaining this project!

wurli commented 6 months ago

+1 for this!

zkamvar commented 3 months ago

Another +1!

zkamvar commented 3 months ago

I'm not that familiar with treesitter syntax or how it operates, but I wonder if prior art from other languages that have doc comments would help (e.g. Rust seems to have the hang of it).

DavisVaughan commented 2 months ago

If we were to support this, I do think we'd use the Rust approach linked above

The key points are:

- There would still be 1 top level comment node, so you don't have to do anything special for roxygen2 if you just want to ignore all comments
- Inside of that comment node, it would be possible to extract an optional field("doc", $.doc_content). If that field exists as a child of the comment node, then you can consider the line a roxygen2 comment line
- $.doc_content would contain the content of the doc line, i.e. it would drop the leading #' and possibly leading whitespace after the '.
  - Possibly it would contain the trailing \n, supposedly this is useful for markdown injection according to the Rust grammar
- tree-sitter-r would not be in charge of further parsing things like @param and other tags

I think that is as far as tree-sitter-r would go. We currently don't have any rules in tree-sitter-r that rely on community conventions and external packages, so I am somewhat hesitant to even do this!

I'm not entirely sure, but I think according to https://github.com/tree-sitter/tree-sitter-rust/pull/212 one thing the neovim community could do is create a roxygen2 tree-sitter grammar that is injected in when it sees a doc_content node. Essentially that would set the target language of the content to be roxygen2, which could then parse things like @param and add syntax highlighting for roxygen2 tags like that.

Alternatively, you could probably use an all style query to look for a consecutive block of comments that all have a "doc" field. That would basically let you use that field as a marker: you'd extract all the underlying text from that consecutive range of nodes and then post-process it (I imagine this is already possible today with an all-match based query that looks for #').

If someone who has some neovim developer experience wants to chime in on this plan, that would be helpful! It would be particularly useful to hear if you can already work around this using a match based query to find #' lines, and post process those yourself in some way (that's preferable to me to not rely on a community convention in the grammar), or if this really would be a massive improvement.


TymekDev commented 2 months ago

Hey 👋 I have been playing with the match suggestion in Neovim this morning. It seems that sub-node highlighting is not possible by using tree-sitter queries alone. I don't have anything to back that claim other than my experiments, though.

> We currently don't have any rules in tree-sitter-r that rely on community conventions and external packages, so I am somewhat hesitant to even do this!

That's a fair point. I have roxygen2 so ingrained into my brain that I wouldn't even bat an eye!

With that in mind, I don't think that adding a special treatment for #' would make sense for tree-sitter-r. Instead...

> create a roxygen2 tree-sitter grammar that is injected

I think this is the way to go. Once a tree-sitter-r-roxygen2 grammar exists, all that is left is to create queries/injections.scm in tree-sitter-r:

((comment) @injection.content
  (#set! injection.language "r_roxygen2"))

References:

- Language injection docs
  - injection.combined is worth noting - it could help with implicit title and description and @examples
- Injections into HTML (e.g. CSS into <style>) - this is done in nvim-treesitter
- Injections into markdown (e.g. code blocks, HTML) - this is done directly in markdown's grammar

I am interested in giving it a go and creating the grammar for roxygen2. I won't give any ETA, but I will report back if it takes off!

However, my experiments were not in vain. If you don't want to wait for tree-sitter-r-roxygen2 to come around, then I have a Lua-based solution that, at a glance, replicates the syntax/r.vim highlighting.

Lua-based Solution for Neovim

This approach relies on an autocommand. The upside? It works. The downside? Manual setup[^1].

[^1]: If this could be done with a query alone, then we would have an out-of-the-box solution and everyone would benefit.

This approach could be expanded to handle things like @importFrom pkg func1 func2 too. For example, it could differentiate the tag, the package name, and function names with different colors. You can go as far as your string-wrangling skills allow.
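To make the @importFrom idea concrete, here is a minimal sketch of the string wrangling involved, written in plain Python rather than Lua so it can be run outside Neovim. The helper name `import_from_spans` and the span labels are hypothetical, and the pattern only handles the simple `#' ` prefix at column 0; it is an illustration, not part of the solution above.

```python
import re

# Hypothetical pattern: "#' @importFrom <package> <function> [<function> ...]"
IMPORT_FROM = re.compile(r"^#' (@importFrom)\s+(\S+)\s+(.+?)\s*$")


def import_from_spans(line):
    """Return (kind, start_col, end_col) triples, or [] if the line does not match."""
    m = IMPORT_FROM.match(line)
    if m is None:
        return []
    spans = [
        ("tag", m.start(1), m.end(1)),
        ("package", m.start(2), m.end(2)),
    ]
    # Every whitespace-separated word after the package is treated as a function name.
    offset = m.start(3)
    for fn in re.finditer(r"\S+", line[m.start(3):m.end(3)]):
        spans.append(("function", offset + fn.start(), offset + fn.end()))
    return spans


line = "#' @importFrom pkg func1 func2"
for kind, start, end in import_from_spans(line):
    print(kind, repr(line[start:end]))
```

Each triple gives a 0-based start column and an exclusive end column, which is the shape a per-span highlight call would need.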

Demo

https://github.com/user-attachments/assets/0cef7a8d-743f-4dbf-80eb-8c66e45a8a1c

Code

[!IMPORTANT] Place this code in after/ftplugin/r.lua. It relies on ftplugin to run only in R files.

local get_root = function(bufnr, lang)
  local parser = vim.treesitter.get_parser(bufnr, lang, {})
  local tree = parser:parse()[1]
  return tree:root()
end

local highlight_roxygen2_tags = function(bufnr)
  -- Capture comments that start with "#' @tag"; #gsub! stores just the tag in metadata
  local query = [[
((comment) @comment.roxygen2
  (#lua-match? @comment.roxygen2 "^#' (@%a+).*$")
  (#gsub! @comment.roxygen2 "^#' (@%a+).*$" "%1"))
]]
  local root = get_root(bufnr, "r")
  local ts_query = vim.treesitter.query.parse("r", query)
  local ns = vim.api.nvim_create_namespace("r.comments.roxygen2")
  -- Drop highlights from the previous run so stale ones don't pile up
  vim.api.nvim_buf_clear_namespace(bufnr, ns, 0, -1)
  for id, node, metadata in ts_query:iter_captures(root, bufnr) do
    if ts_query.captures[id] == "comment.roxygen2" then
      local start_row, node_col, end_row, _ = vim.treesitter.get_node_range(node)
      local start_col = node_col + 3 -- skip leading "#' ", also for indented comments
      local end_col = start_col + #metadata[id].text -- add tag length

      vim.highlight.range(bufnr, ns, "@operator", { start_row, start_col }, { end_row, end_col })
    end
  end
end

vim.api.nvim_create_autocmd({ "BufWinEnter", "TextChanged", "TextChangedI" }, {
  desc = "Highlight roxygen2 tags",
  buffer = 0,
  callback = function(args)
    highlight_roxygen2_tags(args.buf)
  end,
})

DavisVaughan commented 2 months ago

For

((comment) @injection.content
  (#set! injection.language "r_roxygen2"))

it feels wrong to me to assert that every comment now has the language of r_roxygen2. That's what that says, right?

That was why I was suggesting that you'd only do that on (doc_content), even though that does require just a little bit of knowledge about roxygen2 in tree-sitter-r.

TymekDev commented 2 months ago

Right. It could be changed to:

((comment) @injection.content
  (#match? @injection.content "^#' ")
  (#set! injection.language "r_roxygen2"))

I suppose this comes down to roxygen2 grammar design. For the above to work, the grammar would have to handle #' on its own as the entire (comment) would get re-parsed by it. Otherwise, it would depend on (doc_content).

I just did a quick scan through other "doc" grammars. luadoc grammar doesn't include the leading comment string. On the other hand, jsdoc and phpdoc both include it.

Personally, I don't have a strong opinion on which approach a roxygen2 grammar should take. Nor do I see any immediate benefits of using one approach over the other.

DavisVaughan commented 2 months ago

Personally I think if we can make something like

((comment) @injection.content
  (#match? @injection.content "^#' ")
  (#set! injection.language "r_roxygen2"))

work then that would be greatly preferable to keep tree-sitter-r agnostic to any R packages. What you have there is pretty close to what I thought was possible with the existing setup. And it's nice to see that there is some prior art like jsdoc that pretty much does it exactly this way. I see that tree-sitter-javascript even has an injection query for jsdoc (I don't think jsdoc has any marker character like #', so it makes sense that they just mark all comments as possibly jsdoc comments): https://github.com/tree-sitter/tree-sitter-javascript/blob/b6f0624c1447bc209830b195999b78a56b10a579/queries/injections.scm#L20-L23

Note that the exact rule for "this is a roxygen2 comment" is more flexible than just ^#'. It technically allows:

- Leading whitespace
- One or more leading #, like #####' is valid

The exact rule is here, I imagine that it wouldn't be too hard to find a regex that supports this: https://github.com/r-lib/roxygen2/blob/9652d15221109917d46768e836eaf55e33c21633/src/parser2.cpp#L43-L56
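As a rough sketch of such a regex, following only the two points listed above (optional leading whitespace, one or more #, then '): the Python below is an assumption based on that description, not a transcription of the linked C++ code. The helper names are hypothetical, and stripping a single space after the ' is an extra assumption.

```python
import re

# Assumed rule: optional leading whitespace, one or more '#', then a single quote.
ROXYGEN_PREFIX = re.compile(r"^\s*#+'")


def is_roxygen_comment(line):
    """True if the line looks like a roxygen2 comment under the assumed rule."""
    return ROXYGEN_PREFIX.match(line) is not None


def strip_roxygen_prefix(line):
    """Remove the marker; also drops one space after the quote (an assumption)."""
    return re.sub(r"^\s*#+' ?", "", line, count=1)


for candidate in ["#' @param x", "  #####' Title", "# plain comment", "x <- 1 #' trailing"]:
    print(repr(candidate), is_roxygen_comment(candidate))
```

In an injection query the same idea would replace "^#' " in the #match? predicate; note that a (comment) node's text normally starts at the #, so the leading-whitespace part mostly matters when matching whole lines.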

DavisVaughan commented 2 months ago

In theory the roxygen2 grammar could also give us the ability to mark contents inside an @examples block as R code

((examples_content) @injection.content
  (#set! injection.language "r"))

i.e.

#' @param x a param
#'
#' @examples # the lines after this one are R code if you strip the leading `#'`
#' 1 + 1 
#' fn(
#'  a,
#'  b
#' )

Which could allow highlighting of R code in @examples block to maybe just work?
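To illustrate what "strip the leading #'" would hand to the R parser, here is a small Python sketch that pulls the @examples section out of a roxygen2 block. It is not how a tree-sitter injection works internally; the function name `examples_code` is hypothetical, the prefix pattern is the assumed rule (whitespace, one or more #, ', one optional space), and the section is assumed to end at the next line starting with a tag.

```python
import re

PREFIX = re.compile(r"^\s*#+' ?")  # assumed roxygen2 marker, plus one optional space
TAG = re.compile(r"^@(\w+)")       # a tag at the start of the stripped line


def examples_code(lines):
    """Collect the lines of the first @examples section as plain R code."""
    code = []
    in_examples = False
    for line in lines:
        text = PREFIX.sub("", line, count=1)
        tag = TAG.match(text)
        if tag:
            if in_examples:
                break  # the next tag ends the examples section
            in_examples = tag.group(1) == "examples"
            rest = text[tag.end():].strip()
            if in_examples and rest:
                code.append(rest)  # code on the same line as @examples
            continue
        if in_examples:
            code.append(text)
    return "\n".join(code)


block = [
    "#' @param x a param",
    "#'",
    "#' @examples",
    "#' 1 + 1",
    "#' fn(",
    "#'  a,",
    "#'  b",
    "#' )",
]
print(examples_code(block))
```

The printed text is what an injected R parser would need to see for the multi-line fn( call to parse as one expression.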

Also worth looking at injection.combined as an option here https://tree-sitter.github.io/tree-sitter/syntax-highlighting#language-injection. It seems like it would smash all roxygen2 comment lines together into one nested document, and then parse that whole document once, which seems like it would be nice? IIUC that would allow the multi-line @param here to be parsed as 1 node containing the tag and its full description

#' @param x this is a long multiline
#'   description of this param
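A plain-Python sketch of that "smash the lines together, then parse once" idea, to show why it would help with multi-line tags. This only imitates the expected effect and does not use tree-sitter; the function names are hypothetical and the prefix pattern is the assumed roxygen2 rule (whitespace, one or more #, ', one optional space).

```python
import re

PREFIX = re.compile(r"^\s*#+' ?")      # assumed roxygen2 marker, plus one optional space
TAG = re.compile(r"^@(\w+)\s*(.*)$")   # "@tag rest of line"


def combine(lines):
    """Strip the markers and join consecutive roxygen2 lines into one document."""
    return "\n".join(PREFIX.sub("", line, count=1) for line in lines)


def parse_tags(document):
    """Group each tag with every continuation line that follows it."""
    tags = []
    for text in document.split("\n"):
        m = TAG.match(text)
        if m:
            tags.append([m.group(1), [m.group(2)] if m.group(2) else []])
        elif tags and text.strip():
            tags[-1][1].append(text.strip())
    return [(name, " ".join(parts)) for name, parts in tags]


lines = [
    "#' @param x this is a long multiline",
    "#'   description of this param",
]
print(parse_tags(combine(lines)))
```

Parsed line by line, the second line would have no tag to attach to; parsed as one combined document, it becomes part of the @param description.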

TymekDev commented 2 months ago

> i don't think jsdoc has any marker character like #'

The comment block has to start with /**. As I understand it, the jsdoc grammar skips a leading asterisk and any subsequent whitespace using extras, and then looks for a leading /* (making it /**).

It also looks like as soon as it doesn't match /** it falls back to a regular comment.

> Note that the exact rule for "this is a roxygen2 comment" is more flexible than just ^#'.

Today I learned! Thanks for pointing that out. I think extras could be explored for stripping that, similar to how jsdoc removes asterisks and whitespace.

> In theory the roxygen2 grammar could also give us the ability to mark contents inside an @examples block as R code

That's what I also thought :-)

> injection.combined

Yes! I would definitely take a look at how to efficiently pass things around. My intuition is to try passing a contiguous block of comments to the roxygen2 parser and treating it as one document (similarly to what jsdoc does).

I suppose the next step would be sketching an outline of how the roxygen2 grammar would be structured... :-)

TymekDev commented 1 month ago

I have been experimenting with how to approach tree-sitter-roxygen2 and ran into an issue. R uses line comments, and tree-sitter cannot combine multiple nodes into a single capture. This means it's not possible to properly inject the roxygen2 parser into the R parser.

For example, given the following comment:

#' @description Lorem ipsum dolor sit amet,
#' mauris elit justo sociosqu, mauris vel at.
#' At quam, amet ultrices at cras et semper.
#' @noRd

The result would be:

(comment
  (tag
    name: (tag_name)                ; (tag_name) is "@description"
    description: (description)))    ; (description) is "Lorem ipsum dolor sit amet,"
(comment)
(comment)
(comment
  (tag
    name: (tag_name)))              ; (tag_name) is "@noRd"

While this is not a showstopper, it means that tree-sitter-roxygen2 won't be as robust as we imagined, because it will work at the line level.

Unless it is possible to make the R grammar join (comment) nodes from adjacent lines into a single node. However, doing that doesn't feel right to me.

DavisVaughan commented 1 month ago

I thought this was what injection.combined was for

https://github.com/tree-sitter/tree-sitter/blob/8500e331ebfd49e66dd935b8a9c7a58aba68af37/docs/section-4-syntax-highlighting.md?plain=1#L370-L371

Does that not combine the sequential comments into 1 "document" that tree-sitter-roxygen2 gets?

r-lib / tree-sitter-r

Highlight ROxygen tags #68
