Open wklimowicz opened 8 months ago
+1 for this!
Another +1!
I'm not that familiar with treesitter syntax or how it operates, but I wonder if prior art from other languages that have doc comments would help (e.g. rust seems to have the hang of it)
If we were to support this, I do think we'd use the Rust approach linked above
The key points are:
- A roxygen2 comment would still be a `comment` node, so you don't have to do anything special for roxygen2 if you just want to ignore all comments
- From the `comment` node, it would be possible to extract an optional `field("doc", $.doc_content)`. If that field exists as a child of the `comment` node, then you can consider the line a roxygen2 comment line
- `$.doc_content` would contain the content of the doc line, i.e. it would drop the leading `#'` and possibly leading whitespace after the `'`.
- It would include the trailing `\n`, supposedly this is useful for markdown injection according to the Rust grammar
- It would not parse `@param` and other tags

I think that is as far as tree-sitter-r would go. We currently don't have any rules in tree-sitter-r that rely on community conventions and external packages, so I am somewhat hesitant to even do this!
I'm not entirely sure, but I think according to https://github.com/tree-sitter/tree-sitter-rust/pull/212 one thing the neovim community could do is create a `roxygen2` tree-sitter grammar that is injected in when it sees a `doc_content` node. Essentially that would set the target language of the content to be `roxygen2`, which could then parse things like `@param` and add syntax highlighting for roxygen2 tags like that.
Alternatively you could probably use an `all` style query to look for a consecutive block of comments that all have a `"doc"` field. That would basically let you use that field as a marker, and you'd extract all the underlying text from that consecutive range of nodes, and you could probably post process that (I imagine this is already possible today with an `all-match` based query that looks for `#'`).
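To make the post-processing idea concrete, here is a minimal sketch in plain Lua, with no Neovim or tree-sitter API involved. The helper name `roxygen_blocks` is hypothetical, and it assumes the simple `#'` marker at the start of the line; it only illustrates grouping consecutive doc lines and stripping the marker.

```lua
-- Hypothetical helper: collect runs of consecutive "#'" lines from a list of
-- source lines and return each run with the "#'" marker stripped.
local function roxygen_blocks(lines)
  local blocks, current = {}, nil
  for _, line in ipairs(lines) do
    -- Capture everything after "#'" and one optional space.
    -- An empty doc line ("#'") yields "", which is still truthy in Lua.
    local content = line:match("^#' ?(.*)$")
    if content then
      current = current or {}
      current[#current + 1] = content
    elseif current then
      -- A non-doc line ends the current run.
      blocks[#blocks + 1] = current
      current = nil
    end
  end
  if current then
    blocks[#blocks + 1] = current
  end
  return blocks
end
```

Each returned block is a list of stripped lines, so a caller could join them with `table.concat(block, "\n")` before further string processing.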
If someone who has some neovim developer experience wants to chime in on this plan, that would be helpful! It would be particularly useful to hear if you can already work around this using a `match` based query to find `#'` lines, and post process those yourself in some way (that's preferable to me to not rely on a community convention in the grammar), or if this really would be a massive improvement.
Relevant links:
Hey 👋 I have been playing with the match suggestion in Neovim this morning. It seems that sub-node highlighting is not possible by using tree-sitter queries alone. I don't have anything to back that claim other than my experiments, though.
> We currently don't have any rules in tree-sitter-r that rely on community conventions and external packages, so I am somewhat hesitant to even do this!
That's a fair point. I have roxygen2 so ingrained into my brain that I wouldn't even bat an eye!
With that in mind, I don't think that adding a special treatment for `#'` would make sense for tree-sitter-r. Instead...

> create a roxygen2 tree-sitter grammar that is injected

I think this is the way to go. Once the tree-sitter-r-roxygen2 grammar exists, all that is left to do is creating `queries/injections.scm` in tree-sitter-r:

```scm
((comment) @injection.content
  (#set! injection.language "r_roxygen2"))
```
References:
- `injection.combined` is worth noting - it could help with implicit title and description and `@examples`
- `<style>`) - this is done in nvim-treesitter

I am interested in giving it a go and creating the grammar for roxygen2. I won't give any ETA, but I will report back if it takes off!
However, my experiments weren't in vain. If you don't want to wait for tree-sitter-r-roxygen2 to come around, then I have a Lua-based solution that at a glance replicates `syntax/r.vim` highlighting.
This approach relies on an autocommand. The upside? It works. The downside? Manual setup[^1].
[^1]: If this could be done with a query alone, then we would have an out-of-the-box solution and everyone would benefit.
This approach could be expanded to handle things like `@importFrom pkg func1 func2` too. For example, it could differentiate the tag, the package name, and function names with different colors. You can go as far as your string-wrangling skills allow you to.
https://github.com/user-attachments/assets/0cef7a8d-743f-4dbf-80eb-8c66e45a8a1c
> [!IMPORTANT]
> Place this code in `after/ftplugin/r.lua`. It relies on `ftplugin` to run only in R files.
```lua
-- Parse the buffer and return the root node of its syntax tree.
local get_root = function(bufnr, lang)
  local parser = vim.treesitter.get_parser(bufnr, lang, {})
  local tree = parser:parse()[1]
  return tree:root()
end

local highlight_roxygen2_tags = function(bufnr)
  -- Match comments that start with "#' @tag"; #gsub! stores the tag alone
  -- in the capture's metadata.
  local query = [[
    ((comment) @comment.roxygen2
      (#lua-match? @comment.roxygen2 "^#' (@%a+).*$")
      (#gsub! @comment.roxygen2 "^#' (@%a+).*$" "%1"))
  ]]
  local root = get_root(bufnr, "r")
  local ts_query = vim.treesitter.query.parse("r", query)
  local ns = vim.api.nvim_create_namespace("r.comments.roxygen2")
  -- Drop highlights from previous runs so they don't pile up on every edit.
  vim.api.nvim_buf_clear_namespace(bufnr, ns, 0, -1)
  for id, node, metadata in ts_query:iter_captures(root, bufnr) do
    if ts_query.captures[id] == "comment.roxygen2" then
      local start_row, _, end_row, _ = vim.treesitter.get_node_range(node)
      local start_col = 3 -- skip leading "#' "
      local end_col = start_col + #metadata[id].text -- add tag length
      vim.highlight.range(bufnr, ns, "@operator", { start_row, start_col }, { end_row, end_col })
    end
  end
end

-- Re-run the highlighter when the buffer is shown or its text changes.
vim.api.nvim_create_autocmd({ "BufWinEnter", "TextChanged", "TextChangedI" }, {
  desc = "Highlight roxygen2 tags",
  buffer = 0,
  callback = function(args)
    highlight_roxygen2_tags(args.buf)
  end,
})
```
For

```scm
((comment) @injection.content
  (#set! injection.language "r_roxygen2"))
```

it feels wrong to me to assert that every comment now has the language of `r_roxygen2`. That's what that says, right?

That was why I was suggesting that you'd only do that on `(doc_content)`, even though that does require just a little bit of knowledge about roxygen2 in tree-sitter-r
Right. It could be changed to:

```scm
((comment) @injection.content
  (#match? @injection.content "^#' ")
  (#set! injection.language "r_roxygen2"))
```

I suppose this comes down to roxygen2 grammar design. For the above to work, the grammar would have to handle `#'` on its own as the entire `(comment)` would get re-parsed by it. Otherwise, it would depend on `(doc_content)`.
I just did a quick scan through other "doc" grammars. luadoc grammar doesn't include the leading comment string. On the other hand, jsdoc and phpdoc both include it.
Personally, I don't have a strong opinion on which approach a roxygen2 grammar should take, nor do I see any immediate benefits of using one approach over the other.
Personally I think if we can make something like

```scm
((comment) @injection.content
  (#match? @injection.content "^#' ")
  (#set! injection.language "r_roxygen2"))
```

work then that would be greatly preferable to keep tree-sitter-r agnostic to any R packages. What you have there is pretty close to what I thought was possible with the existing setup. And it's nice to see that there is some prior art like jsdoc that pretty much does it exactly this way. I see that tree-sitter-javascript even has an injection query for jsdoc (I don't think jsdoc has any marker character like `#'`, so it makes sense that they just mark all comments as possibly jsdoc comments) https://github.com/tree-sitter/tree-sitter-javascript/blob/b6f0624c1447bc209830b195999b78a56b10a579/queries/injections.scm#L20-L23
Note that the exact rule for "this is a roxygen2 comment" is more flexible than just `^#'`. It technically allows:
- Multiple `#`, like `#####'` is valid

In theory the roxygen2 grammar could also give us the ability to mark contents inside an `@examples` block as R code

```scm
((examples_content) @injection.content
  (#set! injection.language "r"))
```
i.e.

```r
#' @param x a param
#'
#' @examples # the lines after this one are R code if you strip the leading `#'`
#' 1 + 1
#' fn(
#'   a,
#'   b
#' )
```
Which could allow highlighting of R code in an `@examples` block to maybe just work?
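As a rough picture of what an `examples_content` node would need to cover, here is a plain-Lua sketch. The function name `examples_code` is hypothetical, and it assumes the `#'` marker has already been stripped from each line and that an `@examples` section runs until the next `@tag`; a real grammar may draw the boundary differently.

```lua
-- Hypothetical helper: given doc lines with the "#'" marker already removed,
-- return the text that belongs to @examples sections as one R snippet.
local function examples_code(doc_lines)
  local code, in_examples = {}, false
  for _, line in ipairs(doc_lines) do
    local tag, rest = line:match("^(@%a+)%s*(.*)$")
    if tag then
      -- Any tag starts a new section; only @examples holds R code.
      in_examples = (tag == "@examples")
      if in_examples and rest ~= "" then
        code[#code + 1] = rest
      end
    elseif in_examples then
      code[#code + 1] = line
    end
  end
  return table.concat(code, "\n")
end
```

Text on the `@examples` line itself is kept, because in the example above it is already a valid R comment.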
Also worth looking at `injection.combined` as an option here https://tree-sitter.github.io/tree-sitter/syntax-highlighting#language-injection. It seems like it would smash all roxygen2 comment lines together into one nested document, and then parse that whole document once, which seems like it would be nice? IIUC that would allow the multi-line `@param` here to be parsed as 1 node containing the tag and its full description

```r
#' @param x this is a long multiline
#' description of this param
```
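To illustrate why a combined document helps, here is a small plain-Lua sketch of the idea. It is not how tree-sitter or any roxygen2 grammar works internally; the function name `parse_tags` is hypothetical, and it assumes the `#'` marker has already been stripped from each line.

```lua
-- Hypothetical sketch: treat stripped doc lines as one document and attach
-- continuation lines to the most recent tag.
local function parse_tags(doc_lines)
  local tags = {}
  for _, line in ipairs(doc_lines) do
    local name, rest = line:match("^%s*(@%a+)%s*(.*)$")
    if name then
      tags[#tags + 1] = { name = name, text = rest }
    elseif #tags > 0 and line:match("%S") then
      -- Non-blank line without a tag: it continues the previous tag.
      local tag = tags[#tags]
      local trimmed = line:match("^%s*(.-)%s*$")
      if tag.text == "" then
        tag.text = trimmed
      else
        tag.text = tag.text .. " " .. trimmed
      end
    end
    -- Lines before the first tag (implicit title/description) are ignored here.
  end
  return tags
end
```

Fed the two lines above, this yields a single `@param` entry whose text spans both lines, which is the behaviour a line-by-line parse cannot provide.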
> i don't think jsdoc has any marker character like `#'`

The comment block has to start with `/**`. The way I understand it, the jsdoc grammar skips the leading asterisk and any subsequent whitespace using `extras` and then looks for a leading `/*` (making it `/**`).

It also looks like as soon as it doesn't match `/**` it falls back to a regular comment.
> Note that the exact rule for "this is a roxygen2 comment" is more flexible than just `^#'`.

Today I learned! Thanks for pointing that out. I think `extras` could be explored for stripping that, in a similar manner to how jsdoc removes asterisks and whitespace.
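For reference, a more permissive marker check could look like the plain-Lua sketch below. The exact rule is an assumption on my part: optional leading whitespace, one or more `#`, then `'`. The helper name `strip_roxygen_prefix` is hypothetical.

```lua
-- Assumed rule: optional leading whitespace, one or more "#", then "'".
-- Returns the text after the marker (and one optional space), or nil when
-- the line is not a roxygen2 comment.
local function strip_roxygen_prefix(line)
  return (line:match("^%s*#+' ?(.*)$"))
end
```

A grammar that receives the whole `(comment)` node would need to skip the same prefix, whether through `extras` or an explicit rule.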
> In theory the roxygen2 grammar could also give us the ability to mark contents inside an `@examples` block as R code

That's what I also thought :-)
> `injection.combined`

Yes! I would definitely take a look at how to efficiently pass things around. My intuition is to try passing a contiguous block of comments to the roxygen2 parser and treating them as one document (similarly to what jsdoc does).
I suppose the next step would be sketching an outline of how the roxygen2 grammar would be structured... :-)
I have been experimenting with how to approach tree-sitter-roxygen2 and ran into an issue. R uses line comments, and tree-sitter cannot combine multiple nodes into a single capture. This means it's not possible to properly inject the roxygen2 parser into the R parser.
For example, given the following comment:

```r
#' @description Lorem ipsum dolor sit amet,
#' mauris elit justo sociosqu, mauris vel at.
#' At quam, amet ultrices at cras et semper.
#' @noRd
```
The result would be:

```scm
(comment
  (tag
    name: (tag_name) ; (tag_name) is "@description"
    description: (description))) ; (description) is "Lorem ipsum dolor sit amet,"
(comment)
(comment)
(comment
  (tag
    name: (tag_name))) ; (tag_name) is "@noRd"
```
While this is not a showstopper, it means that tree-sitter-roxygen2 won't be as robust as we imagined, because it will work at the line level.
Unless it is possible to make the R grammar join `(comment)` nodes from adjacent lines into a single node. However, doing that doesn't feel right to me.
I thought this was what `injection.combined` was for.

Does that not combine the sequential comments into 1 "document" that tree-sitter-roxygen2 gets?
Is roxygen2 tag highlighting something that can be added? With treesitter disabled, the tags get highlighted by the main neovim runtime:
https://github.com/neovim/neovim/blob/c651a0f643e7bd34eb740069a7b5b8c9f8759ecc/runtime/syntax/r.vim#L96-L159
Thanks for your work maintaining this project!