It's very common for users to want to match TODO in addition to TODO: (ending in colon). To match that, users needed to match over all text nodes, since every word is a text node, that was slow. So the main improvement here is to not expose text nodes, and instead expose any uppercase word as a node. Unless you write all of your comments very angry in SCREAMING case, performance should improve.
Changes
Don't include text nodes at all, instead expose all uppercase words as tag nodes.
This should improve performance, but it will yield more false positives, but I think it's better if users explicitly match the tag names they want (TODO, NOTE, etc), so it shouldn't be a problem in practice.
Tags are divided in three types:
Simple tags: They are basically any uppercase word.
Normal tags: They are uppercase words immediately followed by a :.
Annotated tags: They are like a normal tag, but they can include an annotation TODO (something):.
The (user) node was renamed to (annotation)
No longer exposing text nodes implies that you can no longer match things like #10 or !10. If people want, I can introduce a way to match those things, maybe a (reference) node /[@#!]([a-zA-Z0-9)/.
While this new approach makes the parser less flexible without the text nodes, it will make it faster, or at least faster to match, since the parsing itself is/was already fast (doing a match over text nodes is the slow operation). Another approach I was thinking is to parse the keywords directly, but that makes it even less flexible, since any new keywords will need to be added in the parser instead of a query.
Moving everything to JS means losing a little bit of control over the exact matching of the tag name, resulting in matches for:
TODO:nospace (previously, it was needed to have a space after :).
TODO(\n note \n): (\n is a literal new line).
I think it's fine for those to match, don't think you see much of those in the wild, if they are a problem, we can try adjusting the rules, or go back to have that small logic in the external scanner.
UPDATE: Did a benchmark, it's SLOWWWW, from 36 ms to 1264 ms, this is parsing a file with 10K lines full of comments. This is probably due to the conflicts that tree-sitter needs to resolve...
It's very common for users to want to match
TODO
in addition toTODO:
(ending in colon). To match that, users needed to match over all text nodes, since every word is a text node, that was slow. So the main improvement here is to not expose text nodes, and instead expose any uppercase word as a node. Unless you write all of your comments very angry in SCREAMING case, performance should improve.Changes
:
.TODO (something):
.(user)
node was renamed to(annotation)
#10
or!10
. If people want, I can introduce a way to match those things, maybe a(reference)
node/[@#!]([a-zA-Z0-9)/
.While this new approach makes the parser less flexible without the text nodes, it will make it faster, or at least faster to match, since the parsing itself is/was already fast (doing a match over text nodes is the slow operation). Another approach I was thinking is to parse the keywords directly, but that makes it even less flexible, since any new keywords will need to be added in the parser instead of a query.
Other notes
I tried moving everything to JS in https://github.com/stsewd/tree-sitter-comment/commit/f2318ba93dca031e57fec92caa60766d9219f4e5, but it was slow, due to the use of
conflicts
.Original comment:
Moving everything to JS means losing a little bit of control over the exact matching of the tag name, resulting in matches for:
TODO:nospace
(previously, it was needed to have a space after:
).TODO(\n note \n):
(\n
is a literal new line).I think it's fine for those to match, don't think you see much of those in the wild, if they are a problem, we can try adjusting the rules, or go back to have that small logic in the external scanner.
UPDATE: Did a benchmark, it's SLOWWWW, from 36 ms to 1264 ms, this is parsing a file with 10K lines full of comments. This is probably due to the conflicts that tree-sitter needs to resolve...