Closed romgrk closed 4 years ago
Hi, sorry for the delay in response. Nice work on this so far.
I would definitely be open to merging this. I think that I would prefer for the API to match the the query API in the WASM binding. Unfortunately, it's not really documented right now; the best way to understand it is to read the test file.
Summary:
Language.query(source)
returns a Query
.Query.matches
returns an array of matches (using ts_query_cursor_next_match
)Query.captures
returns an array of captures (using ts_query_cursor_next_capture
)TSQueryCursor
is not exposed directly. There is just a private global cursor that is used to implement Query.matches
and Query.captures
.Thoughts?
Also, I'm just curious if you tried using the wasm-based bindings (web-tree-sitter
) instead of this module for your project, and if so, what features of this module you found more useful.
Also, I'm just curious if you tried using the wasm-based bindings (
web-tree-sitter
) instead of this module for your project, and if so, what features of this module you found more useful.
I haven't tried the wasm version. I remember seeing somewhere that the performance of the wasm version was slower than the native one, thus why I'm interested in this one. If that claim is not applicable anymore I could try the wasm bindings.
I'll reply to the other points, I should have more time in a few days.
Hello,
I'm also very interested in the query API, and I will be happy to help!
I also use this module and not web-tree-sitter
for the performance reason mentioned here
for the performance reason mentioned here
I'm actually not sure that comment is still true (or ever was true for most use cases). It might be worth doing a quick performance comparison in your application.
Good to know! I will give a try. Thanks @maxbrunsfeld
Unfortunately, I was unable to load the Ruby Language using wasm. I opened an issue here about that.
Hey @maxbrunsfeld, sorry for the delay, there is never enough time to do everything :/
So about your comments, here are my thoughts:
The
TSQueryCursor
is not exposed directly. There is just a private global cursor that is used to implementQuery.matches
andQuery.captures
.
Sounds good to me, nothing is gained by exposing the query cursor to the users.
Query.matches
returns an array of matches (usingts_query_cursor_next_match
)Query.captures
returns an array of captures (usingts_query_cursor_next_capture
)
Sounds good as well, though I would still keep Query.exec(tree: TSTree, callback: (...) -> void)
. I like the .matches
and .captures
versions because it avoids crossing the JS/native boundary too much, but OTOH .exec
can avoid unnecessary JS object allocations if the matches are used right away.
For example, in my use-case the highlights are passed to a GtkTextView
so the JS objects are never kept around.
All this to say I'm not sure enough which style of API would be more efficient, and I would like to be able to have both possibilities as an user of node-tree-sitter
. Is it ok with you if we keep both styles?
Language.query(source)
returns aQuery.
On this one I think I'd prefer to keep new Query(Language, source)
. If I understand correctly, Language
is a different package. I'm not sure I see how we could practically modify the Language
to include .query
on it. This package exposes the Query
object, therefore it should be the one on which the Query
creation function should be exposed.
If it's ok with you I'd also bump the tree-sitter version to include https://github.com/tree-sitter/tree-sitter/pull/615 I'm not entirely sure I've understood correctly, but predicates are not handled by tree-sitter right? Do they need to be exposed to the user?
Let me know what you think on the above comments and I'll try to finish the PR in the next week.
On this one I think I'd prefer to keep
new Query(Language, source)
. If I understand correctly,Language
is a different package.
Yes, you are absolutely right!
And 👍 to all of your suggestions.
Predicates are not handled by tree-sitter right? Do they need to be exposed to the user?
That's right. Ideally, we would handle the eq?
and match?
predicates automatically, filtering out matches/captures as needed, but I think we can add that later.
So the way the node marshalling is setup, when sending a list of nodes from C++ to JS, C++ for each nodes returns either the nodeTypeID or the cached node object. However the queries API does something that doesn't happen otherwise: it can return the same node multiple times (if multiple pattern match). So C++ sends multiple times the same node (serialized in nodeTransferArray
), JS does the NodeClass instantiation, then tries to store this new instance in C++'s cache via tree._cacheNode(node)
, but then Tree complains because its assertion isn't respected:
assert(!tree->cached_nodes_.count(key)); // tree.cc : 230 -_-
There were a couple ways to solve it, I'm quite hesitant about my implementation, I'd like your input. In JS's unmarshalNodes
, I've introduced a cache that remembers the nodes that have been converted for this call of unmarshalNodes
. The cache is passed to unmarshalNode
, which extracts the node's id as a BigInt
and uses it as the cache key to avoid recrating the same node multiple times. It's all contained in this commit.
A few other solutions:
Local<Object> Tree::GetCachedNode(...)
that would return already cached nodes.Tree::CacheNode(...)
, let 'em be GC'ed eventually.Could you let me know what you think of it? I'll gladly rework this.
Other questions/comments:
if (node.id)
; in which case can node.id
be empty?@maxbrunsfeld I really hope this will get merged soon, it will open so many new doors!
There were a couple ways to solve it, I'm quite hesitant about my implementation, I'd like your input.
I think your solution is pretty good. It looks like BigInt
was available even in node 10.x, so I think it is fine to use.
Expose a
Local<Object> Tree::GetCachedNode(...)
that would return already cached nodes.
I am also open to this solution if you think it would be cleaner.
Do the Node instantiation in C++
I don't want to do this because instantiating objects in C++ is much slower than doing so in JavaScript. It essentially always goes through the slowest-possible path (a hash-table like interface, instead of using hidden classes).
Ignore duplicate nodes in Tree::CacheNode(...), let 'em be GC'ed eventually.
But how would you prevent the .matches
/.captures
methods from multiple different Node
instances that represent the same underlying node? I think we need to deal with duplicates explicitly in order to prevent this.
I saw multiple checks for
if (node.id)
; in which case can node.id be empty?
This means that Tree-sitter API failed to return a node. For example, if you call ts_node_child
on a node with no children, it will return a "null" node whose id
is NULL
.
About predicates, I saw this, I think I'd like to use a similar approach and handle them in JS
That sounds great!
Ok, well if the BigInt
solution is good enough I'll leave that part as it is for now.
I don't want to do this because instantiating objects in C++ is much slower than doing so in JavaScript. It essentially always goes through the slowest-possible path (a hash-table like interface, instead of using hidden classes).
Good you mention it, I had no idea. And there's no way to hint V8 to use a hidden class? Eg
class Hint {
field1 = 0;
field2 = 0;
}
bindings.setHintFunction(Hint);
auto instance = Nan::NewInstance(hintClass);
Nan::Set(instance, Nan::New("field1").ToLocalChecked(), Nan::New(42));
I'm asking because currently Querry::Matches
and Query::Captures
do object instantiation in C++, which I wrongly assumed would use hidden classes. If not, I see 2 alternatives to marshal matches & captures:
Any preference?
I've added the predicates matching by copying most of the code from the WASM implementation, everything seems to be working fine.
Remaining tasks:
After getting a feel for the API, I've removed Query.exec
because I don't feel like I'd use it or recommend its usage.
Everything else has been implemented, note that I took the tests from the WASM bindings so everything is as similar as possible.
The only remaining issue is the matches & captures marshalling. I've removed the object creation in C++ and replaced it with arrays as it was the easiest solution. I've done some testing and it does shave off some time. Obviously the cost of creating and GC'ing a lot of temporary arrays is always there (and not visible in my tests), so let me know if you'd rather have me transfer everything in a buffer I'll make time to implement it.
testing with: react-dom.min.js
| Method | parse+query+captures |
|-----------|----------------------|
| JS Object | 742.284ms |
| JS Array | 652.561ms |
Edit: I've taken the liberty to include a fix for #66 which is a leak I think.
Ok, so I've done the changes, everything should be good now.
I've kept the transfer in a flattened JS array instead of a C buffer. Two minor optimizations, first when unmarshalNodes
is called I've added Tree::CacheNodes
to cache multiple nodes at once rather than call Tree::CacheNode
multiple times in a row.
I've ran some tests (parsing react-dom.min.js
), this removed about 20ms (on 145ms) from unmarshalNodes
.
Second optimization is that I don't pass capture_count
in the buffer anymore, so the transfer array looks like:
// first match
5, // pattern index
"foo",
"bar,
// second match
7, // pattern index
"baz,
]
And the captures are accumulated with
while (typeof returnedMatches[i] === 'string') // ...
Didn't expect it to affect the result that much but this gives a ~9% gain (for Query.prototype.captures
):
with-capture-count: 470.35ms (+- 63.07ms)
without-capture-count: 430.4485 (+- 59.45ms) -9.2%
I've been waiting for this to be merged for about a month now, is there anything holding it up? In the meantime I've tried using @romgrk 's fork directly, but run into some difficulties building (not too familiar with node-gyp, and make
output). It would be great to have queries integrated into the published package.
Ah, sorry about the delay. I missed that last comment going by. Thanks for the PR @romgrk.
This is published in 0.16.2.
Can I use this from web-tree-sitter somehow? Doesn't seem like I can update that one, might be missing something here.
This is for the Node tree-sitter binding. The query API already exists in web-tree-sitter.
Thanks for the feedback & help @maxbrunsfeld!
Hey,
I need access to the query API for a project, here is a draft implementation of it. Are you interesting in including it in the main repo?
Thanks.
Fixes #56