tree-sitter / node-tree-sitter

Node.js bindings for tree-sitter
https://www.npmjs.com/package/tree-sitter
MIT License
649 stars 114 forks source link

Implement Query API #62

Closed romgrk closed 4 years ago

romgrk commented 4 years ago

Hey,

I need access to the query API for a project, here is a draft implementation of it. Are you interesting in including it in the main repo?

Thanks.

Fixes #56

maxbrunsfeld commented 4 years ago

Hi, sorry for the delay in response. Nice work on this so far.

I would definitely be open to merging this. I think that I would prefer for the API to match the the query API in the WASM binding. Unfortunately, it's not really documented right now; the best way to understand it is to read the test file.

Summary:

  1. Language.query(source) returns a Query.
  2. Query.matches returns an array of matches (using ts_query_cursor_next_match)
  3. Query.captures returns an array of captures (using ts_query_cursor_next_capture)
  4. The TSQueryCursor is not exposed directly. There is just a private global cursor that is used to implement Query.matches and Query.captures.

Thoughts?

maxbrunsfeld commented 4 years ago

Also, I'm just curious if you tried using the wasm-based bindings (web-tree-sitter) instead of this module for your project, and if so, what features of this module you found more useful.

romgrk commented 4 years ago

Also, I'm just curious if you tried using the wasm-based bindings (web-tree-sitter) instead of this module for your project, and if so, what features of this module you found more useful.

I haven't tried the wasm version. I remember seeing somewhere that the performance of the wasm version was slower than the native one, thus why I'm interested in this one. If that claim is not applicable anymore I could try the wasm bindings.

I'll reply to the other points, I should have more time in a few days.

thibaultdalban commented 4 years ago

Hello,

I'm also very interested in the query API, and I will be happy to help! I also use this module and not web-tree-sitter for the performance reason mentioned here

maxbrunsfeld commented 4 years ago

for the performance reason mentioned here

I'm actually not sure that comment is still true (or ever was true for most use cases). It might be worth doing a quick performance comparison in your application.

thibaultdalban commented 4 years ago

Good to know! I will give a try. Thanks @maxbrunsfeld

thibaultdalban commented 4 years ago

Unfortunately, I was unable to load the Ruby Language using wasm. I opened an issue here about that.

romgrk commented 4 years ago

Hey @maxbrunsfeld, sorry for the delay, there is never enough time to do everything :/

So about your comments, here are my thoughts:

QueryCursor

The TSQueryCursor is not exposed directly. There is just a private global cursor that is used to implement Query.matches and Query.captures.

Sounds good to me, nothing is gained by exposing the query cursor to the users.

.matches and .captures

Query.matches returns an array of matches (using ts_query_cursor_next_match) Query.captures returns an array of captures (using ts_query_cursor_next_capture)

Sounds good as well, though I would still keep Query.exec(tree: TSTree, callback: (...) -> void). I like the .matches and .captures versions because it avoids crossing the JS/native boundary too much, but OTOH .exec can avoid unnecessary JS object allocations if the matches are used right away. For example, in my use-case the highlights are passed to a GtkTextView so the JS objects are never kept around. All this to say I'm not sure enough which style of API would be more efficient, and I would like to be able to have both possibilities as an user of node-tree-sitter. Is it ok with you if we keep both styles?

Query API

Language.query(source) returns a Query.

On this one I think I'd prefer to keep new Query(Language, source). If I understand correctly, Language is a different package. I'm not sure I see how we could practically modify the Language to include .query on it. This package exposes the Query object, therefore it should be the one on which the Query creation function should be exposed.

Question: predicates

If it's ok with you I'd also bump the tree-sitter version to include https://github.com/tree-sitter/tree-sitter/pull/615 I'm not entirely sure I've understood correctly, but predicates are not handled by tree-sitter right? Do they need to be exposed to the user?

Let me know what you think on the above comments and I'll try to finish the PR in the next week.

maxbrunsfeld commented 4 years ago

On this one I think I'd prefer to keep new Query(Language, source). If I understand correctly, Language is a different package.

Yes, you are absolutely right!

And 👍 to all of your suggestions.

Predicates are not handled by tree-sitter right? Do they need to be exposed to the user?

That's right. Ideally, we would handle the eq? and match? predicates automatically, filtering out matches/captures as needed, but I think we can add that later.

romgrk commented 4 years ago

So the way the node marshalling is setup, when sending a list of nodes from C++ to JS, C++ for each nodes returns either the nodeTypeID or the cached node object. However the queries API does something that doesn't happen otherwise: it can return the same node multiple times (if multiple pattern match). So C++ sends multiple times the same node (serialized in nodeTransferArray), JS does the NodeClass instantiation, then tries to store this new instance in C++'s cache via tree._cacheNode(node), but then Tree complains because its assertion isn't respected:

assert(!tree->cached_nodes_.count(key)); // tree.cc : 230  -_-

There were a couple ways to solve it, I'm quite hesitant about my implementation, I'd like your input. In JS's unmarshalNodes, I've introduced a cache that remembers the nodes that have been converted for this call of unmarshalNodes. The cache is passed to unmarshalNode, which extracts the node's id as a BigInt and uses it as the cache key to avoid recrating the same node multiple times. It's all contained in this commit.

A few other solutions:

Could you let me know what you think of it? I'll gladly rework this.

Other questions/comments:

gpetrov commented 4 years ago

@maxbrunsfeld I really hope this will get merged soon, it will open so many new doors!

maxbrunsfeld commented 4 years ago

There were a couple ways to solve it, I'm quite hesitant about my implementation, I'd like your input.

I think your solution is pretty good. It looks like BigInt was available even in node 10.x, so I think it is fine to use.

Expose a Local<Object> Tree::GetCachedNode(...) that would return already cached nodes.

I am also open to this solution if you think it would be cleaner.

Do the Node instantiation in C++

I don't want to do this because instantiating objects in C++ is much slower than doing so in JavaScript. It essentially always goes through the slowest-possible path (a hash-table like interface, instead of using hidden classes).

Ignore duplicate nodes in Tree::CacheNode(...), let 'em be GC'ed eventually.

But how would you prevent the .matches/.captures methods from multiple different Node instances that represent the same underlying node? I think we need to deal with duplicates explicitly in order to prevent this.

I saw multiple checks for if (node.id); in which case can node.id be empty?

This means that Tree-sitter API failed to return a node. For example, if you call ts_node_child on a node with no children, it will return a "null" node whose id is NULL.

About predicates, I saw this, I think I'd like to use a similar approach and handle them in JS

That sounds great!

romgrk commented 4 years ago

Ok, well if the BigInt solution is good enough I'll leave that part as it is for now.

I don't want to do this because instantiating objects in C++ is much slower than doing so in JavaScript. It essentially always goes through the slowest-possible path (a hash-table like interface, instead of using hidden classes).

Good you mention it, I had no idea. And there's no way to hint V8 to use a hidden class? Eg

class Hint {
  field1 = 0;
  field2 = 0;
}
bindings.setHintFunction(Hint);
auto instance = Nan::NewInstance(hintClass);
Nan::Set(instance, Nan::New("field1").ToLocalChecked(), Nan::New(42));

I'm asking because currently Querry::Matches and Query::Captures do object instantiation in C++, which I wrongly assumed would use hidden classes. If not, I see 2 alternatives to marshal matches & captures:

Any preference?

I've added the predicates matching by copying most of the code from the WASM implementation, everything seems to be working fine.

Remaining tasks:

romgrk commented 4 years ago

After getting a feel for the API, I've removed Query.exec because I don't feel like I'd use it or recommend its usage.

Everything else has been implemented, note that I took the tests from the WASM bindings so everything is as similar as possible.

The only remaining issue is the matches & captures marshalling. I've removed the object creation in C++ and replaced it with arrays as it was the easiest solution. I've done some testing and it does shave off some time. Obviously the cost of creating and GC'ing a lot of temporary arrays is always there (and not visible in my tests), so let me know if you'd rather have me transfer everything in a buffer I'll make time to implement it.

testing with: react-dom.min.js

| Method    | parse+query+captures |
|-----------|----------------------|
| JS Object |            742.284ms |
| JS Array  |            652.561ms |

Edit: I've taken the liberty to include a fix for #66 which is a leak I think.

romgrk commented 4 years ago

Ok, so I've done the changes, everything should be good now.

I've kept the transfer in a flattened JS array instead of a C buffer. Two minor optimizations, first when unmarshalNodes is called I've added Tree::CacheNodes to cache multiple nodes at once rather than call Tree::CacheNode multiple times in a row.

I've ran some tests (parsing react-dom.min.js), this removed about 20ms (on 145ms) from unmarshalNodes.

Second optimization is that I don't pass capture_count in the buffer anymore, so the transfer array looks like:

  // first match
  5, // pattern index
  "foo",
  "bar,

  // second match
  7, // pattern index
  "baz,
]

And the captures are accumulated with

while (typeof returnedMatches[i] === 'string') // ...

Didn't expect it to affect the result that much but this gives a ~9% gain (for Query.prototype.captures):

with-capture-count:    470.35ms (+- 63.07ms)
without-capture-count: 430.4485 (+- 59.45ms) -9.2%
smackesey commented 4 years ago

I've been waiting for this to be merged for about a month now, is there anything holding it up? In the meantime I've tried using @romgrk 's fork directly, but run into some difficulties building (not too familiar with node-gyp, and make output). It would be great to have queries integrated into the published package.

maxbrunsfeld commented 4 years ago

Ah, sorry about the delay. I missed that last comment going by. Thanks for the PR @romgrk.

maxbrunsfeld commented 4 years ago

This is published in 0.16.2.

razzeee commented 4 years ago

Can I use this from web-tree-sitter somehow? Doesn't seem like I can update that one, might be missing something here.

maxbrunsfeld commented 4 years ago

This is for the Node tree-sitter binding. The query API already exists in web-tree-sitter.

romgrk commented 4 years ago

Thanks for the feedback & help @maxbrunsfeld!