tree-sitter / py-tree-sitter

Python bindings to the Tree-sitter parsing library
https://tree-sitter.github.io/py-tree-sitter/
MIT License
800 stars 95 forks source link

Query doesn't work correctly when using read_callable #169

Closed MrPrezident closed 4 days ago

MrPrezident commented 10 months ago

The read_callable feature doesn't seem to be working correctly. When I try to use it, the queries get messed up. Here is an example.

import tree_sitter

tree_sitter.Language.build_library(
  # Store the library in the `build` directory
  'build/my-languages.so',

  # Include one or more languages
  [
    'vendor/tree-sitter-python'
  ]
)

# Initialize parser and load Python grammar
parser = tree_sitter.Parser()
python_language = tree_sitter.Language('build/my-languages.so', "python")
parser.set_language(python_language)

# Initialize test query (match identifier with all caps)
query_text = '((identifier) @test (#match? @test "^[A-Z]$"))'
python_query = python_language.query(query_text)

# Original Source Code
orig_src  = b'def myfunc():\n    pass\n'

# Parse original tree and query
orig_tree = parser.parse(orig_src)
orig_captures = python_query.captures(orig_tree.root_node)

# Print results
print(f"orig_sexp:{orig_tree.root_node.sexp()}")
print(f"orig_text:{orig_tree.root_node.text}")
print(f"orig_captures:{orig_captures}")

# Edit Source Code (insert newline)
# new_src = b'def myfunc():\n\n    pass\n'
new_src = orig_src[:14] + b'\n' + orig_src[14:]

# Edit Tree
orig_tree.edit(start_byte=14, old_end_byte=14, new_end_byte=15,
               start_point=(1,0), old_end_point=(1,0), new_end_point=(1,1))

# Define parse callback to read source code
def read_callable_bytes(byte_offset, point):
    byte = new_src[byte_offset:byte_offset + 1]
    return byte

# Parse changes with full source code
new1_tree = parser.parse(new_src, orig_tree)
# Parse changes using callback
new2_tree = parser.parse(read_callable_bytes, orig_tree)

# Query
new1_captures = python_query.captures(new1_tree.root_node)
new2_captures = python_query.captures(new2_tree.root_node)

# Print results
print(f"new1_sexp:{new1_tree.root_node.sexp()}")
print(f"new1_text:{new1_tree.root_node.text}")
print(f"new1_captures:{new1_captures}")
print(f"new2_sexp:{new2_tree.root_node.sexp()}")
print(f"new2_text:{new2_tree.root_node.text}")
print(f"new2_captures:{new2_captures}")

Here is the output. The expectation is that there will be no matches on any of the queries. Notice that the new2_captures matched "myfunc" even though it is not capitalized. This only happens when using the read_callable callback.

orig_sexp:(module (function_definition name: (identifier) parameters: (parameters) body: (block (pass_statement))))
orig_text:b'def myfunc():\n    pass\n'
orig_captures:[]
new1_sexp:(module (function_definition name: (identifier) parameters: (parameters) body: (block (pass_statement))))
new1_text:b'def myfunc():\n\n    pass\n'
new1_captures:[]
new2_sexp:(module (function_definition name: (identifier) parameters: (parameters) body: (block (pass_statement))))
new2_text:None
new2_captures:[(<Node type=identifier, start_point=(0, 4), end_point=(0, 10)>, 'test')]
MrPrezident commented 10 months ago

Actually no need to do the tree edit. I don't know why I included that in the example. Here is a simpler example showing an incorrect match:

# Initialize test query (match identifier with all caps)
query_text = '((identifier) @test (#match? @test "^[A-Z]$"))'
python_query = python_language.query(query_text)

# Original Source Code
orig_src  = b'def myfunc():\n    pass\n'

def read_callable_bytes(byte_offset, point):
    byte = orig_src[byte_offset:byte_offset + 1]
    return byte

# Parse original tree and query
orig_tree = parser.parse(read_callable_bytes)
orig_captures = python_query.captures(orig_tree.root_node)

# Print results
print(f"orig_sexp:{orig_tree.root_node.sexp()}")
print(f"orig_text:{orig_tree.root_node.text}")
print(f"orig_captures:{orig_captures}")

Output:

orig_sexp:(module (function_definition name: (identifier) parameters: (parameters) body: (block (pass_statement))))
orig_text:None
orig_captures:[(<Node type=identifier, start_point=(0, 4), end_point=(0, 10)>, 'test')]
amaanq commented 10 months ago

In this case - a read callable does not have the ability to index/remember the source input, so a query match predicate can't execute on that

MrPrezident commented 10 months ago

Is there a fundamental reason why read callable is not used for query? It seems inefficient to have to pass in the entire source code on every keystroke.

MrPrezident commented 10 months ago

Also, can you point me to where the code is for handling the match predicate?

amaanq commented 10 months ago

Is there a fundamental reason why read callable is not used for query? It seems inefficient to have to pass in the entire source code on every keystroke.

You cannot get the text at a specific point for the match to execute unlike using a buffer, sure, some read callbacks might be able to, but it's not a guarantee hence why it's omitted

Also, can you point me to where the code is for handling the match predicate?

https://github.com/tree-sitter/py-tree-sitter/blob/da6716b8665bef4c530ee82007e77c45e363a9db/tree_sitter/binding.c#L1883

MrPrezident commented 10 months ago

@amaanq let me know what you think of this PR. I tested this out and it works great for me. I don't see why this wouldn't work for any type of read callback.

MrPrezident commented 9 months ago

@amaanq any love or feedback on this PR?

amaanq commented 9 months ago

Hey @MrPrezident I believe upstream (Rust/Wasm) doesn't keep text from sources that use a "read_callable" as well, so I'd like to implement it there first before here - I'm also a bit concerned about the performance cost about keeping the text in certain cases when conditionally switching the input based on external programmatic sources (e.g. a test in Rust will run forever until a timeout is reached - wouldn't that be expensive to keep? https://github.com/tree-sitter/tree-sitter/blob/6019b7c84cd5a7f580d57bb981315afd9afec826/cli/src/tests/parser_test.rs#L676

MrPrezident commented 9 months ago

I believe upstream (Rust/Wasm) doesn't keep text from sources that use a "read_callable" as well, so I'd like to implement it there first before here

Hmm.. ok. I can try to look into that.

I'm also a bit concerned about the performance cost about keeping the text in certain cases when conditionally switching the input based on external programmatic sources

but it's not keeping the text, it's just keeping the callback, or am I missing something?

amaanq commented 9 months ago

Well it collects the bytes from the callable right, that's what I meant sorry for not clarifying, like in that timeout test it's possible that it might accumulate a lot of bytes before the timeout is hit.

Maybe a flag in core to enable keeping the source text when using a callable would be nice

MrPrezident commented 9 months ago

Yeah but that test you are talking about is doing parsing, not querying. I didn't make any changes that would affect what happens to the callback during parsing.