Open mjambon opened 3 years ago
This is similar to https://github.com/tree-sitter/tree-sitter/issues/711.
I think it would be a good feature, but it seems like it would require a deep investigation of how to best implement it. Ideally, we could do it without major changes to the parser binary format.
I wonder if it would be possible implement something where you provide a list of valid nodes, and Tree-sitter will return all the possible trees for those nodes that parse without error.
Thanks Max. For what it's worth, being able to select the entry point(s) at runtime sounds like a good idea to me.
A workaround is to turn a semgrep pattern
$X == 42
into__SEMGREP_EXPRESSION $X == 42
before parsing it
Could you just turn it into int x() { $X == 42; }
, so that you don't need to modify the grammar at all? You could just parse that using the existing parser, and then extract out the piece of the CST corresponding to the expression.
@maxbrunsfeld good idea!
I wonder if it'd even be worthwhile to transform $X
into something that's already known to be syntactically valid, like SEMGREP_PLACEHOLDER_X
, and transform ...
into SEMGREP_ELLIPSIS
, so that you could just use the stock parsers. It doesn't seem like the post-processing of the returned tree would be too awkward to handle. Just a random idea.
I guess the problem with that is, you basically need to parse the pattern just to find the $X
and the ...
.
Yes, postprocessing the tree is straighforward. It's the initial replacement of ...
and $X
that would be hard to do correctly.
I'm encountering this problem while trying to write injection rules for quasiquote string interpolation in Haskell.
The code looks like this:
[text|foo #{some expression :: Int} bar|]
where text
calls a custom quoter that processes the body, treating parts between #{
and }
as expressions, while to Haskell, this part is just a string.
I've created an ad-hoc grammar that I delegate to with an injection, which parses the delimiters and then injects the code inside back to Haskell.
The problem is that the code isn't a top level declaration, but an expression. If I could specify the rule to use for the parser to begin when processing the injection, this would work.
I'm wondering if there's a way or a plan to support multiple entry points to a grammar. The following is just for context. We have a workaround.
Use case
We're just starting to use tree-sitter grammars to parse semgrep patterns. A pattern is a code snippet written in the target programming language, with extra constructs for pattern matching (
...
and$FOO
). For example, we may want to parse the following snippet as an expression:but the default entry point of the grammar doesn't accept naked expressions.
Problem
We're extending each language's grammar to support semgrep patterns, but since the generated code is large, we don't want to generate a new
parser.c
for each entry point that we want to support. A solution would be to turn the entry point into something likechoice($.program, $.expression)
but this results in conflicts that may be silent or resolved incorrectly, leading to errors. A simpler solution is for our code to try one parser and then another e.g.parse_program
and thenparse_expression
ifparse_program
fails.Here's the semgrep PR where the issue was raised: https://github.com/returntocorp/semgrep-grammars/pull/11#issuecomment-754867449
Workaround
A workaround is to turn a semgrep pattern
$X == 42
into__SEMGREP_EXPRESSION $X == 42
before parsing it, and to use an extended tree-sitter grammar that supports this unambiguous construct. This is equivalent to prepending a magic number to the input.