Multiple entry points? - Githubissues

mjambon commented 3 years ago

I'm wondering if there's a way or a plan to support multiple entry points to a grammar. The following is just for context. We have a workaround.

Use case

We're just starting to use tree-sitter grammars to parse semgrep patterns. A pattern is a code snippet written in the target programming language, with extra constructs for pattern matching (... and $FOO). For example, we may want to parse the following snippet as an expression:

$X == 42

but the default entry point of the grammar doesn't accept naked expressions.

Problem

We're extending each language's grammar to support semgrep patterns, but since the generated code is large, we don't want to generate a new parser.c for each entry point that we want to support. A solution would be to turn the entry point into something like choice($.program, $.expression) but this results in conflicts that may be silent or resolved incorrectly, leading to errors. A simpler solution is for our code to try one parser and then another e.g. parse_program and then parse_expression if parse_program fails.

Here's the semgrep PR where the issue was raised: https://github.com/returntocorp/semgrep-grammars/pull/11#issuecomment-754867449

Workaround

A workaround is to turn a semgrep pattern $X == 42 into __SEMGREP_EXPRESSION $X == 42 before parsing it, and to use an extended tree-sitter grammar that supports this unambiguous construct. This is equivalent to prepending a magic number to the input.

maxbrunsfeld commented 3 years ago

I think it would be a good feature, but it seems like it would require a deep investigation of how to best implement it. Ideally, we could do it without major changes to the parser binary format.

I wonder if it would be possible implement something where you provide a list of valid nodes, and Tree-sitter will return all the possible trees for those nodes that parse without error.

mjambon commented 3 years ago

Thanks Max. For what it's worth, being able to select the entry point(s) at runtime sounds like a good idea to me.

maxbrunsfeld commented 3 years ago

A workaround is to turn a semgrep pattern $X == 42 into __SEMGREP_EXPRESSION $X == 42 before parsing it

Could you just turn it into int x() { $X == 42; }, so that you don't need to modify the grammar at all? You could just parse that using the existing parser, and then extract out the piece of the CST corresponding to the expression.

mjambon commented 3 years ago

@maxbrunsfeld good idea!

maxbrunsfeld commented 3 years ago

I wonder if it'd even be worthwhile to transform $X into something that's already known to be syntactically valid, like SEMGREP_PLACEHOLDER_X, and transform ... into SEMGREP_ELLIPSIS, so that you could just use the stock parsers. It doesn't seem like the post-processing of the returned tree would be too awkward to handle. Just a random idea.

I guess the problem with that is, you basically need to parse the pattern just to find the $X and the ....

mjambon commented 3 years ago

Yes, postprocessing the tree is straighforward. It's the initial replacement of ... and $X that would be hard to do correctly.

tek commented 3 years ago

I'm encountering this problem while trying to write injection rules for quasiquote string interpolation in Haskell.

The code looks like this:

[text|foo #{some expression :: Int} bar|]

where text calls a custom quoter that processes the body, treating parts between #{ and } as expressions, while to Haskell, this part is just a string. I've created an ad-hoc grammar that I delegate to with an injection, which parses the delimiters and then injects the code inside back to Haskell. The problem is that the code isn't a top level declaration, but an expression. If I could specify the rule to use for the parser to begin when processing the injection, this would work.

Fasa22 commented 2 months ago

https://github.com/tree-sitter/tree-sitter/issues/3574#issue-2499514753

tree-sitter / tree-sitter

Multiple entry points? #870

Use case

Problem

Workaround