Thom1729 opened 4 years ago
I've run into another major use case for this today. In the JavaScript syntax, we try to identify when the user is assigning an anonymous function to a variable, field, property, etc. A key reusable “moving part” is a context that matches an expression, but fails if the expression begins with a function literal. However, this context would want to be used by multiple branch points (off the top of my head, for assignment expressions, variable initializations, object keys, and class fields). Without that capability, it would require a great deal of copy-pasting to make all of the cases work.
Things like that are what make the current attempt to rewrite the Java syntax such a huge undertaking. Nearly every kind of (un-)qualified identifier needs a copy-pasted set of contexts which all look the same but must be unique due to the bound branch point.
At the moment, branches reduce reusability a lot. It gets trickier the more contexts are pushed before a failure can be detected.
I've run into a sort-of-related issue. The `fail` and `pop` keys are exclusive, and if `fail` is present then `pop` will be ignored.
This is a problem for context reuse. I would like to have one copy of the TypeScript expression contexts. Normally, `expression-begin` should `pop` on an unexpected token so that parsing will recover. However, if we're attempting to parse function call type arguments, any unexpected token should fail the branch. It would be easiest to do this with a single `expression-begin` context with both `fail` and `pop` at the end:
```yaml
- match: (?=\S)
  fail: ts-function-type-arguments
  pop: true
```
Because `fail` and `pop` are exclusive, this doesn't work; the `pop` is ignored, so in the normal case unexpected tokens will be skipped over and the parser will be stuck in the `expression-begin` context.
The only obvious workaround would be to duplicate the entire set of expression-related contexts, which would be a mess: the same mess as duplicating contexts to `fail` different `branch_point`s. (This is why I think it's sort-of-related.)
It isn't clear to me how you'd expect `fail` and `pop` to work together. `fail` rewinds the lexer to the last branch point with that name and restores the context stack to that point, whereas `pop` just modifies the stack.
The intent is that it should attempt to `fail`, but if that `branch_point` is not on the stack, then it should `pop`.
This is actually a bit more nuanced, because there are more failure modes than the branch point not being on the stack. Right now `fail` can work even when it is not on the stack, but there is also the situation where all existing branch options have been exhausted.
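For reference, the mechanism under discussion looks roughly like this (a minimal hypothetical example, not taken from the syntaxes above): a rule declares a `branch_point` and a list of branches to try in order, and a later `fail` rewinds the lexer to that point and tries the next branch.

```yaml
contexts:
  main:
    - match: (?=\w)
      branch_point: identifier
      branch: [function-call, plain-identifier]

  function-call:
    - match: \w+(?=\()
      scope: variable.function
      pop: true
    # Not a call after all: rewind to the branch point
    # and try the next branch.
    - match: (?=\S)
      fail: identifier

  plain-identifier:
    - match: \w+
      scope: variable.other
      pop: true
```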
While I agree that branching limits context reuse here and there, I would vote against adding more complexity or heuristic behavior. It is already hard enough to track all the possible states and paths the lexer can take when using branches. In the case of the `(?=\S)` example, it should be fairly easy to create two named contexts, `arguments-or-fail` and `arguments-or-pop`, which can be used in the appropriate situations. It may just need some more, smaller named contexts to build those from.
In this case, that would require duplicating the entire chunk of contexts involving type expressions. Re-using `ts-type-expression-begin` is the easy part: you could have a base context with most of the rules, and two contexts including the base with different "else" behavior.

But the contexts that push `ts-type-expression-begin`, such as `ts-type-expression-end`, would have to be copy-pasted with each rule tweaked to refer to the appropriate variant of `ts-type-expression-begin`. We'd also have to copy-paste and tweak any contexts "within" `ts-type-expression-begin` that might push a type expression context, such as parenthesized type expressions, function type expressions, object type expressions, and tuple type expressions. Only the simplest contexts could be reused, such as `ts-type-basic` and `ts-type-special`. In practice, about a quarter of the definition would have to be copy-pasted and slightly tweaked, then maintained in sync going forward.
Not impossible by any means, but also far from ideal.
Another example from the JavaScript syntax is arrow function detection.
Arrow function formal parameters may consist either of a single bare identifier or of a parenthesized list. These require slightly different handling (especially in TypeScript), so they are handled by separate branch rules, and consequently there is a fair bit of context duplication. If the two rules could share a `branch_point`, or if a rule could fail multiple `branch_point`s, then the duplication would be reduced.
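Under the proposed sharing, the two arrow-function branch rules might look something like this (hypothetical rule and context names; `branch_point` names must currently be unique, so this is not valid today):

```yaml
# Both rules share the branch_point `arrow-function`, so the
# branch targets could fail a single name regardless of which
# rule started the branch.
- match: (?=[_$[:alpha:]])
  branch_point: arrow-function
  branch: [arrow-bare-parameter, expression-identifier]
- match: (?=\()
  branch_point: arrow-function
  branch: [arrow-parameter-list, parenthesized-expression]
```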
https://github.com/sublimehq/Packages/issues/2267 poses further issues. In order to determine whether an identifier should get an extra `entity.name.function` scope, we have to look several tokens ahead; the only reliable solution would be branching. In essence, we'd want to speculatively parse the assigned expression as a non-function and fail if it turns out to be a function. In a perfect world, we could reuse some of the same code that already handles arrow functions, but because of the `branch_point` limitations, this is not possible.
But in this case, the duplication would be even worse. There are several kinds of identifier that could get the `entity.name.function` scope:

In each case, the identifier itself needs to be scoped differently, and since this necessarily happens inside the branch, each case would require its own `branch_point`, meaning that most of the contexts inside the branch would have to be duplicated five ways, each `fail`ing a different `branch_point`. The logic might require on the order of five contexts, which multiplied by five would be an unacceptable degree of code duplication.
As an aside, there may be some reason to prefer re-using a single `branch_point` across multiple rules, as opposed to a single rule `fail`ing multiple `branch_point`s. That reason is syntax extension.
If you have a set of contexts used by multiple branch rules, with `fail` rules that name each `branch_point`, and you need to reuse those contexts in an extension, then you would need to override those contexts to add another `branch_point` to each of the `fail` rules. If two extensions did the same thing, then one of them would break. On the other hand, if the `branch_point`s themselves were reusable, then you would not need to override the `fail` rules at all, and the extension would be much simpler and more compatible.
The issue with reusing branch names is having multiple rules in a context with the same name. A name resolves to an id. When lexing, we track the id that led to the current lexer state, and use that id when backtracking, etc.
It would be fine if a `branch_point` couldn't be used twice within the same context; this could be easily worked around with only a small bit of extra code. (I also think that it would be fairly uncommon to need this; of the practical examples I have in mind, the only one this would matter for is arrow function detection in `expression-begin`, and even with an extra context to work around that limitation it would still be simpler on net.)
The only downside I can see is that if multiple extensions tried to prepend rules to the same base context with the same `branch_point`, then this would produce an error, but that could easily be worked around by either of the extensions.
I can see how allowing multiple rules in a single context to use the same `branch_point` might require more internal changes than the overall feature is worth. But I think it should be safe to leave that particular restriction in place, since it could be worked around without much trouble.
It really needs to be the case that a given branch id always deals with the same set of targets. If there are multiple rules with different targets that share the same id, the implementation would break in funny ways.
I think implementation-wise it would probably be easier to implement named target sets, maybe named fail sets, and to allow inheritance to modify those. I think it would also be easier for people to reason about.
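One possible shape for such named sets (purely hypothetical syntax, sketched only to illustrate the idea; nothing like this exists in the format today):

```yaml
# A branch set is defined once and referenced by name, so an
# extension could override the set in one place without touching
# any fail rules.
branch_sets:
  async-member: [prefixed-method, class-field-or-method]

contexts:
  class-body:
    - match: async\b
      branch_point: async-member   # refers to the named set
```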
Problem description
As a motivating example, consider the following (taken from the core JavaScript syntax and simplified):
(In the original syntax, some of the rules and contexts are longer and handle more cases.)
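The example itself is not reproduced here; a hedged sketch of its shape, with hypothetical context and rule names reconstructed from the description below, might look like:

```yaml
contexts:
  class-body:
    - match: async\b
      branch_point: class-async
      branch: [prefixed-method, class-field-or-method]

  object-literal-body:
    - match: async\b
      branch_point: object-async
      branch: [prefixed-object-literal-method, object-key-value]

  # Identical to prefixed-object-literal-method except for the
  # branch_point it fails: the duplication this issue is about.
  prefixed-method:
    - match: \w+(?=\s*\()
      scope: entity.name.function
      pop: true
    - match: (?=\S)
      fail: class-async

  prefixed-object-literal-method:
    - match: \w+(?=\s*\()
      scope: entity.name.function
      pop: true
    - match: (?=\S)
      fail: object-async
```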
In either a class body or an object literal, the token `async` usually means an async method, but not always. The "success" case is the same in either place, but the "failure" cases differ. In a class body, it falls back to parsing a class field or method, and in an object literal it falls back to parsing a key/value pair or method.

Because the success cases are the same, the `prefixed-method` and `prefixed-object-literal-method` contexts are absolutely identical except that they `fail` different `branch_point`s. It would save 18 lines of duplicated code in the syntax definition if both of the branch rules could share a single `prefixed-method` context. However, this is impossible because `branch_point`s must be unique, so there is no alternative but to copy-paste the `prefixed-method` context.

This is the first time I've run into this limitation, but seeing the shape of it, it seems likely that it will generally be difficult to reuse contexts in different branches, even when the contexts are otherwise identical.
Preferred solution
Allow `fail` to specify a sequence of branch names, as in the following example:

When the `fail` is encountered, the parser should fail the topmost branch on the stack that matches any of the specified branch points. This would make it much easier to reuse branch success contexts.
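The kind of rule being proposed might look like this (hypothetical syntax and branch point names; `fail` currently accepts only a single name):

```yaml
prefixed-method:
  - match: \w+(?=\s*\()
    scope: entity.name.function
    pop: true
  - match: (?=\S)
    # Fail whichever of these branch points is topmost on the stack.
    fail: [class-async, object-async]
```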
Alternatives
It would also suffice if multiple rules could use the same `branch_point` name. This might arguably be simpler from an authoring standpoint. However, it may be complex or fiddly to implement in the engine.