Thom1729 opened 4 years ago
I've run into another major use case for this today. In the JavaScript syntax, we try to identify when the user is assigning an anonymous function to a variable, field, property, etc. A key reusable “moving part” is a context that matches an expression, but fails if the expression begins with a function literal. However, this context would want to be used by multiple branch points (off the top of my head, for assignment expressions, variable initializations, object keys, and class fields). Without that capability, it would require a great deal of copy-pasting to make all of the cases work.
Things like that are what make the current attempt to rewrite the Java syntax such a huge undertaking. Nearly every kind of (un-)qualified identifier needs a copy-pasted set of contexts which all look the same but must be unique due to the bound branch point.
At the moment, branches reduce reusability a lot. It gets trickier the more contexts are pushed before a failure can be detected.
I've run into a sort-of-related issue. The `fail` and `pop` keys are exclusive, and if `fail` is present then `pop` will be ignored.
This is a problem for context reuse. I would like to have one copy of the TypeScript expression contexts. Normally, `expression-begin` should `pop` on an unexpected token so that parsing will recover. However, if we're attempting to parse function call type arguments, any unexpected token should fail the branch. It would be easiest to do this with a single `expression-begin` context with both `fail` and `pop` at the end:
```yaml
- match: (?=\S)
  fail: ts-function-type-arguments
  pop: true
```
Because `fail` and `pop` are exclusive, this doesn't work; the `pop` is ignored, so in the normal case unexpected tokens will be skipped over and the parser will be stuck in the `expression-begin` context.
The only obvious workaround would be to duplicate the entire set of expression-related contexts, which would be a mess: the same mess as duplicating contexts to `fail` different `branch_point`s. (This is why I think it's sort-of-related.)
It isn't clear to me how you'd expect `fail` and `pop` to work together. `fail` rewinds the lexer to the last branch point with that name and restores the context stack to that point, whereas `pop` just modifies the stack.
The intent is that it should attempt to `fail`, but if that `branch_point` is not on the stack, then it should `pop`.
This is actually a bit more nuanced, because there are more failure modes than the branch point not being on the stack. Right now `fail` can work even when it is not on the stack, but there is also the situation where all existing branch options have been exhausted.
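For reference, the mechanism under discussion looks roughly like this (a minimal hypothetical example, not taken from the syntaxes above): a rule declares a `branch_point` and a list of branches to try in order, and a later `fail` rewinds the lexer to that point and tries the next branch.

```yaml
contexts:
  main:
    - match: (?=\w)
      branch_point: identifier
      branch: [function-call, plain-identifier]

  function-call:
    - match: \w+(?=\()
      scope: variable.function
      pop: true
    # Not a call after all: rewind to the branch point
    # and try the next branch.
    - match: (?=\S)
      fail: identifier

  plain-identifier:
    - match: \w+
      scope: variable.other
      pop: true
```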
While I agree that branching limits context reuse here and there, I would vote against adding more complexity or heuristic behavior. It is already hard enough to track all the possible states and paths the lexer can take when using branches. In the case of the `(?=\S)` example, it should be fairly easy to create two named contexts, `arguments-or-fail` and `arguments-or-pop`, which can be used in the appropriate situations. It may just need some more, smaller named contexts to build those from.
In this case, that would require duplicating the entire chunk of contexts involving type expressions. Re-using `ts-type-expression-begin` is the easy part: you could have a base context with most of the rules, and two contexts including the base with different "else" behavior.

But the contexts that push `ts-type-expression-begin`, such as `ts-type-expression-end`, would have to be copy-pasted with each rule tweaked to refer to the appropriate variant of `ts-type-expression-begin`. We'd also have to copy-paste and tweak any contexts "within" `ts-type-expression-begin` that might push a type expression context, such as parenthesized type expressions, function type expressions, object type expressions, and tuple type expressions. Only the simplest contexts could be reused, such as `ts-type-basic` and `ts-type-special`. In practice, about a quarter of the definition would have to be copy-pasted and slightly tweaked, then maintained in sync going forward.
Not impossible by any means, but also far from ideal.
Another example from the JavaScript syntax is arrow function detection.
Arrow function formal parameters may consist either of a single bare identifier or of a parenthesized list. These require slightly different handling (especially in TypeScript), so they are handled by separate branch rules, and consequently there is a fair bit of context duplication. If the two rules could share a `branch_point`, or if a rule could fail multiple `branch_point`s, then the duplication would be reduced.
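Under the proposed sharing, the two arrow-function branch rules might look something like this (hypothetical rule and context names; `branch_point` names must currently be unique, so this is not valid today):

```yaml
# Both rules share the branch_point `arrow-function`, so the
# branch targets could fail a single name regardless of which
# rule started the branch.
- match: (?=[_$[:alpha:]])
  branch_point: arrow-function
  branch: [arrow-bare-parameter, expression-identifier]
- match: (?=\()
  branch_point: arrow-function
  branch: [arrow-parameter-list, parenthesized-expression]
```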
https://github.com/sublimehq/Packages/issues/2267 poses further issues. In order to determine whether an identifier should get an extra `entity.name.function` scope, we have to look several tokens ahead; the only reliable solution would be branching. In essence, we'd want to speculatively parse the assigned expression as a non-function and fail if it turns out to be a function. In a perfect world, we could reuse some of the same code that already handles arrow functions, but because of the `branch_point` limitations, this is not possible.
But in this case, the duplication would be even worse. There are several kinds of identifier that could get the `entity.name.function` scope:

In each case, the identifier itself needs to be scoped differently, and since this necessarily happens inside the branch, each case would require its own `branch_point`, meaning that most of the contexts inside the branch would have to be duplicated five ways, each `fail`ing a different `branch_point`. The logic might require on the order of five contexts, which multiplied by five would be an unacceptable degree of code duplication.
As an aside, there may be some reason to prefer re-using a single `branch_point` across multiple rules, as opposed to a single rule `fail`ing multiple `branch_point`s. That reason is syntax extension.
If you have a set of contexts used by multiple branch rules, with `fail` rules that name each `branch_point`, and you need to reuse those contexts in an extension, then you would need to override those contexts to add another `branch_point` to each of the `fail` rules. If two extensions did the same thing, then one of them would break. On the other hand, if the `branch_point`s themselves were reusable, then you would not need to override the `fail` rules at all, and the extension would be much simpler and more compatible.
The issue with reusing branch names is having multiple rules in a context with the same name. A name resolves to an id. When lexing, we track the id that led to the current lexer state, and use that id when backtracking, etc.
It would be fine if a `branch_point` couldn't be used twice within the same context; this could be easily worked around with only a small bit of extra code. (I also think that it would be fairly uncommon to need this; of the practical examples I have in mind, the only one this would matter for is arrow function detection in `expression-begin`, and even with an extra context to work around that limitation it would still be simpler on net.)
The only downside I can see is that if multiple extensions tried to prepend rules to the same base context with the same `branch_point`, then this would produce an error, but that could easily be worked around by either of the extensions.
I can see how allowing multiple rules in a single context to use the same `branch_point` might require more internal changes than the overall feature is worth. But I think it should be safe to leave that particular restriction in place, since it could be worked around without much trouble.
It really needs to be the case that a given branch id always deals with the same set of targets. If there are multiple rules with different targets that share the same id, the implementation would break in funny ways.
I think implementation-wise it would probably be easier to implement named target sets, maybe named fail sets, and to allow inheritance to modify those. I think it would also be easier for people to reason about.
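One possible shape for such named sets (purely hypothetical syntax, sketched only to illustrate the idea; nothing like this exists in the format today):

```yaml
# A branch set is defined once and referenced by name, so an
# extension could override the set in one place without touching
# any fail rules.
branch_sets:
  async-member: [prefixed-method, class-field-or-method]

contexts:
  class-body:
    - match: async\b
      branch_point: async-member   # refers to the named set
```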
Problem description
As a motivating example, consider the following (taken from the core JavaScript syntax and simplified):
(In the original syntax, some of the rules and contexts are longer and handle more cases.)
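The example itself is not reproduced here; a hedged sketch of its shape, with hypothetical context and rule names reconstructed from the description below, might look like:

```yaml
contexts:
  class-body:
    - match: async\b
      branch_point: class-async
      branch: [prefixed-method, class-field-or-method]

  object-literal-body:
    - match: async\b
      branch_point: object-async
      branch: [prefixed-object-literal-method, object-key-value]

  # Identical to prefixed-object-literal-method except for the
  # branch_point it fails: the duplication this issue is about.
  prefixed-method:
    - match: \w+(?=\s*\()
      scope: entity.name.function
      pop: true
    - match: (?=\S)
      fail: class-async

  prefixed-object-literal-method:
    - match: \w+(?=\s*\()
      scope: entity.name.function
      pop: true
    - match: (?=\S)
      fail: object-async
```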
In either a class body or an object literal, the token `async` usually means an async method, but not always. The "success" case is the same in either place, but the "failure" cases differ. In a class body, it falls back to parsing a class field or method, and in an object literal it falls back to parsing a key/value pair or method.

Because the success cases are the same, the `prefixed-method` and `prefixed-object-literal-method` contexts are absolutely identical except that they `fail` different `branch_point`s. It would save 18 lines of duplicated code in the syntax definition if both of the branch rules could share a single `prefixed-method` context. However, this is impossible because `branch_point`s must be unique, so there is no alternative but to copy-paste the `prefixed-method` context.

This is the first time I've run into this limitation, but seeing the shape of it, it seems likely that it will generally be difficult to reuse contexts in different branches, even when the contexts are otherwise identical.
Preferred solution
Allow `fail` to specify a sequence of branch names, as in the following example:

When the `fail` is encountered, the parser should fail the topmost branch on the stack that matches any of the specified branch points. This would make it much easier to reuse branch success contexts.
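The kind of rule being proposed might look like this (hypothetical syntax and branch point names; `fail` currently accepts only a single name):

```yaml
prefixed-method:
  - match: \w+(?=\s*\()
    scope: entity.name.function
    pop: true
  - match: (?=\S)
    # Fail whichever of these branch points is topmost on the stack.
    fail: [class-async, object-async]
```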
Alternatives
It would also suffice if multiple rules could use the same `branch_point` name. This might arguably be simpler from an authoring standpoint. However, it may be complex or fiddly to implement in the engine.