Proposal: a mixed-language interface accessible within ExternalTokenizer instances

ashtonsix commented 2 years ago

Since opening a thread about mixed-language parsing I've continued to explore this topic, and am now looking at some relevant changes to Lezer. (And am most-way through an implementation).

It'd be great to ascertain whether you (@marijnh) have any interest in making these changes part of Lezer, and if you do, whether I'm going in the right direction for a future pull request. (If we can't find alignment here I'll probably end up creating a fork of Lezer).

Part 1: Nested Parsing

Proposal:

The ExternalTokenizer should be able to run a nested parse and save the result for later. And for some cases, be able to automatically figure out the length of the nested parse.

Interface:

Stack.nestedParse(nested: NestedParse, length?: number): Tree
InputStream.acceptToken(token: number, nested: Tree) // overload

Description (Stack.nestedParse):

Stack.nestedParse() does a parse from stack.pos to stack.pos + length. length is required for non-LR parsers, and optional otherwise. If length is not provided, and either the inner or outer parser is strict, the inner parse grabs as much valid code as it can, ceding control to the outer parse when it stops or encounters an unexpected token.

Incremental Parsing Even if the resulting tree isn't eventually mounted the result of Stack.nestedParse() is recorded on the outer parse and wired up in such a way that it may be reused during a future parse. At the end of each parse any unused trees from previous parses are deleted, to free them from memory.

Error Recovery If both parsers are non-strict (and length is not provided), the outer parse is split (once for each token the external tokenizer can shift), jumps ahead to the inner parse location, and checks each outer parse stack for actions it can take. If any of those stacks can advance then control is ceded to the outer parse. If neither parse can advance normally they both enter recovery mode; if the inner parse recovers first, it continues; if the outer parse recovers first (or neither recovers within 20-ish steps), the inner parse is ended at the unexpected token. The outer parse is then rewound to its state from before Stack.nestedParse() was called.

Overlays I haven't yet figured out what to do when the nested parse specifies an overlay. For now I propose throwing an error.

Description (InputStream.acceptToken):

The full signature for InputStream.acceptToken() will look like:

acceptToken(token: number, endOffset?: number): void
acceptToken(token: number, nested: Tree): void
acceptToken(token: number, offsetOrTree: Tree | number = 0) {
  // implementation...
}

If a tree is given we infer the endOffset from tree.length, and mark the accepted token as a mount, and add the tree to the parse (if it wasn't already added by Stack.nestedParse).

Sub-Proposal:

Add an option for NestedParse.top, as in, Stack.nestedParse({parser: javascript, top: [expressionNoComma]}). Which overwrites LRParser.topRules for that parse.

Part 2: Tree Peeking

Proposal:

The ExternalTokenizer should be able to decide what to accept based on contextual cues, the InputStream API enables this to a limited extent, but for more complicated cases it'd be great to inspect a "local tree".

Interface:

Stack.resolveTree(top: number[]): Tree
Stack.splitShift(term: number): Stack

Description:

Stack.resolveTree() identifies the shortest sequence of actions to reduce any of the given top terms and then returns a tree with that node at the top. All tokens that appear after stack.pos are artificially synthesized and have a length of 0. To "find the shortest sequence" we'll use ranked BFS, with priority given to reduce actions and closing delimiters (slightly faster than A* search), and never visit the same state twice. In the worst case where no path to the given top node exists Stack.resolveTree() will visit each state once. We'll cache the sequence of actions based on the combination of stack.state and top (this cache will be stored on the LRParser instance). We'll also reuse some work for the backwards direction, making use of previous parse trees on the stack buffer (as indicated by "-1 size terms"). Since this looks at the stack rather than input this method shouldn't need to create any performance-impacting dependencies the way InputStream.peek() does. I tried to figure out a way the call to buildTree() (from @lezer/common) could be cached/reused but haven't found a suitable approach, so think that would need to run once for every Stack.resolveTree() call.

Stack.splitShift() splits the stack and adds one artificially synthesized 0-length term. One would use this method if they want to ensure the sequence of actions found by Stack.resolveTree() begins with a particular term, for example, by doing stack.splitShift(EmbedArgument).resolveTree([CallExpression, NewExpression]). I guess Stack.splitShift() could also be used for LR(2) lookaheads but I haven't given that use case much thought.

Sub-Proposal:

Once one actually has a local tree one is likely to want to peek at the relevant input, for example, to see the function name for a JavaScript CallExpression. One can do this with InputStream.peek() today, but that isn't a great approach for two reasons:

peek() only looks at one code unit per invocation, so I suggest adding an overload like InputStream.peek(from: number, to: number), to make grabbing all the input for a given tree node more convenient. input.peek(4, 6) would be equivalent to input.peek(4) + input.peek(5), matching the semantics of string.slice().
Only >25 code units of lookbehind are currently supported, it'd be great to have more.

As an optimisation, it may be worth considering having input.peek create "look ranges" rather than simply setting a lookAhead value. So if the second argument to a CallExpression peeks at the function name, and then the first argument is modified, then we could check the relevant look range and see it's possible to skip re-tokenisation for the second argument.

marijnh commented 2 years ago

My first impulse is to push back on this. The system is already hugely complicated and subtle, and adding another 'mode' to it seems to invite more additional complexity and bugs than I'm willing to deal with—especially since, at this point, your strategy for incremental parsing and error-recovery sound alarmingly hand-wavey.

What is the compelling use case this would enable?

ashtonsix commented 2 years ago

adding another 'mode' to it seems to invite more additional complexity and bugs

Mmm. To help mitigate this, I would suggest greater test coverage for external tokenizers, incremental parsing and mixed parsing. I'd be happy to expand the fixture format to make creating tests for those area easier/faster, and to contribute a few new test cases, if that'd placate your concerns about this PR. For what it's worth, I've been able to do the basic (no error recovery / incremental parsing) implementation for this proposal without modifying the behaviour of what's already been built.

Is there anything else that could be done to put your mind at ease?

your strategy for incremental parsing and error-recovery sound alarmingly hand-wavey

Agreed. I'll need to get further into the working implementation before I have something more concrete, but think I'm looking in the right direction? Any suggestions you may have to help reify my incremental parsing / error-recovery strategy at the outset would be much-appreciated.

What is the compelling use case this would enable?

Sure. To contextualise, I'm excited in Lezer's potential as a language design and extension tool, so I'm optimising for "ease of creating a mixed-syntax language" rather than "ease of implementing a parser for a pre-existing mixed-syntax language". I also have an interest in building some relevant companion tools for Lezer (tree-to-source printing, tree-to-tree transform, tree-to-execution graph, programmatic grammar extension/modification).

In combination with those companion tools, the mixed-language interface I'm proposing here would enable a variety of embedded language macros. For example, let's say I want to add support for "block macros" to my language, so library authors can package things like "if/else" or "match" blocks and distribute them easily.

So, for a hypothetical, we have a language with block macros but no match block specifically. We should be able to write working code like this (where match is an external library):

import macro match from 'match'

match (res) {
  when ({ status: 200, body, ...rest }) handleData(body, rest)
  when ({ status, destination: url }) if (300 <= status && status < 400)
    handleRedirect(url)
  when ({ status: 500 }) if (!this.hasRetried) do {
    retry(req);
    this.hasRetried = true;
  }
  else throwSomething();
}

To implement block macros in the language, one would first create a context tracker to handle the "import macro" declarations, able to associate a parser with the "match" identifier. One would then create a grammar rule like BlockMacro { identifier ParenthesizedExpression BlockMacroBody }, and then an external tokenizer for BlockMacroBody that checks to see whether the identifier in BlockMacro matches an imported macro identifier, and if it does, uses the associated parser to consume the BlockMacroBody. And as an aside, I'd hope that these macros could bundle macro-specific language support (eg, syntax highlighting).

Looking further ahead, I see a future where syntax, type systems, IDE support, etc, are things a program imports rather than being features of a programming language. Essentially, unbundling many of today's programming languages into a set of reusable components. I think hand-rolling such a system would be almost impossibly complex for a human to do, but achievable with the help of tools like Lezer.

Of course, this is all for a programming paradigm that doesn't exist yet. I had a look to see whether there were any programming language features that exist today where parseMixed would be insuffcient for supporting some important feature but came up empty. Speaking candidly, it seems like pretty much all of today's languages are "near-CFG", and that you need to get pretty far into the world of context-sensitive grammar for the value proposition of this PR to start making sense. Make what you will of that.

marijnh commented 2 years ago

I think that, for this kind of experimental work, it would make sense to fork the library. In my experience, including very specific interfaces like this in libraries quickly gets me to a situation where they become part of the interface and have to be supported even after the projects that actually use them stop using them or go away entirely, to avoid breaking compatibility.

ashtonsix commented 2 years ago

Makes sense, thanks for your consideration and for creating Lezer 🙂. Closing this issue.

lezer-parser / lezer

Proposal: a mixed-language interface accessible within ExternalTokenizer instances #17

Part 1: Nested Parsing

Part 2: Tree Peeking