ashtonsix closed this issue 2 years ago
My first impulse is to push back on this. The system is already hugely complicated and subtle, and adding another 'mode' to it seems to invite more additional complexity and bugs than I'm willing to deal with—especially since, at this point, your strategy for incremental parsing and error-recovery sounds alarmingly hand-wavey.
What is the compelling use case this would enable?
> adding another 'mode' to it seems to invite more additional complexity and bugs
Mmm. To help mitigate this, I would suggest greater test coverage for external tokenizers, incremental parsing and mixed parsing. I'd be happy to expand the fixture format to make creating tests for those areas easier/faster, and to contribute a few new test cases, if that'd placate your concerns about this PR. For what it's worth, I've been able to do the basic (no error recovery / incremental parsing) implementation for this proposal without modifying the behaviour of what's already been built.
Is there anything else that could be done to put your mind at ease?
> your strategy for incremental parsing and error-recovery sounds alarmingly hand-wavey
Agreed. I'll need to get further into the working implementation before I have something more concrete, but do you think I'm looking in the right direction? Any suggestions you may have to help reify my incremental parsing / error-recovery strategy at the outset would be much appreciated.
> What is the compelling use case this would enable?
Sure. To contextualise, I'm excited about Lezer's potential as a language design and extension tool, so I'm optimising for "ease of creating a mixed-syntax language" rather than "ease of implementing a parser for a pre-existing mixed-syntax language". I also have an interest in building some relevant companion tools for Lezer (tree-to-source printing, tree-to-tree transform, tree-to-execution graph, programmatic grammar extension/modification).
In combination with those companion tools, the mixed-language interface I'm proposing here would enable a variety of embedded language macros. For example, let's say I want to add support for "block macros" to my language, so library authors can package things like "if/else" or "match" blocks and distribute them easily.
So, for a hypothetical, we have a language with block macros but no match block specifically. We should be able to write working code like this (where `match` is an external library):
```
import macro match from 'match'

match (res) {
  when ({ status: 200, body, ...rest }) handleData(body, rest)
  when ({ status, destination: url }) if (300 <= status && status < 400)
    handleRedirect(url)
  when ({ status: 500 }) if (!this.hasRetried) do {
    retry(req);
    this.hasRetried = true;
  }
  else throwSomething();
}
```
To implement block macros in the language, one would first create a context tracker to handle the `import macro` declarations, able to associate a parser with the `match` identifier. One would then create a grammar rule like `BlockMacro { identifier ParenthesizedExpression BlockMacroBody }`, and then an external tokenizer for `BlockMacroBody` that checks whether the identifier in `BlockMacro` matches an imported macro identifier, and if it does, uses the associated parser to consume the `BlockMacroBody`. As an aside, I'd hope that these macros could bundle macro-specific language support (eg, syntax highlighting).
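Sketched in Lezer grammar notation, the pieces might fit together roughly like this (the tracker and tokenizer names `macroContext` and `blockMacroBody` are hypothetical, not from any actual implementation):

```
// Hypothetical sketch: a context tracker records "import macro" bindings,
// and an external tokenizer consumes the macro body using the bound parser.
@context macroContext from "./tokens"
@external tokens blockMacroBody from "./tokens" { BlockMacroBody }

BlockMacro { identifier ParenthesizedExpression BlockMacroBody }
```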
Looking further ahead, I see a future where syntax, type systems, IDE support, etc, are things a program imports rather than being features of a programming language. Essentially, unbundling many of today's programming languages into a set of reusable components. I think hand-rolling such a system would be almost impossibly complex for a human to do, but achievable with the help of tools like Lezer.
Of course, this is all for a programming paradigm that doesn't exist yet. I had a look to see whether there were any programming language features that exist today where `parseMixed` would be insufficient for supporting some important feature, but came up empty. Speaking candidly, it seems like pretty much all of today's languages are "near-CFG", and that you need to get pretty far into the world of context-sensitive grammar for the value proposition of this PR to start making sense. Make what you will of that.
I think that, for this kind of experimental work, it would make sense to fork the library. In my experience, including very specific interfaces like this in libraries quickly gets me to a situation where they become part of the interface and have to be supported even after the projects that actually use them stop using them or go away entirely, to avoid breaking compatibility.
Makes sense, thanks for your consideration and for creating Lezer 🙂. Closing this issue.
Since opening a thread about mixed-language parsing I've continued to explore this topic, and am now looking at some relevant changes to Lezer. (And am most of the way through an implementation.)
It'd be great to ascertain whether you (@marijnh) have any interest in making these changes part of Lezer, and if you do, whether I'm going in the right direction for a future pull request. (If we can't find alignment here I'll probably end up creating a fork of Lezer).
Part 1: Nested Parsing
Proposal:
The `ExternalTokenizer` should be able to run a nested parse and save the result for later, and in some cases be able to automatically figure out the length of the nested parse.

Interface:
Description (`Stack.nestedParse`):

`Stack.nestedParse()` does a parse from `stack.pos` to `stack.pos + length`. `length` is required for non-LR parsers, and optional otherwise. If `length` is not provided, and either the inner or outer parser is strict, the inner parse grabs as much valid code as it can, ceding control to the outer parse when it stops or encounters an unexpected token.

**Incremental Parsing:** Even if the resulting tree isn't eventually mounted, the result of `Stack.nestedParse()` is recorded on the outer parse and wired up in such a way that it may be reused during a future parse. At the end of each parse any unused trees from previous parses are deleted, to free their memory.

**Error Recovery:** If both parsers are non-strict (and `length` is not provided), the outer parse is split (once for each token the external tokenizer can shift), jumps ahead to the inner parse location, and checks each outer parse stack for actions it can take. If any of those stacks can advance, control is ceded to the outer parse. If neither parse can advance normally, they both enter recovery mode; if the inner parse recovers first, it continues; if the outer parse recovers first (or neither recovers within 20-ish steps), the inner parse is ended at the unexpected token. The outer parse is then rewound to its state from before `Stack.nestedParse()` was called.

**Overlays:** I haven't yet figured out what to do when the nested parse specifies an overlay. For now I propose throwing an error.
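The record-reuse-sweep bookkeeping described under Incremental Parsing can be illustrated with a toy cache (all names here are hypothetical; in a real implementation this state would live on the parse itself):

```typescript
// Toy model of the proposed bookkeeping: nested-parse results are recorded
// keyed by position, may be reused by a future parse, and any entry not
// touched during a parse is swept afterwards to free memory.
type ToyTree = { length: number }

class NestedTreeCache {
  private trees = new Map<number, ToyTree>()
  private used = new Set<number>()

  // Record the result of a (hypothetical) Stack.nestedParse() call.
  record(pos: number, tree: ToyTree) {
    this.trees.set(pos, tree)
    this.used.add(pos)
  }

  // During a future parse, try to reuse a previously recorded tree.
  reuse(pos: number): ToyTree | undefined {
    const tree = this.trees.get(pos)
    if (tree) this.used.add(pos)
    return tree
  }

  // Call at the start of a parse: nothing has been used yet.
  beginParse() {
    this.used.clear()
  }

  // Call at the end of a parse: delete trees this parse never touched.
  endParse() {
    for (const pos of this.trees.keys())
      if (!this.used.has(pos)) this.trees.delete(pos)
  }
}
```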
Description (`InputStream.acceptToken`):

The full signature for `InputStream.acceptToken()` will look like:

If a tree is given we infer the `endOffset` from `tree.length`, mark the accepted token as a mount, and add the tree to the parse (if it wasn't already added by `Stack.nestedParse`).
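The signature snippet itself didn't survive in this copy of the thread; as a hedged stand-in, the end-offset inference described above can be modelled like this (names and shape are my guesses, not the actual Lezer API):

```typescript
// Toy model of the proposed rule: when a tree accompanies an accepted
// token, the tree's length determines the token's end offset.
type ToyTree = { length: number }

function resolveEndOffset(endOffset?: number, tree?: ToyTree): number {
  // A given tree wins: the accepted token spans exactly the nested tree.
  if (tree) return tree.length
  // Otherwise fall back to the explicit offset (0 if neither is given).
  return endOffset ?? 0
}
```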
Sub-Proposal:

Add an option for `NestedParse.top`, as in `Stack.nestedParse({parser: javascript, top: [expressionNoComma]})`, which overwrites `LRParser.topRules` for that parse.

Part 2: Tree Peeking
Proposal:

The `ExternalTokenizer` should be able to decide what to accept based on contextual cues. The `InputStream` API enables this to a limited extent, but for more complicated cases it'd be great to inspect a "local tree".

Interface:
Description:

`Stack.resolveTree()` identifies the shortest sequence of actions to reduce any of the given `top` terms, and then returns a tree with that node at the top. All tokens that appear after `stack.pos` are artificially synthesized and have a length of 0. To find the shortest sequence we'll use ranked BFS, with priority given to reduce actions and closing delimiters (slightly faster than A* search), and never visit the same state twice. In the worst case, where no path to the given top node exists, `Stack.resolveTree()` will visit each state once. We'll cache the sequence of actions based on the combination of `stack.state` and `top` (this cache will be stored on the `LRParser` instance). We'll also reuse some work for the backwards direction, making use of previous parse trees on the stack buffer (as indicated by "-1 size terms"). Since this looks at the stack rather than the input, this method shouldn't need to create any performance-impacting dependencies the way `InputStream.peek()` does. I tried to figure out a way the call to `buildTree()` (from @lezer/common) could be cached/reused, but haven't found a suitable approach, so I think that would need to run once for every `Stack.resolveTree()` call.

`Stack.splitShift()` splits the stack and adds one artificially synthesized 0-length term. One would use this method to ensure the sequence of actions found by `Stack.resolveTree()` begins with a particular term, for example by doing `stack.splitShift(EmbedArgument).resolveTree([CallExpression, NewExpression])`. I guess `Stack.splitShift()` could also be used for LR(2) lookaheads, but I haven't given that use case much thought.

Sub-Proposal:
Once one actually has a local tree, one is likely to want to peek at the relevant input, for example to see the function name for a JavaScript `CallExpression`. One can do this with `InputStream.peek()` today, but that isn't a great approach for two reasons: `peek()` only looks at one code unit per invocation, so I suggest adding an overload like `InputStream.peek(from: number, to: number)` to make grabbing all the input for a given tree node more convenient. `input.peek(4, 6)` would be equivalent to `input.peek(4) + input.peek(5)`, matching the semantics of `string.slice()`.
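As a toy model of the proposed overload (the real `InputStream.peek(offset)` returns a code unit number; this sketch returns strings to keep the slice analogy visible):

```typescript
// Toy input over a string: peek(offset) reads one code unit relative to the
// current position; the proposed peek(from, to) overload returns everything
// in [from, to), matching string.slice() semantics.
class ToyInput {
  constructor(private doc: string, private pos = 0) {}

  peek(from: number, to?: number): string {
    if (to == null) return this.doc.charAt(this.pos + from)
    return this.doc.slice(this.pos + from, this.pos + to)
  }
}
```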
As an optimisation, it may be worth considering having `input.peek` create "look ranges" rather than simply setting a `lookAhead` value. So if the second argument to a `CallExpression` peeks at the function name, and then the first argument is modified, we could check the relevant look range and see that it's possible to skip re-tokenisation for the second argument.
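The look-range idea could work roughly like this toy (hypothetical bookkeeping, not Lezer's actual `lookAhead` mechanism):

```typescript
// Toy look-range bookkeeping: each token records the exact input ranges it
// inspected. An edit only forces re-tokenisation of tokens whose recorded
// ranges overlap the edited span, rather than every token whose single
// lookAhead bound happens to reach past the edit.
type Range = { from: number; to: number }

class LookRanges {
  private ranges = new Map<string, Range[]>()

  // Record that `token` looked at input [from, to).
  looked(token: string, from: number, to: number) {
    const list = this.ranges.get(token) ?? []
    list.push({ from, to })
    this.ranges.set(token, list)
  }

  // After an edit of [from, to), which tokens must be re-tokenised?
  invalidated(from: number, to: number): string[] {
    const out: string[] = []
    for (const [token, list] of this.ranges)
      if (list.some(r => r.from < to && from < r.to)) out.push(token)
    return out
  }
}
```

With this, an argument that peeked at a far-away function name is only invalidated when the function name itself changes, not when unrelated earlier input does.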