Incremental and error-resilient parser

novusnota commented 5 months ago

Would greatly help with #183.

The basic idea is to move from the current Ohm-generated parser to a standalone one, which would be suitable for tools like Language Server or others. Additionally, it would be nice to keep the existing grammar.ohm spec file — it will allow us to check the implementetion of the new parser against an Ohm-generated one.

List of ideas, papers and useful blog-posts (organized by year, in decreasing order):

Some more things:

The Lezer Parser System: https://lezer.codemirror.net
https://github.com/zesterer/chumsky

novusnota commented 5 months ago

But until then, we must utilize the most out of Ohm's parser, including its incremental parsing capabilities and try to enhance its syntax error reporting clarity locally or in it's upstream repo. It's actually seems to be not that far from giving a structured error message instead of a pre-formed string (although if worst comes to worst, we could parse the resulting string however brittle this may seem).

Note, that it's not strictly necessary to eventually make an incremental parser, but it's a must to make it error-resilient and producing nice errors (see: #183).

0xGeorgii commented 5 months ago

hey @novusnota this is a great idea and a huge boost for tact infrastructure! I am interested in implementing this part but there are many details for such a big project that need to be settled down before approaching it. here are some of the thoughts and questions I would like to raise and discuss here:

what are the key non-functional qualities required for the parser? 1.1 performance? should it be performance-optimized and be able to possess N tokens in a second? if yes, is there a desired value for that or a reference performance example (like to be not slower than Ohm)? 1.2 resilience? should it be able to produce a stable AST for the broken syntax? 1.3 do we need a fuzzer to validate pp. 1.1 and 1.2 ?
should it be stateful? keeping in mind an IDE infrastructure pipeline IDE <-> LSP <-> [parser; compiler] ? 2.1 now, here it means incremental parsing is required because we probably do not want to reiterate the whole tree but only the changed branch.
do we need a custom lexer?
a smart internal structure design is required to provide a painless way to reflect tact syntax updates.
~~is a static AST analyser needed?~~

The target language is assumed to be ts to keep the variety of used technologies dense. Since Ohm is currently used, the new parser can adopt and take some approaches from it (to reduce reinventing the wheel phase).

anton-trunov commented 5 months ago

is a static AST analyser needed?

This is definitely out-of-scope for the parser project. There are two static analyzer projects supported by the grants and bounties program:

novusnota commented 5 months ago

@0xGeorgii this is a big project, but we don't need to rush out the implementation before we prototype some more with Ohm and check off some points from #314 with it. Moreover, #183 has to be addressed with the current Ohm parser we have.

That said, Ohm won't be going anywhere even when we'll have a new parser — it's always nice to have a formal specification/reference, so Ohm's grammar.ohm and auto-generated parser would suit as that reference.

Now, to your questions:

1.1 performance?

I think just being faster than Ohm would do, but concrete metrics are up for discussion.

1.2 resilience? should it be able to produce a stable AST for the broken syntax?

Yes, that's the point :) Note, that this is different from error-recovery in a sense that the parser should just not fall in presence of failures instead of trying to correct some of them (by placing a pseudo-node with ); when calling functions like myAddress(|, where | is the caret position, for example). Although I'm not sure which strategy would be the most optimal, there's one thing for sure — this parser should work in editor/IDE environments too, where code is, most of the time, in the broken state.

Oh, and incrementality — lexer/parser combo should be able to start from a specified rule and not re-lex/re-parse the whole tree. This is very important for editor/IDE environments with frequent changes of the source code. (We may introduce a debounce or something here, just to keep things simple, but that's still a thing to keep in mind nonetheless).

At the moment, I lean towards having a combined LL (with some tricks) + Pratt (for expressions) parser architecture. But this is not set in stone :)

1.3 do we need a fuzzer to validate pp. 1.1 and 1.2?

Not sure here, but it would be nice.

do we need a custom lexer?

Yes, and it also must be error-resilient — just marking invalid tokens and going forward should be enough.

a smart internal structure design is required to provide a painless way to reflect tact syntax updates.

That's what the Ohm's grammar for — syntax will first be tested in it, then those changed would be matched by the new parser. At least that's how I view it.

Some important questions (IMHO) weren't asked:

What to do with comments? Especially multi-line ones? Do we attach them to the AST after the first pass, or...
What to do with errors? How could we fine-tune them for: end-users on the CLI, API usage in tools, etc.
How would the API for external tools and CLI would look like? Related to #314 and other similar issues.

Summoning @anton-trunov to correct me about our plans and/or my points here :)

tact-lang / tact

Incremental and error-resilient parser #286