Closed — jchitel closed this issue 6 years ago
Ok, so we have three types of expansions:

- **Sequential** expansions specify a list of expansions to be accepted in sequence, where each expansion is applied as a property to some CST node object.
- **Choice** expansions specify a list of expansions to be tried; the first one that matches is applied to a choice property of some CST node object.
- **Left-recursive** expansions specify two lists: a list of base expansions, which are accepted the same way as a basic choice expansion, and a list of suffixes, which are accepted any number of times after one base has been accepted. Each accepted suffix takes the previous base, inserts it into itself at a specified property, and sets itself as the next base. On the first iteration in which all suffixes fail, the previous base is returned as the result.
We want to add two primary changes:
One major difference here is that we will no longer be using containers for choices. What we should do is go back to the original idea of specifying property names on choices so that we don't need to apply a bunch of union types to a single property. But some choices should still share the same property name.
So, here are the results of each type:
Some rules:
So what does this mean for typing:
I added a branch with the initial work for this: new-parser-logic.
It is not worth doing this right now. It would be better to wait until we can properly type everything, because with the current state of TypeScript the whole thing comes out looking very obnoxious.
Additionally, something like this will be very well-suited for a DSL. This is documented here.
So, I ended up finding a completely different way to refactor the parser logic. There isn't a whole lot that can be improved from this point, aside from:
However, for now, this is done.
For some reason I keep coming back to this crap, because I'm never satisfied.
Well, guess what, I'm finally satisfied. The new parser logic is dead simple, and based around building composed functions.
All parse functions (`type ParseFunc<T> = (parser: Parser) => ParseResult<T>`, where `interface ParseResult<T> { result: Optional<T>; remaining: Parser }`) take a parser object and return a result object containing the parsed object and a parser holding the remaining tokens in the token stream. This is a pure functional interface.
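Spelled out concretely, with an assumed minimal `Parser` representation (an immutable token list plus a cursor position; the issue doesn't show the real internals), the core interface looks something like this. The `pure` helper is purely illustrative, just to show the functional shape:

```typescript
// Hypothetical minimal model; the real Parser internals aren't shown in the issue.
type Optional<T> = T | null;

interface Token { type: string; image: string; }

// Assumed representation: an immutable token list plus a cursor position.
interface Parser { tokens: Token[]; pos: number; }

interface ParseResult<T> { result: Optional<T>; remaining: Parser; }

type ParseFunc<T> = (parser: Parser) => ParseResult<T>;

// A trivial ParseFunc illustrating the pure interface: it consumes nothing
// and always succeeds with a fixed value.
function pure<T>(value: T): ParseFunc<T> {
    return parser => ({ result: value, remaining: parser });
}
```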
The base function is a token parsing function, `tok(string | TokenType)`, which takes a token image or type and returns a `ParseFunc` that parses tokens of that image or type. If the parse is unsuccessful, `null` is returned as the result, and the parser will contain internal state about the token that was found instead.
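A sketch of `tok` under an assumed minimal model (token list plus cursor, with `TokenType` simplified to a plain string); the bookkeeping about the offending token on failure is omitted here:

```typescript
type Optional<T> = T | null;
interface Token { type: string; image: string; }
interface Parser { tokens: Token[]; pos: number; }
interface ParseResult<T> { result: Optional<T>; remaining: Parser; }
type ParseFunc<T> = (parser: Parser) => ParseResult<T>;

// tok() accepts either a token image or a token type and matches one token.
function tok(imageOrType: string): ParseFunc<Token> {
    return parser => {
        const next = parser.tokens[parser.pos];
        if (next && (next.image === imageOrType || next.type === imageOrType)) {
            return { result: next, remaining: { ...parser, pos: parser.pos + 1 } };
        }
        // Failure: null result; the real version also records the token found instead.
        return { result: null, remaining: parser };
    };
}
```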
The next function for building `ParseFunc`s is `seq<...T, R>(...ParseFunc<...T>, ([...T]) => R)`. It takes a list of parse functions, each of which returns a result of its own type. Each is parsed in sequence, and if all succeed, the results are collected into an array and passed into the specified function, which converts the array into some result object. The result is a parse function that returns the aggregated result object. If any of the specified functions fails, the whole sequence fails, returning the failed parser for the item that failed.
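One way `seq` might be sketched, using a mapped tuple type to approximate the `...T` signature (the `Parser` model and the `tok` helper used in the example are assumptions for illustration):

```typescript
type Optional<T> = T | null;
interface Token { type: string; image: string; }
interface Parser { tokens: Token[]; pos: number; }
interface ParseResult<T> { result: Optional<T>; remaining: Parser; }
type ParseFunc<T> = (parser: Parser) => ParseResult<T>;

const tok = (imageOrType: string): ParseFunc<Token> => parser => {
    const next = parser.tokens[parser.pos];
    return next && (next.image === imageOrType || next.type === imageOrType)
        ? { result: next, remaining: { ...parser, pos: parser.pos + 1 } }
        : { result: null, remaining: parser };
};

// seq() runs each function in order; on full success the collected results
// are passed through the transform, otherwise the first failure is returned.
function seq<T extends unknown[], R>(
    funcs: { [K in keyof T]: ParseFunc<T[K]> },
    toResult: (results: T) => R
): ParseFunc<R> {
    return parser => {
        // the mapped tuple type is an ordinary array at runtime
        const fns = funcs as unknown as readonly ParseFunc<unknown>[];
        const results: unknown[] = [];
        let current = parser;
        for (const f of fns) {
            const { result, remaining } = f(current);
            if (result === null) return { result: null, remaining }; // whole sequence fails
            results.push(result);
            current = remaining;
        }
        return { result: toResult(results as T), remaining: current };
    };
}
```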
The next function is `select<T>(...ParseFunc<T>)`, which parses one of several parse functions. The order is important, because each function will be attempted one by one, and the first one to succeed will return. The resulting type is a union of all of the result types of the specified functions. If all of them fail, a failure is returned containing the next token in the stream.
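A sketch of `select`, assuming a minimal token-list `Parser` model (the `tok` helper in the example is likewise an illustrative assumption):

```typescript
type Optional<T> = T | null;
interface Token { type: string; image: string; }
interface Parser { tokens: Token[]; pos: number; }
interface ParseResult<T> { result: Optional<T>; remaining: Parser; }
type ParseFunc<T> = (parser: Parser) => ParseResult<T>;

const tok = (imageOrType: string): ParseFunc<Token> => parser => {
    const next = parser.tokens[parser.pos];
    return next && (next.image === imageOrType || next.type === imageOrType)
        ? { result: next, remaining: { ...parser, pos: parser.pos + 1 } }
        : { result: null, remaining: parser };
};

// select() tries each alternative in order; the first success wins.
function select<T>(...funcs: ParseFunc<T>[]): ParseFunc<T> {
    return parser => {
        for (const f of funcs) {
            const res = f(parser);
            if (res.result !== null) return res; // first success wins
        }
        // all failed: the real version reports the next token in the stream
        return { result: null, remaining: parser };
    };
}
```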
The last two functions are "qualifier" functions, the simpler of which is `optional<T>(ParseFunc<T>): ParseFunc<Optional<T>>`. All `ParseResult`s contain an optional result (in the instance of a failure), but this builder treats failures as successes, so the actual resulting value may be null. It's fairly simple: it calls the specified function and converts any failure to a success, returning null as the result.
The last function is `repeat<T>(ParseFunc<T>, '+' | '*', ?ParseFunc<{}>): ParseFunc<T[]>`. This is probably the fanciest builder function. It repeatedly parses the specified function, collecting each successful result into an array, and returns the array when it hits the first failure. The additional parameters add sugar on top of this logic. If `'*'` is passed for the second parameter, this specifies normal behavior; if `'+'` is passed instead, then at least one repetition is required. The last parameter is a separator function. If it is specified, the separator is required between each repetition. These separators are omitted from the result because they are never used (yet, but changing the logic to include them would be trivial).
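The two qualifier functions might be sketched as follows (again assuming a minimal token-list `Parser` model; separators are consumed but dropped, matching the description):

```typescript
type Optional<T> = T | null;
interface Token { type: string; image: string; }
interface Parser { tokens: Token[]; pos: number; }
interface ParseResult<T> { result: Optional<T>; remaining: Parser; }
type ParseFunc<T> = (parser: Parser) => ParseResult<T>;

const tok = (imageOrType: string): ParseFunc<Token> => parser => {
    const next = parser.tokens[parser.pos];
    return next && (next.image === imageOrType || next.type === imageOrType)
        ? { result: next, remaining: { ...parser, pos: parser.pos + 1 } }
        : { result: null, remaining: parser };
};

// optional() turns a failure into a success with a null result, consuming nothing.
function optional<T>(func: ParseFunc<T>): ParseFunc<Optional<T>> {
    return parser => {
        const { result, remaining } = func(parser);
        return result === null
            ? { result: null, remaining: parser }
            : { result, remaining };
    };
}

// repeat() parses func until the first failure; '+' requires at least one
// repetition, and sep (if given) is required between repetitions but dropped.
function repeat<T>(func: ParseFunc<T>, mode: '+' | '*', sep?: ParseFunc<{}>): ParseFunc<T[]> {
    return parser => {
        const results: T[] = [];
        let current = parser;
        while (true) {
            let attempt = current;
            if (results.length > 0 && sep) {
                const sepRes = sep(attempt);
                if (sepRes.result === null) break; // no separator: stop repeating
                attempt = sepRes.remaining;
            }
            const { result, remaining } = func(attempt);
            if (result === null) break;
            results.push(result);
            current = remaining; // separators are consumed but not collected
        }
        return mode === '+' && results.length === 0
            ? { result: null, remaining: parser }
            : { result: results, remaining: current };
    };
}
```

One consequence of this shape worth noting: because failure is signaled by a `null` result, an `optional` success whose value is `null` is indistinguishable from a failure by the result alone.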
Because the parse functions can have arbitrary structure and return arbitrary values, this system is extremely expressive and flexible. For example, one thing that is not directly included is left-recursion. This was solved simply by specifying two types for each left-recursive syntax type: one formal type that is ultimately returned, and an intermediate suffix type, which ignores the first item in the sequence and adds a transform function that takes the first item after parsing and uses it to produce the formal type. The suffix is what is actually parsed, and a combination of `seq`, `select`, and `repeat` can be used to form the full left-recursive type.
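To make the suffix technique concrete, here is a sketch built on a hypothetical grammar `Expr ::= Expr '+' NUM | NUM` (the grammar, the `Add` type, and the `Parser` model are all illustrative assumptions, not from the issue):

```typescript
// Minimal self-contained model (assumed: an immutable token list plus a cursor).
type Optional<T> = T | null;
interface Token { type: string; image: string; }
interface Parser { tokens: Token[]; pos: number; }
interface ParseResult<T> { result: Optional<T>; remaining: Parser; }
type ParseFunc<T> = (parser: Parser) => ParseResult<T>;

const tok = (imageOrType: string): ParseFunc<Token> => parser => {
    const next = parser.tokens[parser.pos];
    return next && (next.image === imageOrType || next.type === imageOrType)
        ? { result: next, remaining: { ...parser, pos: parser.pos + 1 } }
        : { result: null, remaining: parser };
};

function seq<T extends unknown[], R>(
    funcs: { [K in keyof T]: ParseFunc<T[K]> },
    toResult: (results: T) => R
): ParseFunc<R> {
    return parser => {
        const fns = funcs as unknown as readonly ParseFunc<unknown>[];
        const results: unknown[] = [];
        let current = parser;
        for (const f of fns) {
            const { result, remaining } = f(current);
            if (result === null) return { result: null, remaining };
            results.push(result);
            current = remaining;
        }
        return { result: toResult(results as T), remaining: current };
    };
}

function repeat<T>(func: ParseFunc<T>, mode: '+' | '*', sep?: ParseFunc<{}>): ParseFunc<T[]> {
    return parser => {
        const results: T[] = [];
        let current = parser;
        while (true) {
            let attempt = current;
            if (results.length > 0 && sep) {
                const sepRes = sep(attempt);
                if (sepRes.result === null) break;
                attempt = sepRes.remaining;
            }
            const { result, remaining } = func(attempt);
            if (result === null) break;
            results.push(result);
            current = remaining;
        }
        return mode === '+' && results.length === 0
            ? { result: null, remaining: parser }
            : { result: results, remaining: current };
    };
}

// Left-recursive grammar (hypothetical): Expr ::= Expr '+' NUM | NUM
// Rewritten without left-recursion as:   Expr ::= NUM ('+' NUM)*
// Each suffix parses "'+' NUM" and yields a function that takes the
// previously-parsed base as its left operand, producing the formal type.
interface Add { left: Expr; right: Expr; }
type Expr = Token | Add;

const num = tok('NUM');

const suffix = seq(
    [tok('+'), num],
    ([, right]: [Token, Token]) => (left: Expr): Expr => ({ left, right })
);

// Fold the suffixes over the base to rebuild the left-associative shape.
const expr: ParseFunc<Expr> = seq(
    [num, repeat(suffix, '*')],
    ([base, suffixes]: [Token, ((left: Expr) => Expr)[]]) =>
        suffixes.reduce((acc, s) => s(acc), base as Expr)
);
```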
I am officially closing-closing this issue because I now have a near-perfect parsing system with extremely minimal overhead. I love it.
While the "new" (relatively speaking) parser logic is far superior to the original logic, it is still overly verbose in places, particularly where complex expansions may result in sequences containing choices containing sequences and so on. The logic is perfectly capable of supporting these kinds of expansions, but it currently requires separate non-terminals for each "branch" in parser logic. This results in having a great number of non-terminals that ultimately end up getting dropped during reduction.
If we can allow sub-parses to be specified in-place as opposed to a separate accept() function, then we won't require a massive list of intermediate non-terminals, and thus we won't require a massive list of CST nodes.
Thinking further, we may not even need CST node classes at all, as now they are used only as containers to pass from the parser to the reducer functions.
This is what I envision as the process:
TODO: it may be useful to keep the reduce() and accept() logic split up, so we can implement this in two layers for testing. Adjust this stuff to account for that.