Closed — jchitel closed this issue 6 years ago
Ok, so we have three types of expansions:

- **Sequential** expansions specify a list of expansions to be accepted in sequence, where each expansion is applied as a property to some CST node object.
- **Choice** expansions specify a list of expansions to be tried; the first one that matches is applied to a choice property of some CST node object.
- **Left-recursive** expansions specify two lists: a list of base expansions, which are accepted the same way as a basic choice expansion, and a list of suffixes, which are accepted any number of times after one base has been accepted. Each accepted suffix takes the previous base, inserts it into itself at a specified property, and sets itself as the next base. On the first iteration in which all suffixes fail, the previous base is returned as the result.
We want to add two primary changes:
One major difference here is that we will no longer be using containers for choices. What we should do is go back to the original idea of specifying property names on choices so that we don't need to apply a bunch of union types to a single property. But some choices should still share the same property name.
So, here are the results of each type:
Some rules:
So what does this mean for typing:
I added a branch with the initial work for this: new-parser-logic.
It is not worth doing this right now. It would be better to wait until we can properly type everything, because with the current state of TypeScript the whole thing comes out looking very obnoxious.
Additionally, something like this will be very well-suited for a DSL. This is documented here.
So, I ended up finding a completely different way to refactor the parser logic. There isn't a whole lot that can be improved from this point, aside from:
However, for now, this is done.
For some reason I keep coming back to this crap, because I'm never satisfied.
Well, guess what, I'm finally satisfied. The new parser logic is dead simple, and based around building composed functions.
All parse functions (`type ParseFunc<T> = (parser: Parser) => ParseResult<T>`, where `interface ParseResult<T> { result: Optional<T>; remaining: Parser }`) take a parser object and return a result object containing the parsed object and a parser holding the remaining tokens in the token stream. This is a pure functional interface.
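Spelled out concretely, with an assumed minimal `Parser` representation (an immutable token list plus a cursor position; the issue doesn't show the real internals), the core interface looks something like this. The `pure` helper is purely illustrative, just to show the functional shape:

```typescript
// Hypothetical minimal model; the real Parser internals aren't shown in the issue.
type Optional<T> = T | null;

interface Token { type: string; image: string; }

// Assumed representation: an immutable token list plus a cursor position.
interface Parser { tokens: Token[]; pos: number; }

interface ParseResult<T> { result: Optional<T>; remaining: Parser; }

type ParseFunc<T> = (parser: Parser) => ParseResult<T>;

// A trivial ParseFunc illustrating the pure interface: it consumes nothing
// and always succeeds with a fixed value.
function pure<T>(value: T): ParseFunc<T> {
    return parser => ({ result: value, remaining: parser });
}
```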
The base function is a token parsing function, `tok(string | TokenType)`, which takes a token image or type and returns a `ParseFunc` that parses tokens of that image or type. If the parse is unsuccessful, `null` is returned as the result, and the parser will contain internal state about the token that was found instead.
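A sketch of `tok` under an assumed minimal model (token list plus cursor, with `TokenType` simplified to a plain string); the bookkeeping about the offending token on failure is omitted here:

```typescript
type Optional<T> = T | null;
interface Token { type: string; image: string; }
interface Parser { tokens: Token[]; pos: number; }
interface ParseResult<T> { result: Optional<T>; remaining: Parser; }
type ParseFunc<T> = (parser: Parser) => ParseResult<T>;

// tok() accepts either a token image or a token type and matches one token.
function tok(imageOrType: string): ParseFunc<Token> {
    return parser => {
        const next = parser.tokens[parser.pos];
        if (next && (next.image === imageOrType || next.type === imageOrType)) {
            return { result: next, remaining: { ...parser, pos: parser.pos + 1 } };
        }
        // Failure: null result; the real version also records the token found instead.
        return { result: null, remaining: parser };
    };
}
```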
The next function for building `ParseFunc`s is `seq<...T, R>(...ParseFunc<...T>, ([...T]) => R)`. It takes a list of parse functions, each of which returns a result of its own type. Each is parsed in sequence, and if all succeed, the results are collected into an array and passed into the specified function, which converts the array into some result object. The result is a parse function that returns the aggregated result object. If any of the specified functions fails, the whole sequence fails, returning the failed parser for the item that failed.
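One way `seq` might be sketched, using a mapped tuple type to approximate the `...T` signature (the `Parser` model and the `tok` helper used in the example are assumptions for illustration):

```typescript
type Optional<T> = T | null;
interface Token { type: string; image: string; }
interface Parser { tokens: Token[]; pos: number; }
interface ParseResult<T> { result: Optional<T>; remaining: Parser; }
type ParseFunc<T> = (parser: Parser) => ParseResult<T>;

const tok = (imageOrType: string): ParseFunc<Token> => parser => {
    const next = parser.tokens[parser.pos];
    return next && (next.image === imageOrType || next.type === imageOrType)
        ? { result: next, remaining: { ...parser, pos: parser.pos + 1 } }
        : { result: null, remaining: parser };
};

// seq() runs each function in order; on full success the collected results
// are passed through the transform, otherwise the first failure is returned.
function seq<T extends unknown[], R>(
    funcs: { [K in keyof T]: ParseFunc<T[K]> },
    toResult: (results: T) => R
): ParseFunc<R> {
    return parser => {
        // the mapped tuple type is an ordinary array at runtime
        const fns = funcs as unknown as readonly ParseFunc<unknown>[];
        const results: unknown[] = [];
        let current = parser;
        for (const f of fns) {
            const { result, remaining } = f(current);
            if (result === null) return { result: null, remaining }; // whole sequence fails
            results.push(result);
            current = remaining;
        }
        return { result: toResult(results as T), remaining: current };
    };
}
```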
The next function is `select<T>(...ParseFunc<T>)`, which parses one of several parse functions. The order is important, because each function will be attempted one by one, and the first one to succeed will return. The resulting type is a union of all of the result types of the specified functions. If all of them fail, a failure is returned containing the next token in the stream.
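A sketch of `select`, assuming a minimal token-list `Parser` model (the `tok` helper in the example is likewise an illustrative assumption):

```typescript
type Optional<T> = T | null;
interface Token { type: string; image: string; }
interface Parser { tokens: Token[]; pos: number; }
interface ParseResult<T> { result: Optional<T>; remaining: Parser; }
type ParseFunc<T> = (parser: Parser) => ParseResult<T>;

const tok = (imageOrType: string): ParseFunc<Token> => parser => {
    const next = parser.tokens[parser.pos];
    return next && (next.image === imageOrType || next.type === imageOrType)
        ? { result: next, remaining: { ...parser, pos: parser.pos + 1 } }
        : { result: null, remaining: parser };
};

// select() tries each alternative in order; the first success wins.
function select<T>(...funcs: ParseFunc<T>[]): ParseFunc<T> {
    return parser => {
        for (const f of funcs) {
            const res = f(parser);
            if (res.result !== null) return res; // first success wins
        }
        // all failed: the real version reports the next token in the stream
        return { result: null, remaining: parser };
    };
}
```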
The last two functions are "qualifier" functions, the simpler of which is `optional<T>(ParseFunc<T>): ParseFunc<Optional<T>>`. All `ParseResult`s contain an optional result (in the instance of a failure), but this builder treats failures as successes, so the actual resulting value may be null. It's fairly simple: it calls the specified function and converts any failure to a success, returning null as the result.
The last function is `repeat<T>(ParseFunc<T>, '+' | '*', ?ParseFunc<{}>): ParseFunc<T[]>`. This is probably the fanciest builder function. It repeatedly parses the specified function, collecting each successful result into an array, and returns the array when it hits the first failure. The additional parameters add sugar on top of this logic. If `'*'` is passed for the second parameter, this specifies normal behavior; if `'+'` is passed instead, then at least one repetition is required. The last parameter is a separator function. If it is specified, the separator is required between each repetition. These separators are omitted from the result because they are never used (yet, but changing the logic to include them would be trivial).
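The two qualifier functions might be sketched as follows (again assuming a minimal token-list `Parser` model; separators are consumed but dropped, matching the description):

```typescript
type Optional<T> = T | null;
interface Token { type: string; image: string; }
interface Parser { tokens: Token[]; pos: number; }
interface ParseResult<T> { result: Optional<T>; remaining: Parser; }
type ParseFunc<T> = (parser: Parser) => ParseResult<T>;

const tok = (imageOrType: string): ParseFunc<Token> => parser => {
    const next = parser.tokens[parser.pos];
    return next && (next.image === imageOrType || next.type === imageOrType)
        ? { result: next, remaining: { ...parser, pos: parser.pos + 1 } }
        : { result: null, remaining: parser };
};

// optional() turns a failure into a success with a null result, consuming nothing.
function optional<T>(func: ParseFunc<T>): ParseFunc<Optional<T>> {
    return parser => {
        const { result, remaining } = func(parser);
        return result === null
            ? { result: null, remaining: parser }
            : { result, remaining };
    };
}

// repeat() parses func until the first failure; '+' requires at least one
// repetition, and sep (if given) is required between repetitions but dropped.
function repeat<T>(func: ParseFunc<T>, mode: '+' | '*', sep?: ParseFunc<{}>): ParseFunc<T[]> {
    return parser => {
        const results: T[] = [];
        let current = parser;
        while (true) {
            let attempt = current;
            if (results.length > 0 && sep) {
                const sepRes = sep(attempt);
                if (sepRes.result === null) break; // no separator: stop repeating
                attempt = sepRes.remaining;
            }
            const { result, remaining } = func(attempt);
            if (result === null) break;
            results.push(result);
            current = remaining; // separators are consumed but not collected
        }
        return mode === '+' && results.length === 0
            ? { result: null, remaining: parser }
            : { result: results, remaining: current };
    };
}
```

One consequence of this shape worth noting: because failure is signaled by a `null` result, an `optional` success whose value is `null` is indistinguishable from a failure by the result alone.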
Because the parse functions can have arbitrary structure and return arbitrary values, this system is extremely expressive and flexible. For example, one thing that is not directly included is left-recursion. This was solved simply by specifying two types for each left-recursive syntax type: one formal type that is ultimately returned, and an intermediate suffix type, which ignores the first item in the sequence and adds a transform function that takes the first item after parsing and uses it to produce the formal type. The suffix is what is actually parsed, and a combination of `seq`, `select`, and `repeat` can be used to form the full left-recursive type.
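To make the suffix technique concrete, here is a sketch built on a hypothetical grammar `Expr ::= Expr '+' NUM | NUM` (the grammar, the `Add` type, and the `Parser` model are all illustrative assumptions, not from the issue):

```typescript
// Minimal self-contained model (assumed: an immutable token list plus a cursor).
type Optional<T> = T | null;
interface Token { type: string; image: string; }
interface Parser { tokens: Token[]; pos: number; }
interface ParseResult<T> { result: Optional<T>; remaining: Parser; }
type ParseFunc<T> = (parser: Parser) => ParseResult<T>;

const tok = (imageOrType: string): ParseFunc<Token> => parser => {
    const next = parser.tokens[parser.pos];
    return next && (next.image === imageOrType || next.type === imageOrType)
        ? { result: next, remaining: { ...parser, pos: parser.pos + 1 } }
        : { result: null, remaining: parser };
};

function seq<T extends unknown[], R>(
    funcs: { [K in keyof T]: ParseFunc<T[K]> },
    toResult: (results: T) => R
): ParseFunc<R> {
    return parser => {
        const fns = funcs as unknown as readonly ParseFunc<unknown>[];
        const results: unknown[] = [];
        let current = parser;
        for (const f of fns) {
            const { result, remaining } = f(current);
            if (result === null) return { result: null, remaining };
            results.push(result);
            current = remaining;
        }
        return { result: toResult(results as T), remaining: current };
    };
}

function repeat<T>(func: ParseFunc<T>, mode: '+' | '*', sep?: ParseFunc<{}>): ParseFunc<T[]> {
    return parser => {
        const results: T[] = [];
        let current = parser;
        while (true) {
            let attempt = current;
            if (results.length > 0 && sep) {
                const sepRes = sep(attempt);
                if (sepRes.result === null) break;
                attempt = sepRes.remaining;
            }
            const { result, remaining } = func(attempt);
            if (result === null) break;
            results.push(result);
            current = remaining;
        }
        return mode === '+' && results.length === 0
            ? { result: null, remaining: parser }
            : { result: results, remaining: current };
    };
}

// Left-recursive grammar (hypothetical): Expr ::= Expr '+' NUM | NUM
// Rewritten without left-recursion as:   Expr ::= NUM ('+' NUM)*
// Each suffix parses "'+' NUM" and yields a function that takes the
// previously-parsed base as its left operand, producing the formal type.
interface Add { left: Expr; right: Expr; }
type Expr = Token | Add;

const num = tok('NUM');

const suffix = seq(
    [tok('+'), num],
    ([, right]: [Token, Token]) => (left: Expr): Expr => ({ left, right })
);

// Fold the suffixes over the base to rebuild the left-associative shape.
const expr: ParseFunc<Expr> = seq(
    [num, repeat(suffix, '*')],
    ([base, suffixes]: [Token, ((left: Expr) => Expr)[]]) =>
        suffixes.reduce((acc, s) => s(acc), base as Expr)
);
```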
I am officially closing-closing this issue because I now have a near-perfect parsing system with extremely minimal overhead. I love it.
While the "new" (relatively speaking) parser logic is far superior to the original logic, it is still overly verbose in places, particularly where complex expansions may result in sequences containing choices containing sequences and so on. The logic is perfectly capable of supporting these kinds of expansions, but it currently requires separate non-terminals for each "branch" in parser logic. This results in having a great number of non-terminals that ultimately end up getting dropped during reduction.
If we can allow sub-parses to be specified in-place as opposed to a separate accept() function, then we won't require a massive list of intermediate non-terminals, and thus we won't require a massive list of CST nodes.
Thinking further, we may not even need CST node classes at all, as now they are used only as containers to pass from the parser to the reducer functions.
This is what I envision as the process:
TODO: it may be useful to keep the reduce() and accept() logic split up, so we can implement this in two layers for testing. Adjust this stuff to account for that.