uben0 commented 9 months ago

Simple grammar = maintainable grammar

This grammar has to be maintainable. To be maintainable, the grammar has to be simple.

One way to keep things simple is to do things naively. But a naive Typst grammar has terrible performances.

For instance, when inserting optional white spaces and comments every where it is allowed, it increases the parser size tremendously and clutters the grammar's readability.

We can use the builtin extra feature (making the given rules optional everywhere) but this is not possible as some rules have to be directly sequential, like a function call (foo() = foo ~ ( ), no space nor comment allowed.

External scanner

The external scanner can be used to implement a work around the previous problem. Moreover, to handle indentation, there is no other way than using the external scanner.

To sum up, the use of the external lexer is mandatory to enable some functionalities and prevent the parser from becoming too big by implementing shortcuts.

Full external scanning

In the current situation, some tokens are parsed by the external scanner and some tokens are parsed by the builtin scanner. To improve the situation, it would be better to only use the external scanner. But current external scanner has become too complex and prevents a full migration.

The new external scanner will have to be, at least partially, generated because some token, like builtin functions in Typst, have to be parsed through a switch subdivision (like ref and range):

switch (next) {
    case 'r':
    switch (next) {
        case 'e':
        ...
        break;
        case 'a':
        ...
        break;
    }
    break;
}

This can be achieved by producing a finite state machine from a given regex list, but it has to take in account that some token are not always valid.

Finite state machine with conflicting acceptation

Because in Typst there are different modes (text, code and math), some tokens are in collision. For instance, the input f would be accepted as plain text in text mode, as an identifier in code mode and as a letter in math mode.

If an automatic generation of a finite state machine is implemented, it has to take in account those collisions.

I will experiment and look for solutions. If anyone have an idea on how to do it and what kind of challenges does it lead to, you can share it here.

Ziqi-Yang commented 9 months ago

Don't know other editors' support, but tree sitter in Emacs can support embedding other tree sitter languages in one tree sitter language document. Dividing Typst tree sitter parser into three parsers may be a direction to consider. In this situation, performance of editing partly depends on the editor side (language embedding implementation).

uben0 commented 9 months ago

I thought about this and it could be a great way to simplify the parser, but there are several problems:

To have embedded code, a parent language has to be able to identify the character range of the embedded source by itself. With Typst, it would mean to determine in text mode, where an embedded code expression ends. To do so, you have to parse correctly the code. So it comes back to the original problem.
Most editors don't offer the same support, it gives to the main source, to the embedded language. For instance, in Helix, embedded code is highlighted, but text can't be selected following the hierarchy.

But I do believe that it is possible to identify in which mode we are when the external scanner is invoked. In my understanding, there are 5 different situations:

Error recovery. Which is when the parsing failed (invalid syntax) and Tree Sitter tries to recover by accepting any token and resume parsing.
Expression mode (or code mode). When an expression is valid next, any token starting an expression is accepted.
Math expression mode (or math mode). When a math expression is valid next, any token starting a math expression is accepted.
Text mode. When plain text is accepted.
Specific token. When a token, only valid in a single and specific situation has to be parsed.

And this is not even a perfect split, as some tokens are valid over different modes, and some, not even valid from time to time in a single mode. For instance, the comments and white spaces are always valid. And the =, as heading markup in text mode, is only valid at the beginning of a line or a container like strong or emphasis markups.

uben0 commented 9 months ago

I am thinking about writing the scanner in Zig, because Zig code can be transpiled to C code, and as the external scanner interacts with the tree sitter parser, only through C abi, it should work. I'll make some test to check if it works and if it doesn't increase the size of the final parser, because generated C code from Zig contains lots of boilerplate code, which I hope gets optimized away.

uben0 / tree-sitter-typst

Full rewritte of external scanner #19

Simple grammar = maintainable grammar

External scanner

Full external scanning

Finite state machine with conflicting acceptation