uben0 opened 9 months ago
I don't know about other editors' support, but tree-sitter in Emacs can embed other tree-sitter languages inside a single document. Splitting the Typst tree-sitter parser into three parsers may be a direction to consider. In that situation, editing performance would partly depend on the editor side (the language-embedding implementation).
I thought about this and it could be a great way to simplify the parser, but there are several problems:
But I do believe that it is possible to identify which mode we are in when the external scanner is invoked. In my understanding, there are 5 different situations:
And this is not even a perfect split, as some tokens are valid across several modes, and some are not even always valid within a single mode. For instance, comments and white space are always valid. And `=`, as heading markup in text mode, is only valid at the beginning of a line or of a container like the strong or emphasis markup.
I am thinking about writing the scanner in Zig, because Zig code can be transpiled to C, and since the external scanner interacts with the tree-sitter parser only through the C ABI, it should work. I'll run some tests to check that it works and that it doesn't increase the size of the final parser, because the C code generated from Zig contains lots of boilerplate, which I hope gets optimized away.
Simple grammar = maintainable grammar
This grammar has to be maintainable. To be maintainable, the grammar has to be simple.
One way to keep things simple is to do things naively. But a naive Typst grammar performs terribly. For instance, inserting optional white space and comments everywhere they are allowed increases the parser size tremendously and clutters the grammar's readability.
We can use the builtin `extras` feature (making the given rules optional everywhere), but this is not possible as some rules have to be directly sequential, like a function call (`foo()` = `foo` ~ `(`), where no space nor comment is allowed.

External scanner
The external scanner can be used to implement a workaround for the previous problem. Moreover, to handle indentation, there is no way around using the external scanner.
To sum up, using the external scanner is mandatory both to enable some functionalities and, by implementing shortcuts, to keep the parser from becoming too big.
Full external scanning
In the current situation, some tokens are parsed by the external scanner and some by the builtin lexer. It would be better to use only the external scanner. But the current external scanner has become too complex and prevents a full migration.
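For reference, a tree-sitter external scanner plugs into the parser through five exported C functions. Below is a stub sketch of that interface, with a minimal stand-in for `TSLexer` reduced to a few fields (an assumption for self-containment; real builds include the official `tree_sitter/parser.h` instead):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Minimal stand-in for the TSLexer interface from tree_sitter/parser.h,
 * reduced to the fields a scanner typically touches. */
typedef struct TSLexer TSLexer;
struct TSLexer {
    int32_t lookahead;                      /* next code point */
    uint16_t result_symbol;                 /* token to emit */
    void (*advance)(TSLexer *, bool skip);  /* consume or skip a char */
};

/* The five entry points every external scanner exports.
 * Names follow the tree-sitter convention tree_sitter_<lang>_... */
void *tree_sitter_typst_external_scanner_create(void) {
    return NULL; /* no per-parse state needed in this sketch */
}

void tree_sitter_typst_external_scanner_destroy(void *payload) {
    (void)payload;
}

unsigned tree_sitter_typst_external_scanner_serialize(void *payload, char *buf) {
    (void)payload; (void)buf;
    return 0; /* nothing to save across incremental edits here */
}

void tree_sitter_typst_external_scanner_deserialize(void *payload,
                                                    const char *buf, unsigned n) {
    (void)payload; (void)buf; (void)n;
}

bool tree_sitter_typst_external_scanner_scan(void *payload, TSLexer *lexer,
                                             const bool *valid_symbols) {
    (void)payload; (void)lexer; (void)valid_symbols;
    return false; /* a full scanner would dispatch on valid_symbols here */
}
```

The `valid_symbols` array passed to `scan` is what makes "full external scanning" plausible: the parser tells the scanner which external tokens are currently expected.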
The new external scanner will have to be, at least partially, generated, because some tokens, like builtin functions in Typst, have to be parsed through a switch subdivision (like `ref` and `range`).

This can be achieved by producing a finite state machine from a given regex list, but it has to take into account that some tokens are not always valid.
Finite state machine with conflicting acceptance
Because Typst has different modes (text, code and math), some tokens collide. For instance, the input `f` would be accepted as plain text in text mode, as an identifier in code mode, and as a letter in math mode.

If automatic generation of a finite state machine is implemented, it has to take those collisions into account.
I will experiment and look for solutions. If anyone has an idea of how to do it and what kind of challenges it leads to, please share it here.