Extensible parsing

stefnotch commented 1 year ago

There are entirely too many areas of mathematics for a semantics-aware editor to be able to parse them all. So parsers should be extensible at runtime. This means defining rules on the TypeScript side and passing them to Rust.

As in, someone else should be able to define a few custom additions to the default grammar and add them. And it should be possible to add multiple different custom grammars at the same time.
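A minimal sketch of what "multiple custom grammars at the same time" could look like on the TypeScript side. Everything here is hypothetical: `TokenRule`, `GrammarExtension` and `GrammarRegistry` are made-up names, and handing the rules to Rust as JSON is just one assumed option.

```typescript
// Hypothetical shapes for this sketch, not taken from the actual codebase.
interface TokenRule {
  name: string;    // e.g. "hex-number"
  pattern: string; // regular expression source, so the rule stays serializable
}

interface GrammarExtension {
  id: string;      // e.g. "core" or "programming"
  rules: TokenRule[];
}

class GrammarRegistry {
  private extensions: GrammarExtension[] = [];

  // Adds one grammar at runtime; ids must be unique.
  register(extension: GrammarExtension): void {
    if (this.extensions.some((existing) => existing.id === extension.id)) {
      throw new Error('Grammar "' + extension.id + '" is already registered');
    }
    this.extensions.push(extension);
  }

  // Flattens all grammars into one rule list, in registration order.
  // Rule names get the grammar id as a prefix, so two grammars can both
  // define a rule called "number" without clashing.
  combinedRules(): TokenRule[] {
    const combined: TokenRule[] = [];
    for (const extension of this.extensions) {
      for (const rule of extension.rules) {
        combined.push({ name: extension.id + "/" + rule.name, pattern: rule.pattern });
      }
    }
    return combined;
  }

  // Assumption: the rules cross over to the Rust side as plain JSON.
  serialize(): string {
    return JSON.stringify(this.combinedRules());
  }
}

const grammarRegistry = new GrammarRegistry();
grammarRegistry.register({ id: "core", rules: [{ name: "number", pattern: "[0-9]+" }] });
grammarRegistry.register({
  id: "programming",
  rules: [{ name: "hex-number", pattern: "0x[0-9a-fA-F]+" }],
});
const combinedRuleNames = grammarRegistry.combinedRules().map((rule) => rule.name);
// combinedRuleNames is ["core/number", "programming/hex-number"]
```

Keeping rules as plain data (instead of closures) is what makes it possible to pass them across the TypeScript/Rust boundary at all.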

And library imports should always have a semantic version. (Plus, encourage writing upgrade scripts.) Otherwise if I pick a bad alias for a symbol, it'll be in there forever.

Good parser libraries:

(One big requirement is that I need to parse non-text objects. And the second big requirement is that I want to dynamically build parsers at runtime.)

stefnotch commented 1 year ago

Examples

Sometimes the set builder notation with the | makes sense
Sometimes 0xe is a hexadecimal number, other times it's 0 \cdot x \cdot e

stefnotch commented 1 year ago

The heck is my poor editor supposed to do when a user types in something like this?

Picking a(bc) is not ideal, since the same rule might imply that, after $p:=7$ $i:=\sqrt{-1}$ $pi:=3.14$, the definition $f(x) = pi$ would be parsed as $p \cdot i$

=> Simply stopping as soon as I encounter something valid isn't a good approach.

When the user starts typing $a$ and then accepts the $ab$ autocomplete, it's pretty darn obvious that they meant "ab". The tricky bit is that a user won't really use the autocomplete when typing $ax$ (meaning $a \cdot x$).

=> Autocomplete should generate something that the parser can definitely and confidently parse

stefnotch commented 1 year ago

Or we could apply the parsers in reverse order.

That way, if $ln$ is a predefined rule (logarithm) and the user then defines $l := 1$ and $n := 3$, then... $ln$ would be parsed as $l \cdot n$

We would, however, need to pick a syntax for saying "wait, I actually mean the logarithm ln"

=> Applying parsers in reverse order seems like a legit strategy
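A rough sketch of the reverse-order idea, assuming rules are kept in definition order and the first rule that matches wins. `SymbolRule`, `nextSymbolNewestFirst` and `splitNewestFirst` are hypothetical names, and symbol names are assumed to be non-empty.

```typescript
// Hypothetical rule shape; "meaning" is only a label for this sketch.
interface SymbolRule {
  name: string;
  meaning: string;
}

// Rules are stored in definition order: predefined rules first, user
// definitions afterwards. Walking the list backwards tries the newest first.
function nextSymbolNewestFirst(
  rules: SymbolRule[],
  input: string,
  position: number
): SymbolRule | null {
  for (let index = rules.length - 1; index >= 0; index--) {
    const rule = rules[index];
    if (input.slice(position, position + rule.name.length) === rule.name) {
      return rule; // first hit wins
    }
  }
  return null;
}

// Splits the whole input into symbol names, newest rules first.
function splitNewestFirst(rules: SymbolRule[], input: string): string[] {
  const names: string[] = [];
  let position = 0;
  while (position < input.length) {
    const rule = nextSymbolNewestFirst(rules, input, position);
    if (rule === null) {
      throw new Error("No rule matches at position " + position);
    }
    names.push(rule.name);
    position += rule.name.length;
  }
  return names;
}

const predefinedRules: SymbolRule[] = [{ name: "ln", meaning: "natural logarithm" }];
const rulesWithUserDefinitions: SymbolRule[] = predefinedRules.concat([
  { name: "l", meaning: "user variable" },
  { name: "n", meaning: "user variable" },
]);

const splitBeforeDefinitions = splitNewestFirst(predefinedRules, "ln");
// ["ln"]
const splitAfterDefinitions = splitNewestFirst(rulesWithUserDefinitions, "ln");
// ["l", "n"]
```

Note that this sketch has no escape hatch yet: once $l$ and $n$ are defined, there is no way to get the logarithm back without extra syntax.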

stefnotch commented 1 year ago

The whole "writing two variables next to each other" deal happens quite frequently in mathematics

$ax$ being a times x
$ab$ being the concatenation of a and b (see: formal grammars and stuff)
$AB$ with A and B being vectors

And it can make sense to look ahead as much as possible

$0x$ should be 0 times x
$0x3f$ should be the hexadecimal 0x3F, as a single token

When we're at any point in the parsing stage, we want to figure out what the next token for our Pratt parser is. This also means that we have to parse symbol tokens, and operator tokens, and the two can't overlap.

=> Simply taking the "next token", separated by spaces or a multiplication sign, seems like a non-ideal strategy.

=> A greedy parser (which runs all parsers and takes the longest result) is a valid strategy for finding the next token

If we use a greedy parsing approach, and the user types $a := 3$ $bc:=1$ $ab:=7$ and then writes $abc$ we'll parse $ab \cdot c$, with c being an unknown variable. Here, the greedy parser clearly fails.

=> The greedy parser won't always be able to figure out the user's intent.
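Here's a small sketch of the "run all parsers and take the longest result" strategy, including the case where it fails. The `TokenParser` interface and the helper functions are hypothetical, and characters that no parser matches are simply emitted as `unknown` tokens (an assumption of this sketch).

```typescript
// A token parser reports how many characters it can consume at `position`
// (0 means "no match").
interface TokenParser {
  kind: string;
  match(input: string, position: number): number;
}

interface Token {
  kind: string;
  text: string;
}

function regexParser(kind: string, source: string): TokenParser {
  const regex = new RegExp("^(?:" + source + ")");
  return {
    kind,
    match(input, position) {
      const result = regex.exec(input.slice(position));
      return result === null ? 0 : result[0].length;
    },
  };
}

function literalParser(kind: string, literal: string): TokenParser {
  return {
    kind,
    match(input, position) {
      return input.slice(position, position + literal.length) === literal ? literal.length : 0;
    },
  };
}

// Runs every parser and keeps the longest match; ties go to the earlier parser.
function nextTokenGreedy(parsers: TokenParser[], input: string, position: number): Token | null {
  let best: Token | null = null;
  for (const parser of parsers) {
    const length = parser.match(input, position);
    if (length > 0 && (best === null || length > best.text.length)) {
      best = { kind: parser.kind, text: input.slice(position, position + length) };
    }
  }
  return best;
}

function tokenizeGreedy(parsers: TokenParser[], input: string): Token[] {
  const tokens: Token[] = [];
  let position = 0;
  while (position < input.length) {
    const token = nextTokenGreedy(parsers, input, position);
    if (token === null) {
      tokens.push({ kind: "unknown", text: input.charAt(position) });
      position += 1;
    } else {
      tokens.push(token);
      position += token.text.length;
    }
  }
  return tokens;
}

const numberParsers = [
  regexParser("number", "[0-9]+"),
  regexParser("hex", "0x[0-9a-fA-F]+"),
  regexParser("variable", "[a-z]"),
];
const zeroTimesX = tokenizeGreedy(numberParsers, "0x").map((token) => token.text);
// ["0", "x"]
const hexLiteral = tokenizeGreedy(numberParsers, "0x3f").map((token) => token.text);
// ["0x3f"]

// The failure case: a, bc and ab are defined, c is not.
const definedSymbols = ["a", "bc", "ab"].map((name) => literalParser("symbol", name));
const greedyAbc = tokenizeGreedy(definedSymbols, "abc");
// [{ kind: "symbol", text: "ab" }, { kind: "unknown", text: "c" }]
```

The last example is the interesting one: $a \cdot bc$ would have used only defined symbols, but the greedy choice of $ab$ rules it out.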

stefnotch commented 1 year ago

Multi letter names can happen naturally when

you have a function, like $arccos$
you are an engineer and have a lookup table, like for material strength
you have units, like $cm$

stefnotch commented 1 year ago

Assignment makes parsing harder

$x := 1$ $i := 2$ $\displaystyle \sum_{xi := 0} xi ^2$

stefnotch commented 1 year ago

Here's a compromise option: The parser is nice and straightforward, such that when someone writes $abc$, then that's definitely "abc" as one variable.

However, the editor is also smart™ and will suggest autocomplete results. Like suggesting $ab \cdot c$ when you type in $abc$. And you can accept those autocomplete results with tab or enter.
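One way the editor could come up with such suggestions is to enumerate every way of splitting the typed text into already defined symbols. This is only a sketch: `segmentations` and `multiplicationSuggestions` are hypothetical names, and unlike the earlier example it assumes $c$ is defined too.

```typescript
// Returns every way to split `typed` into a sequence of known symbol names.
function segmentations(knownSymbols: string[], typed: string): string[][] {
  if (typed.length === 0) {
    return [[]]; // exactly one way to split the empty string
  }
  const results: string[][] = [];
  for (const symbol of knownSymbols) {
    if (symbol.length > 0 && typed.slice(0, symbol.length) === symbol) {
      for (const rest of segmentations(knownSymbols, typed.slice(symbol.length))) {
        results.push([symbol].concat(rest));
      }
    }
  }
  return results;
}

// Turns each split with more than one part into an explicit product, which
// the straightforward parser can then parse without any guessing.
function multiplicationSuggestions(knownSymbols: string[], typed: string): string[] {
  return segmentations(knownSymbols, typed)
    .filter((parts) => parts.length > 1)
    .map((parts) => parts.join(" \\cdot "));
}

const symbolsInScope = ["a", "bc", "ab", "c"];
const suggestionsForAbc = multiplicationSuggestions(symbolsInScope, "abc");
// ["a \\cdot bc", "ab \\cdot c"]
```

The parser itself stays simple ($abc$ is one variable), and all the ambiguity handling lives in the suggestion list, where the user makes the final call.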

And other things like parsing derivatives, or parsing hexadecimal numbers, can be done with the simple "try out parsers until one works" approach. Or we could whip out the "try all parsers and take the longest result" option.

Other approaches to the issue above might also be viable; this warrants further investigation.

stefnotch commented 1 year ago

For getting a single token (like a lim sup token when you have $\limsup_{x \to 0}$), we could whip out multiple approaches

Point a fully blown parser at it
Regular expressions
- https://swtch.com/~rsc/regexp/regexp1.html
- https://rcoh.me/posts/no-magic-regular-expressions-part-2/
- https://rcoh.me/posts/no-magic-regular-expressions-part-3/
- With maybe some tricks to efficiently execute lots of user-added something | something else (or) patterns, like https://stackoverflow.com/questions/14676833/combining-deterministic-finite-automata
- Check if they conflict
  - https://stackoverflow.com/questions/1849447/how-can-you-detect-if-two-regular-expressions-overlap-in-the-strings-they-can-ma
  - https://stackoverflow.com/questions/21662041/how-to-find-the-intersection-of-two-nfa
  - https://github.com/maciejhirsz/logos#token-disambiguation
Exact matches and an "anything" for the bottom part of a $lim$
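The "check if they conflict" part boils down to asking whether two automata accept a common string, which can be done by walking the product of the two automata. Below is a sketch under heavy assumptions: the `Dfa` shape is made up, the automata are written out by hand (compiling a regular expression into a DFA is not shown), and a missing transition means "reject".

```typescript
// Hypothetical DFA representation: transitions[state][symbol] = next state.
interface Dfa {
  start: number;
  accepting: number[];
  transitions: { [symbol: string]: number }[];
}

// Breadth-first search over pairs of states (the product automaton).
// Returns true if some string is accepted by both automata.
function dfasOverlap(first: Dfa, second: Dfa): boolean {
  const seen: { [key: string]: boolean } = {};
  const queue: [number, number][] = [[first.start, second.start]];
  seen[first.start + "," + second.start] = true;

  while (queue.length > 0) {
    const [stateA, stateB] = queue.shift() as [number, number];
    if (first.accepting.indexOf(stateA) !== -1 && second.accepting.indexOf(stateB) !== -1) {
      return true;
    }
    const edgesA = first.transitions[stateA];
    const edgesB = second.transitions[stateB];
    for (const symbol of Object.keys(edgesA)) {
      // Only symbols that both automata can read lead to a product state.
      if (Object.prototype.hasOwnProperty.call(edgesB, symbol)) {
        const next: [number, number] = [edgesA[symbol], edgesB[symbol]];
        const key = next[0] + "," + next[1];
        if (!seen[key]) {
          seen[key] = true;
          queue.push(next);
        }
      }
    }
  }
  return false;
}

// Hand-built automata over the alphabet {a, b}.
const literalAb: Dfa = {
  // accepts exactly "ab"
  start: 0,
  accepting: [2],
  transitions: [{ a: 1 }, { b: 2 }, {}],
};
const startsWithA: Dfa = {
  // accepts a(a|b)*
  start: 0,
  accepting: [1],
  transitions: [{ a: 1 }, { a: 1, b: 1 }],
};
const onlyBs: Dfa = {
  // accepts b+
  start: 0,
  accepting: [1],
  transitions: [{ b: 1 }, { b: 1 }],
};

const abConflictsWithPrefixRule = dfasOverlap(literalAb, startsWithA); // true, both accept "ab"
const abConflictsWithBs = dfasOverlap(literalAb, onlyBs); // false
```

A conflict found this way doesn't have to be an error. It could also just trigger a priority rule, similar in spirit to the token disambiguation that logos describes.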

stefnotch commented 1 year ago

Parsing chains of < and <= doesn't need special treatment, since it's fine if I parse $1 < x < 3$ as $1 < (x < 3)$. And then I treat $x < 3$ as a "domain restriction".
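In a Pratt parser, getting $1 < (x < 3)$ just means making the comparison operators right-associative, i.e. giving them a right binding power that is lower than the left one. The following is a toy sketch with made-up binding powers and a made-up tokenizer. It assumes well-formed input and has no error handling.

```typescript
type Expr =
  | { kind: "atom"; text: string }
  | { kind: "binary"; operator: string; left: Expr; right: Expr };

// [left, right] binding powers. right < left makes an operator right-associative.
// The concrete numbers are assumptions of this sketch.
const bindingPowers: { [operator: string]: [number, number] } = {
  "<": [2, 1],
  "<=": [2, 1],
  "+": [3, 4], // left-associative, binds tighter than comparisons
};

// Toy tokenizer: numbers, single-letter variables, and the operators above.
function tokenizeComparison(input: string): string[] {
  return input.match(/<=|[<+]|[0-9]+|[a-z]/g) || [];
}

function parseExpression(tokens: string[], state: { position: number }, minPower: number): Expr {
  let left: Expr = { kind: "atom", text: tokens[state.position++] };
  while (state.position < tokens.length) {
    const operator = tokens[state.position];
    const powers = bindingPowers[operator];
    if (powers === undefined || powers[0] < minPower) {
      break;
    }
    state.position++;
    const right = parseExpression(tokens, state, powers[1]);
    left = { kind: "binary", operator, left, right };
  }
  return left;
}

function showExpression(expr: Expr): string {
  if (expr.kind === "atom") {
    return expr.text;
  }
  return "(" + showExpression(expr.left) + " " + expr.operator + " " + showExpression(expr.right) + ")";
}

function parseAndShow(input: string): string {
  return showExpression(parseExpression(tokenizeComparison(input), { position: 0 }, 0));
}

const chainedComparison = parseAndShow("1 < x < 3");
// "(1 < (x < 3))"
const chainedAddition = parseAndShow("1 + 2 + 3");
// "((1 + 2) + 3)"
```

The inner `(x < 3)` node is then the thing that would be interpreted as the domain restriction.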

stefnotch commented 1 year ago

Changing syntax should not be expensive later on. It should just be "change an imported library version and you're done".

stefnotch commented 1 year ago

Regarding plugins #48

stefnotch commented 1 year ago

Here's a bit of interesting info regarding Pratt parsing: https://github.com/zesterer/chumsky/pull/515#issuecomment-1718173403

stefnotch / aftermath-editor

Extensible parsing #41