treeman / tree-sitter-djot

MIT License
30 stars 5 forks source link

Maybe we need to parse all token by external scanner #54

Open wrvsrx opened 1 month ago

wrvsrx commented 1 month ago

As we know, although it's easy to parse djot by hand, djot is highly context sensitive, so parsing djot using tree-sitter is a hardwork. Thanks for your great work!

I notice this tree-sitter parses block level token by external scanner. However, it seems that inline level token also need to be parsed using external scanner. A lot of problems come from that (https://github.com/treeman/tree-sitter-djot/issues/38).

To achive that, we need to maintain two stacks, one block level and one inline level and parse block token and inline token using external scanner.

  1. At eol, we need to determine how many block we need to close and what block we need to start by peeking next line.
  2. At each block start, we need to determine whether we should start to parse inline or start another block.
  3. When we parsing inline, we need to push all possible inline openers into the inline stack and change them to actural token when we parse a potential closer.
treeman commented 1 month ago

Yeah, I've reluctantly reached the same conclusion. When I started writing this grammar, naive as I was, I wanted to try to keep things as simple as possible but the various inline rules in grammar.js are already too complex, and I don't really see a way to solve the issues properly without handling them in an external scanner.

This will be a big rewrite and I think we need a look-ahead for the whole block. For example, a single _ shouldn't start an emphasis and we can only know if there's a second _ in the block if we parse until the end-of-block (paragraph end, div end, etc). It should be possible but there are quite a lot of tricky edge-cases here so it'll take some time to implement.

Still, this is the direction we need to go.

black-desk commented 1 month ago

But tree-sitter will serialize your scanner on every AST node even in the error AST. Looking ahead too much and keeping a large context might lead to a performance issue.

treeman commented 1 month ago

I admit that I'm out of my depth and I don't know how you're supposed to design a tree-sitter grammar.

My idea though wasn't to use error tokens at all and instead scan ahead to know what token to output.

For example if a * is found:

  1. Scan the entire inline context.
  2. If another * is found, we keep track of that and continue scanning.
  3. At the end of the context, we can decide if the first * starts a strong element, or if that should be left to another shorter * pair that comes after.

If we implement this naively then we'll end up scanning the same inline context multiple times, but I hope we can cache that somehow.

This does mean we need to be able to scan the entire inline context (including links etc) so we can decide if we should count a * or not. Maybe this means completely forgoing any definitions in grammar.js... The unfortunate thing about an external scanner is that it sort of side-steps tree-sitters backtracking algorithm, but I'm really not sure how that interaction works.

Maybe there's a better way to design this, I don't know.

black-desk commented 1 month ago

I’m not entirely sure what you’re referring to. Here, I’ll describe the potential issues I can think of more clearly:

Tree-sitter frequently serializes and deserializes the scanner during the use of the external scanner. You can see the specifics here: https://github.com/tree-sitter/tree-sitter/blob/7583d394b41a9ab26ae6d52d51c8965f627996cd/lib/src/parser.c#L531-L576

To achieve incremental parsing, Tree-sitter stores the scanner’s state every time the scanner produces a token.

This means we can’t actually store a lot of context information in the scanner, otherwise, the space overhead during the parsing process would be significant.

Considering the amount of context information retained by the inline parser in the djot.js implementation (https://github.com/jgm/djot.js/blob/00d273e2c2d72b0eba7f463da0acf5ba2f9e4843/src/inline.ts#L527), I think looking ahead an entire block is unrealistic.

treeman commented 1 month ago

Yeah I agree that we need to hold the scanner context to a minimum.

I guess there's a trade-off here in repeatedly scanning the same information and storing it in the context. Scanning the entire inline context inside the external scanner doesn't seem to incur any serialization costs, that happens when the external scanner produces a token.

If we can't look ahead then I don't see how we can decide between a longer and a shorter emphasis element?

wrvsrx commented 4 weeks ago

tree-sitter-markdown use two level tree-sitter-parser, one for block and another for inline. Can we refer to it?

treeman commented 4 weeks ago

We can't simply refer to it as the markdown inline rules are different from the djot inline rules.

Perhaps its a good idea to split the parser into two, the same as markdown though? It would make some things easier and maybe more performant, with the drawback of the user having to install two parsers instead of one.

treeman commented 3 weeks ago

I've tried a few things:

  1. Separate the scanner into two grammars, like tree-sitter-markdown
  2. Move some inline tokens to an external scanner

There's still a lot of things left to implement, but the initial feeling is good. I've been able to solve token precedence issues such as parsing *not strong *strong* and with an external scanner we'll be able to ignore elements containing only delimiter tokens such as ___.

This will be a breaking change though and everyone would now need to update their existing queries and use two grammars.

wrvsrx commented 2 weeks ago

I guess there's a trade-off here in repeatedly scanning the same information and storing it in the context. Scanning the entire inline context inside the external scanner doesn't seem to incur any serialization costs, that happens when the external scanner produces a token.

Maybe it's a more reasonable option to scanning the same information repeatedly? We just need to maintain a block stack and a inline stack. We only emit inline token when we can determine what the next token is.

treeman commented 2 weeks ago

Maybe it's a more reasonable option to scanning the same information repeatedly? We just need to maintain a block stack and a inline stack. We only emit inline token when we can determine what the next token is.

Yeah, I think I have a prototype of a solution I think should work. I'm only storing the inline stack (with some extra data on it) and we're scanning a little bit more (until the next ending element).

This shouldn't be a performance bottleneck as we unfortunately have to branch out every time a beginning token is found via the treesitter LR conflict that should be a lot more expensive than the small extra scan we're doing.

treeman commented 1 week ago

I've created a merge request from the split branch that implements the split parser.

I haven't created any "package files" in the root directory like there is in tree-sitter-markdown as everything just lives in their separate folders right now.

If anyone can test it out that would be fantastic.