markdown_inline can't reparse old trees (web-tree-sitter)

tree-sitter-grammars / tree-sitter-markdown

Markdown grammar for tree-sitter

MIT License

375 stars 45 forks source link

markdown_inline can't reparse old trees (web-tree-sitter) #92

Closed savetheclocktower closed 4 months ago

savetheclocktower commented 1 year ago

I can reproduce this one in the tree-sitter playground on the latest master. From the root:

cd tree-sitter-markdown-inline
tree-sitter build-wasm . && tree-sitter playground

Wait for the playground to open. Type some text with markdown formatting. The tree will not reflect the content.

Select everything you've written, cut it, then paste it again. The tree should be correct.

Change anything — even add a space at the very end of the line — and the tree will start to incorrectly shed nodes.

I was able to reproduce this just by pasting and editing the following text:

Lorem ipsum _dolor_ sit amet, consecuetur _adipiscing_ elit. Lorem ipsum _dolor_ sit amet, consecuetur _adipiscing_ elit. Lorem ipsum _dolor_ sit amet, consecuetur _adipiscing_ elit.

After pasting that content, add a space character anywhere, and you'll notice that the first emphasis node disappears from the tree. The next change (even just deleting the new space) will cause the second emphasis node to disappear, and so on.

This is a symptom of what I observed on my own with the web-tree-sitter bindings: if you call parser.parse(someText, null), everything parses fine; if you call parser.parse(someText, existingTree), nodes get lost. I have no idea if this occurs under the rust or node bindings, but I've worked with at least a dozen tree-sitter parsers under web-tree-sitter and this is the first time I've observed any symptom like this.

MDeiml commented 1 year ago

Hm, I'll have to investigate, but on first glance this seems like a problem with tree-sitter itself and not this specific parser. For "proper" parsers it shouldn't be possible to get this behaviour, as the parser does not actually define the incremental parsing process. Instead incremental parsing is handled implicitly by the tree-sitter library.

The question is of course if this parser is "proper" since it does have a substantial amount of handwritten C code, but I'll have a look at that. Otherwise it might well be that this specific parser hits some edge-case bug in tree-sitter, since it is at least one of the most complex tree-sitter parsers out there and it uses a lot of "conflicts" which are probably the most problematic functionality of tree-sitter.

clason commented 1 year ago

Isn't this exactly https://github.com/tree-sitter/tree-sitter/issues/2145?

savetheclocktower commented 1 year ago

Isn't this exactly tree-sitter/tree-sitter#2145?

No, not that I can tell. The symptom described in that issue is that the offsets don't change. (At any rate, I can't reproduce that symptom on the official tree-sitter playground when following the exact instructions described by the issue author.)

The symptom in this issue is that the tree does re-parse, and (as far as I can tell) offsets do change, but something about the change seems to cause the markdown_inline parser to drop, or fail to recognize, nodes — seemingly exactly one node with each re-parse.

savetheclocktower commented 1 year ago

More about this odd behavior: if you change all the _s in my example to *s, everything appears to work just fine. There is something about underscores in particular that seems to be triggering this.

Switch all the emphasized phrases to be double-emphasized with **s and they're stable. But change them to __s and they won't show up until you select-all, copy, and paste. The first incremental change thereafter switches the strong_emphasis to emphasis (and makes the parser fail to notice one pair of underscores); the next change removes emphasis altogether.

MDeiml commented 1 year ago

Hm I see. However underscores and stars do work slightly different according to the spec, so this might be entirely sensible. But it gives a hint to what the problem might be.

Specifically the parsing of emphasis delimiters depends on what comes after them (which is not a big problem) and what comes before them. The latter is handled in a quite hacky way, so thinking about it now I'm guessing that that's what breaks the incremental parsing, but I'll have to test this.

savetheclocktower commented 5 months ago

Faced with user reports of crashes on ikatyang/tree-sitter-markdown (which we've been using for Markdown support in Pulsar), I'm revisiting this issue, as it's the main blocker for us to adopt this parser for our Markdown needs.

nvim-treesitter#4716 was strangely reassuring that the root cause may have nothing to do with web-tree-sitter. I can still reproduce this issue in a Pulsar environment, but — frustratingly — can't reproduce it in the playground anymore.

If this issue doesn't manifest in a test corpus or when testing tree-sitter-markdown_inline in isolation, could it be related to the integration between the markdown and markdown_inline parsers? We've been treating the latter as a simple injection into the former's inline nodes the way Neovim seems to be doing it; that's what you've envisioned, right?

At some point I'm going to open up scanner.c and just try some stuff to see if it helps. @MDeiml, you referred to some hackiness regarding the parsing of underscores; if you could elaborate on that a bit, I'd be happy to attempt a different strategy to see if it fixes this bug.

MDeiml commented 5 months ago

If this issue doesn't manifest in a test corpus or when testing tree-sitter-markdown_inline in isolation, could it be related to the integration between the markdown and markdown_inline parsers? We've been treating the latter as a simple injection into the former's inline nodes the way Neovim seems to be doing it; that's what you've envisioned, right?

Exactly. For complete correctness the named children of the inline node have to be excluded from the injection (this is for things like quotes, where the inline content is more a "visual block" than a byte range in the file).

At some point I'm going to open up scanner.c and just try some stuff to see if it helps. @MDeiml, you referred to some hackiness regarding the parsing of underscores; if you could elaborate on that a bit, I'd be happy to attempt a different strategy to see if it fixes this bug.

I was referring to the use of the LAST_TOKEN_... tokens, that (in conjunction how they are used in grammar.js) tell scanner.c information about the last token or symbol that was consumed. These are both used for underscores https://github.com/MDeiml/tree-sitter-markdown/blob/28aa3baef73bd458d053b613b8bd10fd102b4405/tree-sitter-markdown-inline/src/scanner.c#L271

and stars https://github.com/MDeiml/tree-sitter-markdown/blob/28aa3baef73bd458d053b613b8bd10fd102b4405/tree-sitter-markdown-inline/src/scanner.c#L147

The spec https://spec.commonmark.org/0.30/#emphasis-and-strong-emphasis says that whether something can be a delimiter for emphasis depends also on the tokens that come before it. This is pretty unintuitive to implement with tree sitter (which does not let you look back at input that is already parsed). The rules for underscores are more complicated than stars. I guess on OK temporary fix would be to make both behave like stars. The problem with that is that inputs like lower_snake_case will lead to emphasis being parsed.

savetheclocktower commented 5 months ago

The problem with that is that inputs like lower_snake_case will lead to emphasis being parsed.

Gotcha. I'm not about to make a change that violates the CommonMark spec, because internal underscores definitely should not signify emphasis.

When I get time, I'm going to see if I can map printf calls (or something like them) to console.log calls inside of Pulsar. I need to understand better exactly where the current logic is falling down — because I'm looking at parse_underscore and it seems like it would do the job fine. I'll let you know what I find.

savetheclocktower commented 5 months ago

OK, I dropped a bunch of logging into parse_underscore. Here's what's going on.

Sample text:

Lorem ipsum _foo_ dolor *sit* amet.

When the editor loads, I select the whole buffer, cut it, clear the log, and paste it just to have a consistent baseline. I can see it enter parse_underscore a total of three times: once for column 12, once for column 16, and once again for column 16. I notice that there are state differences between the two passes:

parse_underscore, column 12: consumes the underscore, identifies it as a one-underscore run.
- Identifies the underscore as left-flanking; right_flanking is false.
- next_symbol_whitespace is false
- next_symbol_punctuation is false
- valid_symbols[LAST_TOKEN_PUNCTUATION] is false
- valid_symbols[LAST_TOKEN_WHITESPACE] is true
- Hits the “this underscore opens a word” path
parse_underscore, column 16: consumes the underscore, identifies it as a one-underscore run.
- Identifies the underscore as right-flanking.
- next_symbol_whitespace is true
- next_symbol_punctuation is false
- valid_symbols[LAST_TOKEN_PUNCTUATION] is false
- valid_symbols[LAST_TOKEN_WHITESPACE] is false
- Hits the “this underscore ends a word” path
parse_underscore, column 16: consumes the underscore, identifies it as a one-underscore run.
- Identifies the underscore as right-flanking.
- next_symbol_whitespace is true
- next_symbol_punctuation is false
- valid_symbols[LAST_TOKEN_PUNCTUATION] is false
- valid_symbols[LAST_TOKEN_WHITESPACE] is false
- DOES NOT hit either the “opens” or “closes” code path because:
- valid_symbols[EMPHASIS_OPEN_UNDERSCORE] is true
- valid_symbols[EMPHASIS_CLOSE_UNDERSCORE] is false
- right_flanking is true
- left_flanking is false
- valid_symbols[LAST_TOKEN_PUNCTUATION] is false
- valid_symbols[LAST_TOKEN_WHITESPACE] is false
- next_symbol_punctuation is false

At this point, I put the cursor in the middle of dolor — between the o and l — and insert a space. I lose formatting on the _foo_. The *sit* is unaffected. Here's what got logged:

parse_underscore, column 12: consumes the underscore, identifies it as a one-underscore run.
- Identifies the underscore as left-flanking, but this time right_flanking is also true.
- next_symbol_whitespace is false
- next_symbol_punctuation is false
- valid_symbols[LAST_TOKEN_PUNCTUATION] is false
- valid_symbols[LAST_TOKEN_WHITESPACE] is false
- DOES NOT hit either the “opens” or “closes” code path because:
- valid_symbols[EMPHASIS_OPEN_UNDERSCORE] is true
- valid_symbols[EMPHASIS_CLOSE_UNDERSCORE] is false
- right_flanking is true
- left_flanking is true
- valid_symbols[LAST_TOKEN_PUNCTUATION] is false
- valid_symbols[LAST_TOKEN_WHITESPACE] is false
- next_symbol_punctuation is false
- Hence the fact that right_flanking is also true prevented it from recognizing the underscore as an opening emphasis delimiter. But right_flanking is true on this pass because valid_symbols[LAST_TOKEN_WHITESPACE] is incorrectly false.
parse_underscore, column 16: consumes the underscore, identifies it as a one-underscore run.
- Because there's no pending opening emphasis, the conclusion is foregone: valid_symbols[EMPHASIS_CLOSE_UNDERSCORE] is false, because the parser didn't identify the opening emphasis delimiter correctly.

So it seems like the LAST_TOKEN_WHITESPACE token is sometimes incorrectly false. I'll keep digging.

Just for further information:

Adding a space before the _foo_ didn't cause the bug.
Adding a space at the very end of the line didn't cause the bug.

savetheclocktower commented 5 months ago

I added logging for each entry into the scan function; it logs the current column along with whether valid_symbols[LAST_TOKEN_WHITESPACE] is true.

On initial parse, scan is called at these positions:
- Column 0: LAST_TOKEN_WHITESPACE is true
- Column 5: LAST_TOKEN_WHITESPACE is false
- Column 6: LAST_TOKEN_WHITESPACE is true
- Column 11: LAST_TOKEN_WHITESPACE is false
- Column 12: LAST_TOKEN_WHITESPACE is true
Then we hit parse_underscore (the outcome of which is logged above)
Then scan again:
- Column 13: LAST_TOKEN_WHITESPACE is false
- Column 13: LAST_TOKEN_WHITESPACE is false (yes, twice in a row)
- Column 16: LAST_TOKEN_WHITESPACE is false
We hit parse_underscore again (the outcome of which is logged above)
Then scan again:
- Column 16: LAST_TOKEN_WHITESPACE is false (yes, twice in a row)
We hit parse_underscore the third time (the outcome of which is logged above)
Then scan again at columns 17, 18, 23, 24, 24 (again), 25, 28, 28 (again), 29, 30, 34. These don't seem relevant, but I'll go deeper on them later if needed.

I insert the space between do and lor as before. On the second pass:

scan called at these positions:
- Column 6: LAST_TOKEN_WHITESPACE is false
- Column 12: LAST_TOKEN_WHITESPACE is false
parse_underscore (the outcome of which is logged above)
scan at:
- Column 13: LAST_TOKEN_WHITESPACE is false
- Column 16: LAST_TOKEN_WHITESPACE is false
parse_underscore (the outcome of which is logged above)
scan again at all the previous columns, except where content has shifted. The first time LAST_TOKEN_WHITESPACE is registered as a valid symbol is at column 18, just after the second underscore.

I thought it might have something to do with the includedRanges argument on the re-parse, but it doesn't. We ask for a parse on the entire line initially — (1, 0) - (1, 35). And after the change, we ask for a parse on the entire line — (1, 0) - (1, 36).

So I think it's got to be the fact that the second call provides the previous tree to Tree-sitter so it can reuse the parts that didn't change. And when it does that, Tree-sitter can resume parsing at a point where this parser can't count on whatever trick it's using with the LAST_TOKEN_ symbols.

savetheclocktower commented 5 months ago

OK. So! I gather that the hacky part is that, wherever whitespace has just been, you include optional($._last_token_whitespace)) so that the parser flips valid_symbols[LAST_TOKEN_WHITESPACE] to true and you can act upon it in the scanner.

Starting with simple fixes and progressing to harder ones, I'll just brainstorm:

I know zero-width tokens are a bit tricky, but I've used one before to enforce lack of whitespace. Can the parser actually generate some sort of $._whitespace_boundary token that the _emphasis_underscore rule(s) can enforce at the beginning of a sequence?
Can the presence or absence of whitespace boundaries instead be serialized to scanner state?
Can a word_like_this (or a _word_like_this or word_like_this_ be intentionally detected by an anonymous rule so that it can be designated as ordinary text, preventing it from being considered as an emphasis range? In other words, can we weed out the false positives in the grammar DSL so that the scanner doesn't have to consider them?

MDeiml commented 5 months ago

Thanks for your hard work, this already tells us that the LAST_TOKEN_ symbols are definitely the problem, but not quite why they don't work.

I know zero-width tokens are a bit tricky, but I've used one before to enforce lack of whitespace. Can the parser actually generate some sort of $._whitespace_boundary token that the _emphasis_underscore rule(s) can enforce at the beginning of a sequence?

This could work, but then the emphasis would need to also consume the whitespace (which is not a problem, as long as the whitespace cannot be part of something else, which I think is true)

Can the presence or absence of whitespace boundaries instead be serialized to scanner state?

Maybe, but this would probably be unstable. The scanner could save the column of the last whitespace it encountered, but I think the way columns are defined that could interfere with the block structure of the markdown document. Even worse if the markdown is embedded in some other document. So I would rather avoid it.

Can a word_like_this (or a _word_like_this or word_like_this_ be intentionally detected by an anonymous rule so that it can be designated as ordinary text, preventing it from being considered as an emphasis range? In other words, can we weed out the false positives in the grammar DSL so that the scanner doesn't have to consider them?

This one is very hard to reason about. It would probably mean to change the rule for words to be something like [a-zA-Z]+(_+[a-zA-Z]+)* (a bit hard to explain why that might work). Also make underscores behave like stars otherwise. Very easy to implement, but might work correctly or might not.

I'll try to implement 3) and check if it destroys any tests. If no perfect. If yes there could be some quite complicated fix with 1). 2) as I said I would rather avoid if possible.

savetheclocktower commented 5 months ago

Thanks for your hard work, this already tells us that the LAST_TOKEN_ symbols are definitely the problem, but not quite why they don't work.

It would help if I understood how Tree-sitter reuses an existing tree. For instance, the point at which it begins re-parsing after a change is odd to me. I'd understand if it re-parsed starting at the beginning of the buffer line, or at the point of the change, but it seems to re-parse at the exact point that causes problems for your technique.

I'd imagine that it was reparsing earlier, and that column 6 was the first place it needed to consult the external scanner; but then we know that's not true from the results we got on the first pass.

Maybe, but this would probably be unstable. The scanner could save the column of the last whitespace it encountered, but I think the way columns are defined that could interfere with the block structure of the markdown document. Even worse if the markdown is embedded in some other document. So I would rather avoid it.

What I'm trying to encode here is not a specific column number but just the boolean valid_symbols[LAST_TOKEN_WHITESPACE] result. Again, this is an area I'm fuzzy on, but in theory the state information is stored in the tree, so it would be available to Tree-sitter when reusing a previous tree. In theory, that state information could no longer be valid if the parser re-used it later, but I've got to think that if Tree-sitter couldn't detect those cases itself, it would not be safe for it to re-use old trees.

I'm going to ask about this in the Tree-sitter Discord tonight or tomorrow and see what I learn. I might also look at some first-party parsers to see how their scanners use state.

This one is very hard to reason about. It would probably mean to change the rule for words to be something like [a-zA-Z]+(_+[a-zA-Z]+)* (a bit hard to explain why that might work). Also make underscores behave like stars otherwise. Very easy to implement, but might work correctly or might not.

I might already be talking myself out of this. Consider _lorem ipsum dolor_sit amet_: GitHub renders that as _lorem ipsum dolorsit amet, so false positives can't be detected on the level of individual words.

MDeiml commented 5 months ago

Can you checkout the branch https://github.com/MDeiml/tree-sitter-markdown/tree/stable-underscores

I implemented the changes and it only broke one (unimportant) test, which already doesn't work for stars. That would be ok for me if it means the parser is more stable.

MDeiml commented 5 months ago

It would help if I understood how Tree-sitter reuses an existing tree. For instance, the point at which it begins re-parsing after a change is odd to me. I'd understand if it re-parsed starting at the beginning of the buffer line, or at the point of the change, but it seems to re-parse at the exact point that causes problems for your technique.

I'd imagine that it was reparsing earlier, and that column 6 was the first place it needed to consult the external scanner; but then we know that's not true from the results we got on the first pass.

I would guess it also has to do with "conflicts" which I use for emphasis. There's also the tree-sitter command line tool, which sometimes helps with debugging. But it's not great for this "interactive" part of the parsers.

savetheclocktower commented 5 months ago

I implemented the changes and it only broke one (unimportant) test, which already doesn't work for stars. That would be ok for me if it means the parser is more stable.

It seems to work fine. Amazingly, it passes the _lorem ipsum dolor_sit amet_ test I described above. I'll use it for a few days and let you know if I run into anything else.

MDeiml commented 4 months ago

@savetheclocktower I guess this means everything works fine?

savetheclocktower commented 4 months ago

@savetheclocktower I guess this means everything works fine?

It's been great so far. I've been using this in Pulsar as a hastily-thrown-together alternative Markdown package and I've noticed only two things:

One time I noticed that italics weren't being marked as italics. If I recall correctly, it was a multi-word span flanked with underscores. Making small buffer edits didn't seem to fix it. I cut and re-pasted the entire line and the italics appeared. I immediately tried to reproduce the problem and could not. It hasn't happened since.
About a week ago I was writing a long Markdown document and my editor crashed. I'm kicking myself for instinctively closing the window before going to the console so I could see what the error message was.

I've put the alternative Markdown package up on our package repository and invited people to use it who were having problems with the built-in Markdown package. The common theme among all those folks seemed to be that they were all using non-US and non-UK keyboards, and the other Markdown parser crashed in certain special input scenarios. The author of that comment mentioned just this morning that they're giving the alternative package a try.

Anyway, your change is a positive one, as far as I'm concerned. Even if I could reproduce issue 1, it's still an improvement over the status quo; and even if issue 2 was caused by the parser, I doubt it's related to this specific problem with italics. So I'm happy for you to close this issue. Thanks!