Closed savetheclocktower closed 4 months ago
Hm, I'll have to investigate, but on first glance this seems like a problem with tree-sitter itself and not this specific parser. For "proper" parsers it shouldn't be possible to get this behaviour, as the parser does not actually define the incremental parsing process. Instead incremental parsing is handled implicitly by the tree-sitter library.
The question is of course if this parser is "proper" since it does have a substantial amount of handwritten C code, but I'll have a look at that. Otherwise it might well be that this specific parser hits some edge-case bug in tree-sitter, since it is at least one of the most complex tree-sitter parsers out there and it uses a lot of "conflicts" which are probably the most problematic functionality of tree-sitter.
Isn't this exactly https://github.com/tree-sitter/tree-sitter/issues/2145?
Isn't this exactly tree-sitter/tree-sitter#2145?
No, not that I can tell. The symptom described in that issue is that the offsets don't change. (At any rate, I can't reproduce that symptom on the official tree-sitter playground when following the exact instructions described by the issue author.)
The symptom in this issue is that the tree does re-parse, and (as far as I can tell) offsets do change, but something about the change seems to cause the markdown_inline parser to drop, or fail to recognize, nodes — seemingly exactly one node with each re-parse.
More about this odd behavior: if you change all the _
s in my example to *
s, everything appears to work just fine. There is something about underscores in particular that seems to be triggering this.
Switch all the emphasized phrases to be double-emphasized with **
s and they're stable. But change them to __
s and they won't show up until you select-all, copy, and paste. The first incremental change thereafter switches the strong_emphasis
to emphasis
(and makes the parser fail to notice one pair of underscores); the next change removes emphasis
altogether.
Hm I see. However underscores and stars do work slightly different according to the spec, so this might be entirely sensible. But it gives a hint to what the problem might be.
Specifically the parsing of emphasis delimiters depends on what comes after them (which is not a big problem) and what comes before them. The latter is handled in a quite hacky way, so thinking about it now I'm guessing that that's what breaks the incremental parsing, but I'll have to test this.
Faced with user reports of crashes on ikatyang/tree-sitter-markdown
(which we've been using for Markdown support in Pulsar), I'm revisiting this issue, as it's the main blocker for us to adopt this parser for our Markdown needs.
nvim-treesitter
#4716 was strangely reassuring that the root cause may have nothing to do with web-tree-sitter
. I can still reproduce this issue in a Pulsar environment, but — frustratingly — can't reproduce it in the playground anymore.
If this issue doesn't manifest in a test corpus or when testing tree-sitter-markdown_inline
in isolation, could it be related to the integration between the markdown
and markdown_inline
parsers? We've been treating the latter as a simple injection into the former's inline
nodes the way Neovim seems to be doing it; that's what you've envisioned, right?
At some point I'm going to open up scanner.c
and just try some stuff to see if it helps. @MDeiml, you referred to some hackiness regarding the parsing of underscores; if you could elaborate on that a bit, I'd be happy to attempt a different strategy to see if it fixes this bug.
If this issue doesn't manifest in a test corpus or when testing
tree-sitter-markdown_inline
in isolation, could it be related to the integration between themarkdown
andmarkdown_inline
parsers? We've been treating the latter as a simple injection into the former'sinline
nodes the way Neovim seems to be doing it; that's what you've envisioned, right?
Exactly. For complete correctness the named children of the inline
node have to be excluded from the injection (this is for things like quotes, where the inline content is more a "visual block" than a byte range in the file).
At some point I'm going to open up
scanner.c
and just try some stuff to see if it helps. @MDeiml, you referred to some hackiness regarding the parsing of underscores; if you could elaborate on that a bit, I'd be happy to attempt a different strategy to see if it fixes this bug.
I was referring to the use of the LAST_TOKEN_...
tokens, that (in conjunction how they are used in grammar.js
) tell scanner.c
information about the last token or symbol that was consumed. These are both used for underscores https://github.com/MDeiml/tree-sitter-markdown/blob/28aa3baef73bd458d053b613b8bd10fd102b4405/tree-sitter-markdown-inline/src/scanner.c#L271
The spec https://spec.commonmark.org/0.30/#emphasis-and-strong-emphasis says that whether something can be a delimiter for emphasis depends also on the tokens that come before it. This is pretty unintuitive to implement with tree sitter (which does not let you look back at input that is already parsed). The rules for underscores are more complicated than stars. I guess on OK temporary fix would be to make both behave like stars. The problem with that is that inputs like lower_snake_case
will lead to emphasis being parsed.
The problem with that is that inputs like
lower_snake_case
will lead to emphasis being parsed.
Gotcha. I'm not about to make a change that violates the CommonMark spec, because internal underscores definitely should not signify emphasis.
When I get time, I'm going to see if I can map printf
calls (or something like them) to console.log
calls inside of Pulsar. I need to understand better exactly where the current logic is falling down — because I'm looking at parse_underscore
and it seems like it would do the job fine. I'll let you know what I find.
OK, I dropped a bunch of logging into parse_underscore
. Here's what's going on.
Sample text:
Lorem ipsum _foo_ dolor *sit* amet.
When the editor loads, I select the whole buffer, cut it, clear the log, and paste it just to have a consistent baseline. I can see it enter parse_underscore
a total of three times: once for column 12, once for column 16, and once again for column 16. I notice that there are state differences between the two passes:
parse_underscore
, column 12: consumes the underscore, identifies it as a one-underscore run.
right_flanking
is false
.next_symbol_whitespace
is false
next_symbol_punctuation
is false
valid_symbols[LAST_TOKEN_PUNCTUATION]
is false
valid_symbols[LAST_TOKEN_WHITESPACE]
is true
parse_underscore
, column 16: consumes the underscore, identifies it as a one-underscore run.
next_symbol_whitespace
is true
next_symbol_punctuation
is false
valid_symbols[LAST_TOKEN_PUNCTUATION]
is false
valid_symbols[LAST_TOKEN_WHITESPACE]
is false
parse_underscore
, column 16: consumes the underscore, identifies it as a one-underscore run.
next_symbol_whitespace
is true
next_symbol_punctuation
is false
valid_symbols[LAST_TOKEN_PUNCTUATION]
is false
valid_symbols[LAST_TOKEN_WHITESPACE]
is false
valid_symbols[EMPHASIS_OPEN_UNDERSCORE]
is true
valid_symbols[EMPHASIS_CLOSE_UNDERSCORE]
is false
right_flanking
is true
left_flanking
is false
valid_symbols[LAST_TOKEN_PUNCTUATION]
is false
valid_symbols[LAST_TOKEN_WHITESPACE]
is false
next_symbol_punctuation
is false
At this point, I put the cursor in the middle of dolor
— between the o
and l
— and insert a space. I lose formatting on the _foo_
. The *sit*
is unaffected. Here's what got logged:
parse_underscore
, column 12: consumes the underscore, identifies it as a one-underscore run.
right_flanking
is also true
.next_symbol_whitespace
is false
next_symbol_punctuation
is false
valid_symbols[LAST_TOKEN_PUNCTUATION]
is false
valid_symbols[LAST_TOKEN_WHITESPACE]
is false
valid_symbols[EMPHASIS_OPEN_UNDERSCORE]
is true
valid_symbols[EMPHASIS_CLOSE_UNDERSCORE]
is false
right_flanking
is true
left_flanking
is true
valid_symbols[LAST_TOKEN_PUNCTUATION]
is false
valid_symbols[LAST_TOKEN_WHITESPACE]
is false
next_symbol_punctuation
is false
right_flanking
is also true
prevented it from recognizing the underscore as an opening emphasis delimiter. But right_flanking
is true
on this pass because valid_symbols[LAST_TOKEN_WHITESPACE]
is incorrectly false
.parse_underscore
, column 16: consumes the underscore, identifies it as a one-underscore run.
valid_symbols[EMPHASIS_CLOSE_UNDERSCORE]
is false
, because the parser didn't identify the opening emphasis delimiter correctly.So it seems like the LAST_TOKEN_WHITESPACE
token is sometimes incorrectly false
. I'll keep digging.
Just for further information:
_foo_
didn't cause the bug.I added logging for each entry into the scan
function; it logs the current column along with whether valid_symbols[LAST_TOKEN_WHITESPACE]
is true.
scan
is called at these positions:
LAST_TOKEN_WHITESPACE
is true
LAST_TOKEN_WHITESPACE
is false
LAST_TOKEN_WHITESPACE
is true
LAST_TOKEN_WHITESPACE
is false
LAST_TOKEN_WHITESPACE
is true
parse_underscore
(the outcome of which is logged above)scan
again:
LAST_TOKEN_WHITESPACE
is false
LAST_TOKEN_WHITESPACE
is false
(yes, twice in a row)LAST_TOKEN_WHITESPACE
is false
parse_underscore
again (the outcome of which is logged above)scan
again:
LAST_TOKEN_WHITESPACE
is false
(yes, twice in a row)parse_underscore
the third time (the outcome of which is logged above)scan
again at columns 17, 18, 23, 24, 24 (again), 25, 28, 28 (again), 29, 30, 34. These don't seem relevant, but I'll go deeper on them later if needed.I insert the space between do
and lor
as before. On the second pass:
scan
called at these positions:
LAST_TOKEN_WHITESPACE
is false
LAST_TOKEN_WHITESPACE
is false
parse_underscore
(the outcome of which is logged above)scan
at:
LAST_TOKEN_WHITESPACE
is false
LAST_TOKEN_WHITESPACE
is false
parse_underscore
(the outcome of which is logged above)scan
again at all the previous columns, except where content has shifted. The first time LAST_TOKEN_WHITESPACE
is registered as a valid symbol is at column 18, just after the second underscore.I thought it might have something to do with the includedRanges
argument on the re-parse, but it doesn't. We ask for a parse on the entire line initially — (1, 0) - (1, 35). And after the change, we ask for a parse on the entire line — (1, 0) - (1, 36).
So I think it's got to be the fact that the second call provides the previous tree to Tree-sitter so it can reuse the parts that didn't change. And when it does that, Tree-sitter can resume parsing at a point where this parser can't count on whatever trick it's using with the LAST_TOKEN_
symbols.
OK. So! I gather that the hacky part is that, wherever whitespace has just been, you include optional($._last_token_whitespace))
so that the parser flips valid_symbols[LAST_TOKEN_WHITESPACE]
to true
and you can act upon it in the scanner.
Starting with simple fixes and progressing to harder ones, I'll just brainstorm:
$._whitespace_boundary
token that the _emphasis_underscore
rule(s) can enforce at the beginning of a sequence?word_like_this
(or a _word_like_this
or word_like_this_
be intentionally detected by an anonymous rule so that it can be designated as ordinary text, preventing it from being considered as an emphasis range? In other words, can we weed out the false positives in the grammar DSL so that the scanner doesn't have to consider them?Thanks for your hard work, this already tells us that the LAST_TOKEN_
symbols are definitely the problem, but not quite why they don't work.
- I know zero-width tokens are a bit tricky, but I've used one before to enforce lack of whitespace. Can the parser actually generate some sort of
$._whitespace_boundary
token that the_emphasis_underscore
rule(s) can enforce at the beginning of a sequence?
This could work, but then the emphasis would need to also consume the whitespace (which is not a problem, as long as the whitespace cannot be part of something else, which I think is true)
- Can the presence or absence of whitespace boundaries instead be serialized to scanner state?
Maybe, but this would probably be unstable. The scanner could save the column of the last whitespace it encountered, but I think the way columns are defined that could interfere with the block structure of the markdown document. Even worse if the markdown is embedded in some other document. So I would rather avoid it.
- Can a
word_like_this
(or a_word_like_this
orword_like_this_
be intentionally detected by an anonymous rule so that it can be designated as ordinary text, preventing it from being considered as an emphasis range? In other words, can we weed out the false positives in the grammar DSL so that the scanner doesn't have to consider them?
This one is very hard to reason about. It would probably mean to change the rule for words to be something like [a-zA-Z]+(_+[a-zA-Z]+)*
(a bit hard to explain why that might work). Also make underscores behave like stars otherwise. Very easy to implement, but might work correctly or might not.
I'll try to implement 3) and check if it destroys any tests. If no perfect. If yes there could be some quite complicated fix with 1). 2) as I said I would rather avoid if possible.
Thanks for your hard work, this already tells us that the
LAST_TOKEN_
symbols are definitely the problem, but not quite why they don't work.
It would help if I understood how Tree-sitter reuses an existing tree. For instance, the point at which it begins re-parsing after a change is odd to me. I'd understand if it re-parsed starting at the beginning of the buffer line, or at the point of the change, but it seems to re-parse at the exact point that causes problems for your technique.
I'd imagine that it was reparsing earlier, and that column 6 was the first place it needed to consult the external scanner; but then we know that's not true from the results we got on the first pass.
Maybe, but this would probably be unstable. The scanner could save the column of the last whitespace it encountered, but I think the way columns are defined that could interfere with the block structure of the markdown document. Even worse if the markdown is embedded in some other document. So I would rather avoid it.
What I'm trying to encode here is not a specific column number but just the boolean valid_symbols[LAST_TOKEN_WHITESPACE]
result. Again, this is an area I'm fuzzy on, but in theory the state information is stored in the tree, so it would be available to Tree-sitter when reusing a previous tree. In theory, that state information could no longer be valid if the parser re-used it later, but I've got to think that if Tree-sitter couldn't detect those cases itself, it would not be safe for it to re-use old trees.
I'm going to ask about this in the Tree-sitter Discord tonight or tomorrow and see what I learn. I might also look at some first-party parsers to see how their scanners use state.
This one is very hard to reason about. It would probably mean to change the rule for words to be something like
[a-zA-Z]+(_+[a-zA-Z]+)*
(a bit hard to explain why that might work). Also make underscores behave like stars otherwise. Very easy to implement, but might work correctly or might not.
I might already be talking myself out of this. Consider _lorem ipsum dolor_sit amet_
: GitHub renders that as _lorem ipsum dolorsit amet, so false positives can't be detected on the level of individual words.
Can you checkout the branch https://github.com/MDeiml/tree-sitter-markdown/tree/stable-underscores
I implemented the changes and it only broke one (unimportant) test, which already doesn't work for stars. That would be ok for me if it means the parser is more stable.
It would help if I understood how Tree-sitter reuses an existing tree. For instance, the point at which it begins re-parsing after a change is odd to me. I'd understand if it re-parsed starting at the beginning of the buffer line, or at the point of the change, but it seems to re-parse at the exact point that causes problems for your technique.
I'd imagine that it was reparsing earlier, and that column 6 was the first place it needed to consult the external scanner; but then we know that's not true from the results we got on the first pass.
I would guess it also has to do with "conflicts" which I use for emphasis. There's also the tree-sitter command line tool, which sometimes helps with debugging. But it's not great for this "interactive" part of the parsers.
I implemented the changes and it only broke one (unimportant) test, which already doesn't work for stars. That would be ok for me if it means the parser is more stable.
It seems to work fine. Amazingly, it passes the _lorem ipsum dolor_sit amet_
test I described above. I'll use it for a few days and let you know if I run into anything else.
@savetheclocktower I guess this means everything works fine?
@savetheclocktower I guess this means everything works fine?
It's been great so far. I've been using this in Pulsar as a hastily-thrown-together alternative Markdown package and I've noticed only two things:
I've put the alternative Markdown package up on our package repository and invited people to use it who were having problems with the built-in Markdown package. The common theme among all those folks seemed to be that they were all using non-US and non-UK keyboards, and the other Markdown parser crashed in certain special input scenarios. The author of that comment mentioned just this morning that they're giving the alternative package a try.
Anyway, your change is a positive one, as far as I'm concerned. Even if I could reproduce issue 1, it's still an improvement over the status quo; and even if issue 2 was caused by the parser, I doubt it's related to this specific problem with italics. So I'm happy for you to close this issue. Thanks!
I can reproduce this one in the tree-sitter playground on the latest
master
. From the root:Wait for the playground to open. Type some text with markdown formatting. The tree will not reflect the content.
Select everything you've written, cut it, then paste it again. The tree should be correct.
Change anything — even add a space at the very end of the line — and the tree will start to incorrectly shed nodes.
I was able to reproduce this just by pasting and editing the following text:
After pasting that content, add a space character anywhere, and you'll notice that the first
emphasis
node disappears from the tree. The next change (even just deleting the new space) will cause the secondemphasis
node to disappear, and so on.This is a symptom of what I observed on my own with the
web-tree-sitter
bindings: if you callparser.parse(someText, null)
, everything parses fine; if you callparser.parse(someText, existingTree)
, nodes get lost. I have no idea if this occurs under the rust or node bindings, but I've worked with at least a dozen tree-sitter parsers underweb-tree-sitter
and this is the first time I've observed any symptom like this.