tree-sitter-grammars / tree-sitter-markdown

Markdown grammar for tree-sitter
MIT License

highlighting broken when more lines with inline code, last with `[x][]` appear #75

Closed litoj closed 1 year ago

litoj commented 1 year ago

Describe the bug

When one or more lines contain inline code (`xxx`) and a later line contains inline code with `[x][]` inside it, the text between the end of the previous inline code and the start of this inline code is highlighted as code, and the last inline code is treated as a link.

Code example

`some code`
normal text (or even nothing) `[index][]`

Expected behavior

image

Actual behavior

image

There must be two pairs of square brackets, with text in the first pair, for this to happen. The problem does not occur when the previous inline code (no matter how many lines earlier) is separated from this text by an empty line.

MDeiml commented 1 year ago

I did some digging and discovered that this is related to a discussion I already had in the tree-sitter repo: https://github.com/tree-sitter/tree-sitter/discussions/1546

Basically, tree-sitter consumes files one token at a time, and as such it sometimes has to track multiple different interpretations of the input at the same time. In tree-sitter these are called "conflicts". Usually it is suggested to avoid this behavior, but as Markdown has very weird syntax, that is not possible for this parser.

The problem now arises when the parser has to track too many interpretations at the same time. In this case it throws away the interpretation it considers "worst". For the input

`a` b `[c][]`

it gets as far as the last `]` before it throws away the "correct" interpretation. At that point the input it has seen so far is

`a` b `[c][]

so the syntax highlighting you're getting makes sense for that input.

I don't know of any easy way to solve this without changes in tree-sitter itself. As such, the possible solutions are:

  1. Avoid "conflicts" for code spans, e.g. by parsing ahead until the closing `. (Probably fast, but output might be less correct)
  2. Change tree-sitter to be able to configure the maximum number of branches i.e. "interpretations" (Slower but more correct)
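
As a rough illustration of strategy 1 (not the actual scanner code in this repository), scanning ahead for code spans could look like the sketch below. It assumes the CommonMark rule that a run of N backticks is closed by the next run of exactly N backticks, and that an opener without a matching closer is plain text. It ignores backslash escapes and block boundaries such as blank lines, and the name `find_code_spans` is hypothetical.

```python
import re

# A "run" is one or more consecutive backticks.
BACKTICK_RUN = re.compile(r"`+")


def find_code_spans(text: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of code spans, delimiters included.

    Hypothetical helper: end offsets are exclusive. Escapes and block
    structure are deliberately not handled in this sketch.
    """
    spans = []
    pos = 0
    while True:
        opener = BACKTICK_RUN.search(text, pos)
        if opener is None:
            break
        # Scan ahead for a closing run of exactly the same length.
        closer = None
        scan = opener.end()
        while True:
            candidate = BACKTICK_RUN.search(text, scan)
            if candidate is None:
                break
            if len(candidate.group()) == len(opener.group()):
                closer = candidate
                break
            scan = candidate.end()
        if closer is None:
            # No matching closer: treat the opener as literal text.
            pos = opener.end()
        else:
            spans.append((opener.start(), closer.end()))
            pos = closer.end()
    return spans


# Full input from this issue: both code spans are found.
print(find_code_spans("`a` b `[c][]`"))  # [(0, 3), (6, 13)]
# Truncated input: the last backtick has no closer, so it stays literal.
print(find_code_spans("`a` b `[c][]"))   # [(0, 3)]
```

Because the closer is located before anything inside the span is interpreted, the `[c][]` between the delimiters never has to be considered as a possible link, so no competing interpretation is needed for it.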

I'm going to try implementing strategy 1. and see if I can get it to work, but maybe @maxbrunsfeld could comment?

maxbrunsfeld commented 1 year ago

Yeah, I think it would make sense to match all inline delimiters by scanning ahead. That should be pretty fast, and allow the actual parsing process to be simpler, since there won’t be constant ambiguity.

MDeiml commented 1 year ago

Thanks for the quick answer! Sadly scanning ahead is only (easily) possible for code spans, because other elements like emphasis have weird nesting rules. But I'm going to attempt this approach for code spans then :+1:

litoj commented 1 year ago

If this is what the update to the nvim-treesitter markdown parser was, then it doesn't work, at least for me. But I will wait a week and then report again if it still appears, just to give the guys at nvim-TS enough time to be sure it was updated, though it would seem to have happened already.

MDeiml commented 1 year ago

Thanks for the quick feedback. I don't think nvim-treesitter has picked up this change yet. As I understand it, it always tracks the newest "github release" (this fix is present only in v0.1.5, which I published just now).

litoj commented 1 year ago

Oh, ok, thanks for the explanation. And also for the fix.