tree-sitter / tree-sitter-html

HTML grammar for Tree-sitter
MIT License

How is the implicit end tag construct safe for incremental parsing? #21

Closed marijnh closed 3 years ago

marijnh commented 3 years ago

Hi. Sorry to put a question on the bug tracker, but I couldn't find a better channel.

The scanner will, when in an element that's closed by another opening tag, emit an IMPLICIT_END_TAG token. To do this, it looks past the <, at the next tag name, creating a dependency in the closed tag's syntax node on the name of the opening tag after it, despite that text being entirely inside another node in the resulting tree. Yet somehow incremental parsing works correctly if you change, say, <p>one<p>two to <p>one<span>two. Does anyone know why?

maxbrunsfeld commented 3 years ago

Hi Marijn,

For all tokens, including IMPLICIT_END_TAG, we store on the subtree a field called lookahead_bytes, which indicates how many bytes the lexer has read beyond the end of the subtree itself. This number gets propagated up the tree along with other numerical properties of subtrees. When editing a tree and invalidating affected subtrees, lookahead_bytes is taken into account.
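As a rough illustration, here is a toy model of that idea in Python. It is not Tree-sitter's actual code: the Node class, the parent and is_affected helpers, and the lookahead value of 3 bytes for the implicit end tag are all assumptions made up for this sketch. It only shows why an edit to the second tag name in <p>one<p>two still invalidates the first element.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a subtree; not Tree-sitter's real struct."""
    kind: str
    start: int                 # first byte covered by the node
    end: int                   # one past the last byte covered
    lookahead_bytes: int = 0   # bytes the lexer read beyond `end`
    children: List["Node"] = field(default_factory=list)


def parent(kind: str, children: List[Node]) -> Node:
    """Build a parent node, propagating the furthest lookahead upward."""
    start = children[0].start
    end = children[-1].end
    # Furthest byte any child's lexing depended on.
    reach = max(c.end + c.lookahead_bytes for c in children)
    return Node(kind, start, end, max(0, reach - end), children)


def is_affected(node: Node, edit_start: int, edit_old_end: int) -> bool:
    """True if the edited range touches the node's extent plus its lookahead."""
    return (edit_start < node.end + node.lookahead_bytes
            and edit_old_end >= node.start)


# Source text: <p>one<p>two
# Byte offsets: "<p>" 0-3, "one" 3-6, "<p>" 6-9, "two" 9-12.
# Assumption: to emit the zero-width implicit end tag at byte 6, the
# scanner read "<p>" (3 bytes) to learn the next tag name.
first_p = parent("element", [
    Node("start_tag", 0, 3),
    Node("text", 3, 6),
    Node("implicit_end_tag", 6, 6, lookahead_bytes=3),
])

# Same element, but with the lookahead ignored.
naive_first_p = Node("element", 0, 6)

# Edit: replace "p" (bytes 7-8) with "span", giving <p>one<span>two.
edit_start, edit_old_end = 7, 8

print(is_affected(first_p, edit_start, edit_old_end))
print(is_affected(naive_first_p, edit_start, edit_old_end))
```

In this model the first element covers only bytes 0-6, so an edit at byte 7 lies outside it. Because the propagated lookahead extends its dependency to byte 9, the edit still marks it as affected, whereas the naive version would wrongly be reused.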

marijnh commented 3 years ago

Thanks for answering. I hadn't noticed lookahead_bytes yet. That seems like a solid approach.