lezer-parser / html

An HTML parser for Lezer
MIT License

dialect selfClosing is not working - parse error at SelfClosingEndTag #13

Closed by milahu 6 months ago

milahu commented 6 months ago

for a semantic stage using this parser it is useful to know the difference between ">" and "/>"

by default, both ">" and "/>" are parsed as EndTag, so currently i need some extra if/then/else logic to tell them apart

i tried to parse "/>" as SelfClosingEndTag by enabling the selfClosing dialect, but this gives a parse error at "/>"

input: <img><br/>

lezer-parser-html with default config: ">" and "/>" both produce node 4 (EndTag)

0: node 10 = StartTag: "<"
1: node 22 = TagName: "img"
4: node 4 = EndTag: ">"
5: node 10 = StartTag: "<"
6: node 22 = TagName: "br"
8: node 4 = EndTag: "/>"
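
The "extra if/then/else logic" mentioned above could look roughly like the following. This is a minimal sketch, not part of @lezer/html: `endTagKind` is a hypothetical helper, and the tokens are plain `{ name, from, to }` objects shaped like the listing above rather than real Lezer nodes.

```javascript
// Hypothetical helper (not part of @lezer/html): classify an EndTag token
// by looking at the source text it covers.
function endTagKind(source, token) {
  if (token.name !== "EndTag") return token.name;
  const text = source.slice(token.from, token.to);
  return text.endsWith("/>") ? "SelfClosingEndTag" : "EndTag";
}

// Plain objects mirroring the default-config listing for `<img><br/>`.
const endTagSource = "<img><br/>";
const endTagTokens = [
  { name: "StartTag", from: 0, to: 1 },
  { name: "TagName", from: 1, to: 4 },
  { name: "EndTag", from: 4, to: 5 },
  { name: "StartTag", from: 5, to: 6 },
  { name: "TagName", from: 6, to: 8 },
  { name: "EndTag", from: 8, to: 10 },
];

const endTagKinds = endTagTokens.map((t) => endTagKind(endTagSource, t));
console.log(endTagKinds);
```

The check only inspects the last two characters of the token, so it relies on the EndTag node covering the "/" as well, as it does in the listing above (from 8 to 10).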

lezer-parser-html with .configure({ dialect: "selfClosing" }): "/>" gives a parse error

0: node 10 = StartTag: "<"
1: node 22 = TagName: "img"
4: node 4 = EndTag: ">"
5: node 10 = StartTag: "<"
6: node 22 = TagName: "br"
8: node 0 = ⚠: ""
8: node 16 = Text: "/>"

what would tree-sitter-html do? ">" and "/>" produce different nodes by default: node 3 and node 6

0: node 5 = <: "<"
1: node 17 = tag_name: "img"
4: node 3 = >: ">"
5: node 5 = <: "<"
6: node 17 = tag_name: "br"
8: node 6 = />: "/>"

https://github.com/lezer-parser/html/blob/fa8c9d581062bbf9d9d018637657a196d4e0cf0e/src/html.grammar#L108-L109

> for a semantic stage using this parser

i'm using a custom tree walker that returns a sequence of tokens, so when i concat all these tokens, i get the original source text

lezer-parser-html

```js
// https://codereview.stackexchange.com/a/97886/205605
// based on nix-eval-js/src/lezer-parser-nix/src/nix-format.js

/** @param {Tree | TreeNode} tree */
function walkHtmlTree(tree, func) {
  const cursor = tree.cursor();
  //if (!cursor) return '';
  if (!cursor) return;
  let depth = 0;
  while (true) {
    // NLR: Node, Left, Right
    // Node
    const cursorTypeId = cursor.type.id;
    if (
      !(
        cursorTypeId == 15 || // Document
        cursorTypeId == 20 || // Element
        cursorTypeId == 23 || // Attribute
        cursorTypeId == 21 || // OpenTag
        cursorTypeId == 37 || // CloseTag
        cursorTypeId == 38 || // SelfClosingTag
        // note: this is inconsistent in the parser
        // InvalidEntity is child node
        // EntityReference is separate node (sibling of other text nodes)
        cursorTypeId == 19 || // InvalidEntity: "&" is parsed as InvalidEntity
        //cursorTypeId == 17 || // EntityReference: "&amp;" or "&mdash;" is parsed as EntityReference
        false
      )
    ) {
      func(cursor)
    }
    // Left
    if (cursor.firstChild()) {
      // moved down
      depth++;
      continue;
    }
    // Right
    if (depth > 0 && cursor.nextSibling()) {
      // moved right
      continue;
    }
    let continueMainLoop = false;
    let firstUp = true;
    while (cursor.parent()) {
      // moved up
      depth--;
      if (depth <= 0) {
        // when tree is a node, stop at the end of node
        // == dont visit sibling or parent nodes
        return;
      }
      if (cursor.nextSibling()) {
        // moved up + right
        continueMainLoop = true;
        break;
      }
      firstUp = false;
    }
    if (continueMainLoop) continue;
    break;
  }
}

import { parser as lezerParserHtml } from '@lezer/html';

const htmlParser = lezerParserHtml.configure({
  //dialect: "selfClosing",
});

const inputHtml = `<img><br/>
`;

const htmlTree = htmlParser.parse(inputHtml);
const topNode = htmlTree.topNode;
let lastNodeTo = 0;
walkHtmlTree(topNode, (node) => {
  // slice from the end of the previous token, so concatenating all slices gives the source
  const nodeSource = inputHtml.slice(lastNodeTo, node.to);
  lastNodeTo = node.to;
  console.log(`${node.from}: node ${node.type.id} = ${node.type.name}: ${JSON.stringify(nodeSource)}`)
});
```

tree-sitter-html

```py
# https://github.com/tree-sitter/py-tree-sitter/issues/33

import json

#def traverse_tree(tree: Tree):
def walk_html_tree(tree, func):
    ignore_kind_id = [
        25, # fragment
        26, # doctype
        28, # element
        29, # script_element
        30, # style_element
        31, # start_tag
        34, # self_closing_tag
        35, # end_tag
        37, # attribute
        38, # quoted_attribute_value
    ]
    cursor = tree.walk()
    reached_root = False
    while reached_root == False:
        if cursor.node.kind_id not in ignore_kind_id:
            #yield cursor.node
            func(cursor.node)
        if cursor.goto_first_child():
            continue
        if cursor.goto_next_sibling():
            continue
        retracing = True
        while retracing:
            if not cursor.goto_parent():
                retracing = False
                reached_root = True
            if cursor.goto_next_sibling():
                retracing = False

last_node_to = 0
input_html = """<img><br/>
"""

def walk_callback(node):
    # last_node_to lives at module level here, so it needs global, not nonlocal
    global last_node_to
    s = json.dumps(node.text.decode("utf8"))
    print(f"{node.range.start_byte}: node {node.kind_id} = {node.type}: {s}")
    #node_source = input_html[last_node_to:node.range.end_byte]
    last_node_to = node.range.end_byte

import tree_sitter
import tree_sitter_languages

tree_sitter_html = tree_sitter_languages.get_parser("html")
html_parser = tree_sitter_html
# py-tree-sitter parses bytes, not str
html_tree = html_parser.parse(input_html.encode("utf8"))
top_node = html_tree.root_node
walk_html_tree(top_node, walk_callback)
```

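
The round-trip property described above (concatenating all tokens gives back the source) can be checked on its own. This is a minimal sketch that assumes only a list of token end offsets; `sliceByTokenEnds` is a hypothetical helper, and the offsets are taken from the listings earlier in this issue.

```javascript
// Hypothetical helper: cut the source at each token end offset, the same way
// the walker callback slices from the previous token end to the current one.
function sliceByTokenEnds(source, tokenEnds) {
  const pieces = [];
  let last = 0;
  for (const to of tokenEnds) {
    pieces.push(source.slice(last, to));
    last = to;
  }
  // Text after the last token (for example a trailing newline) is kept separately.
  return { pieces, tail: source.slice(last) };
}

const roundTripSource = "<img><br/>\n";
const roundTrip = sliceByTokenEnds(roundTripSource, [1, 4, 5, 6, 8, 10]);
console.log(roundTrip.pieces);
console.log(roundTrip.pieces.join("") + roundTrip.tail === roundTripSource);
```

Because every slice starts where the previous one ended, whitespace between tokens ends up attached to the following token instead of being lost.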
marijnh commented 6 months ago

Attached patch should help.

milahu commented 6 months ago

thanks, now "/>" is parsed as SelfClosingEndTag

stupid question: why is the selfClosing dialect not the default behavior?

html can contain arbitrary xml nodes like <custom/> where i cannot use the node name to detect self-closing nodes

for a semantic stage using this parser it is useful to know the difference between ">" and "/>"

marijnh commented 6 months ago

> html can contain arbitrary xml nodes like <custom/>

HTML ignores the / in that syntax and does not treat this as a self-closing tag. So making the parser treat it as if it works by default would be confusing to people.

milahu commented 6 months ago

aah, because HTML is a subset of SGML

so the DTD defines void elements which can end with > or /> but actually /> is XML syntax

<style>custom { color: red; }</style>
<div>
  <p>aaa</p>
  <custom/>
    <p>bbb</p> <!-- this is red -->
</div>
<p>ccc</p>
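
To make the nesting in this example explicit, here is a small sketch of how the open-element stack evolves when the trailing "/" is ignored on non-void elements. It is a deliberately simplified model, not what @lezer/html or a browser actually implements: `openElementPaths` is a hypothetical helper, tags are found with a regular expression, and implied end tags, raw-text elements, comments, and foreign content (SVG, MathML) are not handled. The void element list is the one from the HTML Living Standard as far as I know.

```javascript
// Void elements: these never have children, with or without a trailing "/".
const VOID_ELEMENTS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "source", "track", "wbr",
]);

// Hypothetical helper: return the ancestor path of every start tag.
// A trailing "/" on a non-void start tag is ignored, so the element stays open.
function openElementPaths(html) {
  const tagPattern = /<(\/?)([a-zA-Z][\w-]*)[^>]*>/g;
  const stack = [];
  const paths = [];
  for (const [, closing, rawName] of html.matchAll(tagPattern)) {
    const name = rawName.toLowerCase();
    if (closing) {
      // An end tag closes the nearest matching open element and everything inside it.
      const index = stack.lastIndexOf(name);
      if (index !== -1) stack.length = index;
    } else {
      paths.push([...stack, name].join(" > "));
      if (!VOID_ELEMENTS.has(name)) stack.push(name);
    }
  }
  return paths;
}

const customExample = [
  "<style>custom { color: red; }</style>",
  "<div>",
  "  <p>aaa</p>",
  "  <custom/>",
  "    <p>bbb</p>",
  "</div>",
  "<p>ccc</p>",
].join("\n");

const customPaths = openElementPaths(customExample);
console.log(customPaths);
// [ 'style', 'div', 'div > p', 'div > custom', 'div > custom > p', 'p' ]
```

In this model the second `<p>` ends up inside `custom`, which is why the `custom { color: red; }` rule applies to "bbb" but not to "aaa" or "ccc".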

related: Are (non-void) self-closing tags valid in HTML5?