rubys / nokogumbo

A Nokogiri interface to the Gumbo HTML5 parser.
Apache License 2.0

Incorrect parsing with self-closing tag #162

Closed sdalu closed 3 years ago

sdalu commented 3 years ago

When a self-closing tag is used, the parser doesn't close the tag correctly. It doesn't seem to happen with tags that are part of HTML5.

require 'nokogumbo'  # provides Nokogiri::HTML5

t = "<div><bib/><bib/></div>"
puts Nokogiri::HTML5(t).to_xml

Result

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<html>
  <head/>
  <body>
    <div>
      <bib>
        <bib/>
      </bib>
    </div>
  </body>
</html>

Expected

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<html>
  <head/>
  <body>
    <div>
      <bib/>
      <bib/>
    </div>
  </body>
</html>

rubys commented 3 years ago

There is no such thing as self-closing tag syntax in HTML5. Try executing the following in the console of your favorite browser:

div = document.createElement('div'); div.innerHTML = '<div><bib/><bib/></div>'; div.innerHTML

The result you will see is:

"<div><bib><bib></bib></bib></div>"

The purpose of HTML5 (and therefore Gumbo) is to parse input as browsers do.

stevecheckoway commented 3 years ago

In more detail, bib isn't an element defined in HTML, so the tree-construction stage of the HTML parser acts (in this case) according to the "Any other start tag" in the "in body" insertion mode. This says to insert an element for the tag and make it the current node in the tree. The fact that it was self-closing is ignored.

The second <bib/> is treated identically: a bib element is inserted as a child of the current node which is the first bib element.
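
To make the two paragraphs above concrete, here is a toy model of the stack of open elements in plain Ruby. It is only a sketch, not Gumbo's implementation: `Node`, `build_tree`, `serialize`, and the token tuples are hypothetical names made up for illustration, end-tag handling is heavily simplified, and the void-element list is assumed from the current HTML standard.

```ruby
# Toy model of the "in body" behaviour described above. NOT Gumbo's code.
# Assumption: void elements as listed in the current HTML standard.
VOID_ELEMENTS = %w[area base br col embed hr img input link meta source track wbr].freeze

Node = Struct.new(:name, :children)

# tokens: array of [:start, name, self_closing] or [:end, name] (hypothetical format)
def build_tree(tokens)
  root = Node.new('#fragment', [])
  open_elements = [root] # stack of open elements; the last entry is the current node
  tokens.each do |type, name, _self_closing|
    case type
    when :start
      node = Node.new(name, [])
      open_elements.last.children << node
      # The self-closing flag is ignored: only void elements are not left open.
      open_elements.push(node) unless VOID_ELEMENTS.include?(name)
    when :end
      # Simplification: pop everything up to and including the nearest match.
      idx = open_elements.rindex { |n| n.name == name }
      open_elements.pop(open_elements.size - idx) if idx
    end
  end
  root
end

# Serialize the children of a node back to markup.
def serialize(node)
  node.children.map do |child|
    if VOID_ELEMENTS.include?(child.name)
      "<#{child.name}>"
    else
      "<#{child.name}>#{serialize(child)}</#{child.name}>"
    end
  end.join
end

tokens = [[:start, 'div', false], [:start, 'bib', true], [:start, 'bib', true], [:end, 'div']]
puts serialize(build_tree(tokens))
```

Under these assumptions the second bib ends up inside the first, which matches the browser result quoted earlier in the thread. Replacing both bib tokens with br yields two siblings instead.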

There actually are some elements in HTML that can be self-closing. These are the void elements (those that have no contents) and some SVG and MathML elements.

Any other start tags that are self-closing are parse errors, specifically non-void-html-element-start-tag-with-trailing-solidus parse errors. The specification's entry for that error contains the example <div/><span></span><span></span>, where the two span elements are children of the div.

If I were to guess, I'd say your example will give rise to three parse errors: one about missing DOCTYPE and two non-void-html-element-start-tag-with-trailing-solidus errors. Let's give it a go.

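require 'nokogumbo'  # provides Nokogiri::HTML5

# max_errors caps how many parse errors are collected in doc.errors
doc = Nokogiri::HTML5('<div><bib/><bib/></div>', max_errors: 20)
doc.errors.each { |err| puts(err) }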

prints out

1:1: ERROR: Expected a doctype token
<div><bib/><bib/></div>
^
1:6: ERROR: Start tag of nonvoid HTML element ends with '/>', use '>'.
<div><bib/><bib/></div>
     ^
1:12: ERROR: Start tag of nonvoid HTML element ends with '/>', use '>'.
<div><bib/><bib/></div>
           ^
1:18: ERROR: That tag isn't allowed here  Currently open tags: html, body, div, , .
<div><bib/><bib/></div>
                 ^

Missed one! The final </div> is an error from the rule 'An end tag whose tag name is one of: […] "div" […]' in the "in body" insertion mode. In this case, the current node (the second bib element) is not an HTML element with the same tag name as the </div> token.

That final error does suggest we need to fix the error message though. There should be two bibs there.

sdalu commented 3 years ago

Thanks for all the clarifications. (Perhaps I'll need to fall back to XHTML.)
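
For reference, a minimal sketch of that XML fallback. It uses REXML, which ships with Ruby, only to keep the example free of extra dependencies; with Nokogiri the analogous entry point would be `Nokogiri::XML`. The input must be well-formed XML for this to work, which is an assumption about the documents being parsed.

```ruby
require 'rexml/document'

# An XML parser honors the self-closing syntax, unlike the HTML5 algorithm.
doc = REXML::Document.new('<div><bib/><bib/></div>')

children = doc.root.elements.to_a # element children of <div>
puts children.map(&:name).inspect
puts children.map { |e| e.elements.size }.inspect # child-element count of each bib
```

Here both bib elements should come out as empty siblings under div, which is the tree the original report expected.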