google / gumbo-parser

An HTML5 parsing library in pure C99
Apache License 2.0

Invalid parsing of whitespace at document end #410

Closed: ScumCoder closed this issue 1 year ago

ScumCoder commented 5 years ago

When parsing a trivial document, the GumboStringPiece holding the original_text of the GumboText for a GUMBO_NODE_WHITESPACE node has an incorrect length value, which causes it to include the closing tags.

Also, the text field contains two linebreaks instead of one.
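
One way to confirm the first symptom is to check that a whitespace node's original_text contains nothing but whitespace. The sketch below is self-contained: `StringPiece` is a stand-in that mirrors the `data` and `length` fields of `GumboStringPiece`, and `piece_is_all_whitespace` is a hypothetical helper, not part of Gumbo's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in mirroring the data/length fields of GumboStringPiece.
 * Like the real struct, it is a slice of the input buffer and is
 * NOT NUL-terminated. */
typedef struct {
  const char* data;
  size_t length;
} StringPiece;

/* The five characters HTML treats as whitespace. */
bool is_html_space(char c) {
  return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
}

/* Hypothetical helper: returns true if every byte in the slice is HTML
 * whitespace. An empty slice counts as all-whitespace. */
bool piece_is_all_whitespace(StringPiece piece) {
  assert(piece.data != NULL || piece.length == 0);
  for (size_t i = 0; i < piece.length; ++i) {
    if (!is_html_space(piece.data[i])) {
      return false;  /* e.g. the '<' of a closing tag leaked into the slice */
    }
  }
  return true;
}
```

With the real library one would presumably pass `node->v.text.original_text` for a node whose type is `GUMBO_NODE_WHITESPACE`; a `false` result would reproduce the report above. Because the slice is not NUL-terminated, print it with `printf("%.*s", (int)piece.length, piece.data)` rather than `%s`.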

See SSCCE here.

The version used is aa91b27.

ScumCoder commented 5 years ago

Come to think of it, there is something fishy about the preceding whitespace as well. A document like this

<!DOCTYPE html>
<html>
<head>
</head>
<body>
</body>
</html>

should produce a root HTML node with five children, not three:

  1. WHITESPACE
  2. HEAD
  3. WHITESPACE
  4. BODY
  5. WHITESPACE

each whitespace consisting of a single newline character.

craigbarnes commented 4 years ago

> Come to think of it, there is something fishy about the preceding whitespace as well. A document like this
>
> <!DOCTYPE html>
> <html>
> <head>
> </head>
> <body>
> </body>
> </html>
>
> should produce a root HTML node with five children, not three:
>
>   1. WHITESPACE
>   2. HEAD
>   3. WHITESPACE
>   4. BODY
>   5. WHITESPACE
>
> each whitespace consisting of a single newline character.

If you load that document into Chromium and run document.documentElement.childNodes.length in the console, it gives a result of 3. Likewise for Firefox.

So without consulting the spec, I'm inclined to think Gumbo is doing what it's supposed to do.
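
For anyone who wants to see why three is plausible, here is a toy model in C. It is a sketch, not Gumbo's code or a real parser. It assumes the HTML Standard's tree-construction rules for whitespace as I understand them: whitespace is ignored before `<head>`, inserted into `<head>` while in head, inserted as a child of `<html>` after `</head>`, and inserted into `<body>` from then on, including after `</body>` and `</html>`. All names (`Mode`, `Counts`, `model_whitespace`) are made up for illustration, and only the exact tags from the example document are recognized.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum {
  MODE_INITIAL, MODE_BEFORE_HTML, MODE_BEFORE_HEAD, MODE_IN_HEAD,
  MODE_AFTER_HEAD, MODE_IN_BODY, MODE_AFTER_BODY, MODE_AFTER_AFTER_BODY
} Mode;

typedef struct {
  int html_children;  /* direct children of the <html> element */
  int head_ws_chars;  /* whitespace characters that land inside <head> */
  int body_ws_chars;  /* whitespace characters that land inside <body> */
} Counts;

/* True if the tag slice [p, p+len) equals the literal tag text. */
static bool tag_is(const char* p, size_t len, const char* tag) {
  return len == strlen(tag) && strncmp(p, tag, len) == 0;
}

/* Toy model: walks the input and applies the assumed whitespace rules. */
Counts model_whitespace(const char* html) {
  Counts c = {0, 0, 0};
  Mode mode = MODE_INITIAL;
  bool last_html_child_is_text = false;  /* adjacent characters share a node */
  const char* p = html;
  while (*p != '\0') {
    if (*p == '<') {
      const char* end = strchr(p, '>');
      if (end == NULL) break;  /* unterminated tag: stop the toy model */
      size_t len = (size_t)(end - p) + 1;
      if (tag_is(p, len, "<!DOCTYPE html>")) mode = MODE_BEFORE_HTML;
      else if (tag_is(p, len, "<html>")) mode = MODE_BEFORE_HEAD;
      else if (tag_is(p, len, "<head>")) {
        c.html_children++; last_html_child_is_text = false; mode = MODE_IN_HEAD;
      } else if (tag_is(p, len, "</head>")) mode = MODE_AFTER_HEAD;
      else if (tag_is(p, len, "<body>")) {
        c.html_children++; last_html_child_is_text = false; mode = MODE_IN_BODY;
      } else if (tag_is(p, len, "</body>")) mode = MODE_AFTER_BODY;
      else if (tag_is(p, len, "</html>")) mode = MODE_AFTER_AFTER_BODY;
      p = end + 1;
      continue;
    }
    if (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\f' || *p == '\r') {
      switch (mode) {
        case MODE_INITIAL: case MODE_BEFORE_HTML: case MODE_BEFORE_HEAD:
          break;  /* ignored */
        case MODE_IN_HEAD:
          c.head_ws_chars++; break;
        case MODE_AFTER_HEAD:  /* becomes a text child of <html> */
          if (!last_html_child_is_text) {
            c.html_children++; last_html_child_is_text = true;
          }
          break;
        default:  /* in body, after body, after after body: goes into <body> */
          c.body_ws_chars++; break;
      }
    }
    p++;
  }
  return c;
}
```

Under these assumptions, the example document without a trailing newline gives `<html>` three children (HEAD, one whitespace node, BODY) and puts two newlines inside `<body>`: one after `<body>` and one after `</body>`. That would also account for the "two linebreaks" in the original report, since the text node's source span would then have to cross the `</body>` tag.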