taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License

Incorrect parsing of tag name #277

Open noway opened 5 months ago

noway commented 5 months ago

node-html-parser currently uses the following regex pattern to parse tag names:

https://github.com/taoqf/node-html-parser/blob/v6.1.14/src/nodes/html.ts#L924-L925

This is incorrect, because a tag name can belong to any element, not only to a custom element. The part of the spec that actually governs tag name parsing is the tag name state: https://html.spec.whatwg.org/multipage/parsing.html#tag-name-state
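For illustration, here is a rough sketch of what the spec's tag open state and tag name state do for a start tag. The helper name readTagName and its signature are made up for this example and are not part of node-html-parser or any other library. It also ignores end tags, attributes, and the rest of the tokenizer, so treat it as an approximation of the rules rather than a drop-in fix.

```javascript
// Hypothetical helper (not a node-html-parser API): reads a start tag name
// from `input`, where `start` is the index of the character right after '<'.
// Returns the tag name, or null if no start tag would be emitted.
function readTagName(input, start) {
  // Tag open state: a start tag token is only created when the first
  // character is an ASCII letter.
  if (!/[A-Za-z]/.test(input[start] || '')) return null

  let name = ''
  for (let i = start; i < input.length; i++) {
    const ch = input[i]
    // Tag name state: tab, line feed, form feed, space, '/' and '>' end the name.
    if (ch === '\t' || ch === '\n' || ch === '\f' || ch === ' ' || ch === '/' || ch === '>') {
      return name
    }
    if (ch === '\0') {
      name += '\uFFFD' // NULL is replaced with U+FFFD
    } else if (ch >= 'A' && ch <= 'Z') {
      name += ch.toLowerCase() // ASCII upper alpha is lowercased
    } else {
      name += ch // anything else, including '@', is appended as-is
    }
  }
  return null // input ended inside the tag, so no tag is emitted
}

console.log(readTagName('<h@1>', 1))
```

The point is that after the first ASCII letter, the tag name state accepts almost any character, so '@' does not terminate the name.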

Test case:

const parse = require('parse5').parse
const Parser = require('htmlparser2').Parser
const { parse: parseNhp } = require('node-html-parser')

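// parse5 wraps the fragment in a full document: html -> body (second child of html) -> first child of body
const root2 = parse('<h@1>')
console.log('parse5:', root2.childNodes[0].childNodes[1].childNodes[0].nodeName)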

const parser = new Parser({
  onopentag(name) {
    console.log('htmlparser2:', name)
  }
})
parser.write('<h@1>')
parser.end()

const root = parseNhp('<h@1>')
console.log('node-html-parser:', root.childNodes[0].rawTagName)

Output:

parse5: h@1
htmlparser2: h@1
node-html-parser:

HTML:

<h@1>

Chrome:

[screenshot]

Firefox:

[screenshot]

As shown above, the tag name h@1 is parsed correctly by parse5, htmlparser2, Chrome, and Firefox, but not by node-html-parser, which returns an empty tag name.


As for whether markup containing h@1 is 'broken' or 'malformed' HTML: it is not. Although h@1 is not permitted by any content model, it is permitted inside elements whose content model is 'nothing'.

The following code:

<!DOCTYPE html>
<html lang="en">
  <head>
    <title>test</title>
  </head>
  <body>
  <template>
    <h@1>Smile!</h@1>
  </template>
  </body>
</html>

passes the HTML5 validator:

[screenshot]