taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License

Bug Report #189

Closed (po6ix closed this issue 2 years ago)

po6ix commented 2 years ago

Details here: https://blog.p6.is/writeups-for-hayyim-security-ctf-2022/#Solution-5

const sanitize = require('sanitize-html');
const { parse } = require('node-html-parser');

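// Malformed input: an unclosed end tag plus a vertical tab (\x0b) inside a tag.
let html = "</a<a><a><a<a\x0ba ";
// Output of parsing the sanitized string and serializing it again:
console.log(parse(sanitize(html)).outerHTML); // <a></a</a><a a></a>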

The problem is that node-html-parser parses the unclosed tag and \x0b incorrectly. This can lead to a kind of mutation XSS, even when the HTML string has already been sanitized.
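The \x0b half of the report can be illustrated in isolation. The HTML Standard's ASCII whitespace set is TAB, LF, FF, CR and SPACE, so a vertical tab (U+000B) does not end a tag name, whereas JavaScript's `\s` regex class does match it. The sketch below shows how a parser that splits tags on `\s` would disagree with one that follows the standard. This is an assumption about the kind of mismatch involved, not something verified against node-html-parser's source; `splitTagName` is a hypothetical helper, and the input is a simplified fragment of the string in the report.

```javascript
// HTML's ASCII whitespace: TAB, LF, FF, CR, SPACE. U+000B is not in the set.
const HTML_WHITESPACE = /[\t\n\f\r ]/;

// JavaScript's \s class is broader and does include U+000B (vertical tab).
const JS_WHITESPACE = /\s/;

// Hypothetical helper: split the inside of a start tag into a tag name
// and the remainder, at the first character matching `separator`.
function splitTagName(tagBody, separator) {
  const idx = tagBody.search(separator);
  return idx === -1
    ? { name: tagBody, rest: '' }
    : { name: tagBody.slice(0, idx), rest: tagBody.slice(idx + 1) };
}

const tagBody = 'a\x0ba'; // simplified "a<VT>a" fragment from the report

const specStyle = splitTagName(tagBody, HTML_WHITESPACE);
const regexStyle = splitTagName(tagBody, JS_WHITESPACE);

console.log(JSON.stringify(specStyle));  // {"name":"a\u000ba","rest":""}
console.log(JSON.stringify(regexStyle)); // {"name":"a","rest":"a"}
```

With the standard's whitespace set the whole fragment is one tag name. With `\s` it becomes tag `a` plus an attribute `a`, which matches the `<a a>` seen in the output above. When two components disagree like this, markup that one of them considered harmless can change meaning in the other.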

nonara commented 2 years ago

Thanks for the report.

Quoting the readme:

Per the design, it intends to parse massive HTML files in lowest price, thus the performance is the top priority. For this reason, some malformatted HTML may not be able to parse correctly

Considering that the library takes input HTML and produces a resulting set of nodes, XSS isn't really in the purview of this library. (How would we even distinguish between what was user input and what was not?) If user input is taken with the intention of threading it into HTML, the responsibility for validating that input falls on the software which precedes the call to this library. The library's purpose is to provide a minimally manipulatable DOM tree for the provided HTML at minimal cost.

Use of this library may be the right choice if there is a reasonable expectation of having valid HTML. If needs extend beyond this, using another library is advisable.