taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License

Something strange with `premises` tag and querySelector() / querySelectorAll() #156

Closed: johnnyoshika closed this issue 2 years ago

johnnyoshika commented 3 years ago

It seems the premises tag (or a similarly named tag such as x-premises) behaves unexpectedly with querySelector() and querySelectorAll().

Example:

import { parse } from "node-html-parser";

const sample1 = parse(
  "<premises><color>Red</color><color>Green</color></premises>"
);
console.log(sample1.querySelectorAll("color").length); // 0

const sample2 = parse("<foo><color>Red</color><color>Green</color></foo>");
console.log(sample2.querySelectorAll("color").length); // 2

The only difference between sample1 and sample2 is that sample1 wraps the HTML in a premises tag, whereas sample2 uses foo. Yet the query for the color tag yields different results.

Here's a Codesandbox example: https://codesandbox.io/s/node-html-parser-premises-k5z81?file=/src/index.js:43-210

nonara commented 3 years ago

Thanks for the report + the repro!

Looks like it's getting treated as a preformatted block (a pre tag). There seems to be a regex that matches part of the tag name instead of the full name, so premises is presumably picked up because it contains pre.
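
To make the suspected cause concrete, here is a minimal sketch of how a partial match on the tag name can go wrong. The tag list, the two patterns, and the helper names matchesPartial and matchesFull are assumptions made up for illustration; they are not taken from node-html-parser's source.

```javascript
// Illustrative only: hypothetical patterns, not node-html-parser's actual code.
const blockTags = ["script", "noscript", "style", "pre"];

// Unanchored pattern: true for any tag name that merely contains a block tag name.
const partialPattern = new RegExp(blockTags.join("|"), "i");

// Anchored pattern: true only when the whole tag name is a block tag name.
const fullPattern = new RegExp(`^(?:${blockTags.join("|")})$`, "i");

function matchesPartial(tagName) {
  return partialPattern.test(tagName);
}

function matchesFull(tagName) {
  return fullPattern.test(tagName);
}

for (const tag of ["pre", "premises", "x-premises", "foo"]) {
  console.log(tag, matchesPartial(tag), matchesFull(tag));
}
// pre true true
// premises true false
// x-premises true false
// foo false false
```

With the unanchored pattern, premises and x-premises are classified like pre, so their children would be kept as raw text rather than parsed into elements. That would be consistent with querySelectorAll("color") returning 0 in the report above.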

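Will have a look at correcting it this weekend. I believe as a temporary workaround, passing the following as the options (second) argument to parse() should work:

{
  // pre is left out on purpose, so nothing is matched against it
  blockTextElements: {
    script: true,
    noscript: true,
    style: true
  }
}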

johnnyoshika commented 3 years ago

That option did the trick: https://codesandbox.io/s/node-html-parser-premises-k5z81?file=/src/index.js

Please let me know when you cut a new release with this fix, as I'll update my projects with it. Thanks for the quick response!

nonara commented 2 years ago

Corrected in v5