taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License

Skip `<template />` tags in `textContent` #235

Closed dgp1130 closed 1 year ago

dgp1130 commented 1 year ago

node-html-parser seems to include text from `<template />` tags in the `textContent` property of the elements that contain them:

import { parse } from 'node-html-parser';
const doc = parse('<div>Hello, <template>test</template>World!</div>');
console.log(doc.textContent); // "Hello,testWorld!"

This is contrary to browsers, where the contents of `<template />` elements are left out of `textContent`.

const doc = new DOMParser().parseFromString('<div>Hello, <template>test</template>World!</div>', 'text/html');
console.log(doc.documentElement.textContent); // "Hello, World!"

I'm using `DOMParser` for the example here, but reading `textContent` from the same markup in a live browser DOM gives the same output.

Two nuances here to keep in mind:

  1. Whitespace is preserved around a `<template />` tag. Note that the correct output is `Hello, World!` because there was a space prior to the `<template />`. This is also true with `Hello, <template></template> World!`, where all the whitespace is retained.
  2. Declarative shadow DOM is implemented as a `<template shadowrootmode="open">Hello!</template>`. The behavior here is awkward, since in the browser you'd never actually observe this in the real DOM: it would get converted into a real shadow root. Shadow roots are printed with `textContent`, but it's an open question whether that would be the intuitive behavior here. Personally, I think this should be interpreted as a shadow root and included in `textContent`, but I can see others disagreeing with me.
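
To make the two nuances concrete, here is a minimal sketch of the requested behavior written as a user-land tree walk. It is not node-html-parser's implementation or API: `SimpleNode`, `textContentSkippingTemplates`, the `includeDeclarativeShadowRoots` flag, and the `el`/`txt` helpers are all hypothetical names, and the node shape only assumes DOM-style `nodeType` numbering (1 for elements, 3 for text). Adapting it to real parser nodes would require mapping their fields onto this shape.

```typescript
// Hypothetical minimal node shape, used for illustration only.
interface SimpleNode {
  nodeType: number;                     // 1 = element, 3 = text (DOM numbering)
  tagName?: string;                     // only set on elements
  attributes?: Record<string, string>;  // only set on elements
  text?: string;                        // only set on text nodes
  childNodes: SimpleNode[];
}

const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Concatenates descendant text, skipping anything inside <template>.
// Text nodes are returned untouched, so surrounding whitespace survives (nuance 1).
// The flag is an assumed opt-in for treating <template shadowrootmode> as a
// shadow root whose text is included (nuance 2).
function textContentSkippingTemplates(
  node: SimpleNode,
  includeDeclarativeShadowRoots = false,
): string {
  if (node.nodeType === TEXT_NODE) {
    return node.text ?? '';
  }
  if (node.nodeType === ELEMENT_NODE && node.tagName?.toLowerCase() === 'template') {
    const isShadowRoot = node.attributes?.['shadowrootmode'] !== undefined;
    if (!(isShadowRoot && includeDeclarativeShadowRoots)) {
      return '';
    }
  }
  return node.childNodes
    .map((child) => textContentSkippingTemplates(child, includeDeclarativeShadowRoots))
    .join('');
}

// Small helpers for building example trees by hand.
const el = (
  tagName: string,
  attributes: Record<string, string>,
  ...childNodes: SimpleNode[]
): SimpleNode => ({ nodeType: ELEMENT_NODE, tagName, attributes, childNodes });
const txt = (text: string): SimpleNode => ({ nodeType: TEXT_NODE, text, childNodes: [] });

// <div>Hello, <template>test</template>World!</div>
const div = el('div', {}, txt('Hello, '), el('template', {}, txt('test')), txt('World!'));
console.log(textContentSkippingTemplates(div)); // "Hello, World!"
```

Keeping the shadow root handling behind a flag reflects that the issue leaves that question open; the default follows the plain `<template />` case.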

I'm on node-html-parser@6.1.5, which is the current latest.

taoqf commented 1 year ago

I'm afraid I don't think this lib will behave the same way as the browser on this. In the browser, `template` is a special element. We cannot do this now, because this lib is just an HTML parsing tool, and we expect it to be fast.