taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License

Tags are stripped #91

ali-habibzadeh closed 3 years ago

ali-habibzadeh commented 3 years ago

When tags are neither self-closing nor explicitly closed, the parser just strips them:

const { parse } = require("node-html-parser");

const html = parse(`
  <head>
  <div>3</div><iframe><p><h1>
  </head>`);
console.log(html.querySelector("head").innerHTML);

Logs:

  <div>3</div>

Expected:

  <div>3</div>
  <iframe></iframe>
  <p></p>
  <h1></h1>

taoqf commented 3 years ago

I am sorry to hear that, but it is designed to work like that. The result would not be what you expect even in a browser.

ali-habibzadeh commented 3 years ago

A browser would behave completely differently (as would JSDOM, Cheerio, etc.). They correct the DOM and move the misplaced elements into the correct container, which in this case would be BODY. Which browser did you try, if I may ask?

I picked this library over others because I needed a parser that would not correct the DOM but simply parse it. It does that, but it also drops tags that are neither self-closing nor properly closed. I have not seen a browser do that, so it would be good to know which one you tried.

taoqf commented 3 years ago

I checked the issues again and found this one. I do remember that I responded some time ago. You can try this in Chrome:

const div = document.createElement('div');
div.innerHTML = `<head>
  <div>3</div><iframe><p><h1>
  </head>`;
div;

You will see this result:

<div>
<div>3</div>
<iframe>
"<p><h1>
</head>"
</iframe>
</div>

See, that is not what you expected either.

I will not add that extra work, because it would slow down parsing.

Actually, you could fork this repo and change it as you wish.

I wish you good luck.
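
For anyone who needs the behaviour requested in this issue without forking, one option is a pre-processing pass that runs before the HTML reaches `parse`. The sketch below is only an illustration under stated assumptions: `closeUnclosedTags` and `VOID_TAGS` are hypothetical names, not part of node-html-parser, and the regex-based tokenizer assumes simple markup (no comments, no script or style bodies, no `>` inside attribute values). It gives every start tag that never receives a matching end tag an immediate end tag, so the tag survives as an empty element.

```javascript
// Hypothetical helper, NOT part of node-html-parser.
// Elements that never take an end tag in HTML.
const VOID_TAGS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "source", "track", "wbr",
]);

function closeUnclosedTags(html) {
  // Simplified tag matcher: assumes no ">" inside attribute values.
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9-]*)([^>]*)>/g;
  const tokens = [];
  for (const m of html.matchAll(tagPattern)) {
    const name = m[2].toLowerCase();
    tokens.push({
      start: m.index,
      end: m.index + m[0].length,
      closing: m[1] === "/",
      name,
      // Void elements and <x/> forms need no end tag.
      selfContained: VOID_TAGS.has(name) || m[3].trim().endsWith("/"),
      matched: false,
    });
  }

  // Pass 1: pair each end tag with the nearest open start tag of the same name.
  const open = []; // start tags still waiting for an end tag
  for (const token of tokens) {
    if (!token.closing) {
      if (token.selfContained) token.matched = true;
      else open.push(token);
      continue;
    }
    let i = open.length - 1;
    while (i >= 0 && open[i].name !== token.name) i--;
    if (i >= 0) {
      open[i].matched = true;
      token.matched = true;
      open.length = i; // start tags opened after it stay unmatched
    }
  }

  // Pass 2: rebuild the string, closing unmatched start tags on the spot.
  let out = "";
  let last = 0;
  for (const token of tokens) {
    out += html.slice(last, token.start);
    last = token.end;
    const text = html.slice(token.start, token.end);
    if (token.closing) {
      if (token.matched) out += text; // stray end tags are dropped
    } else {
      out += token.matched ? text : `${text}</${token.name}>`;
    }
  }
  return out + html.slice(last);
}

const fixed = closeUnclosedTags(`<head>
  <div>3</div><iframe><p><h1>
  </head>`);
console.log(fixed);
```

The returned string could then be handed to `parse` in place of the raw input. Note the design choice: unclosed tags become empty siblings, as in the "Expected" output above, rather than nested containers, and the extra pass costs time outside the parser rather than inside it.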