taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License

Wrong output on malformed HTML #268

Open amartini opened 7 months ago

amartini commented 7 months ago

I know it's hard to anticipate every kind of malformed HTML, but I came across this one while scraping a website. A misplaced apostrophe just before the > of the <a> tag makes the parser skip the rest of the row. The same markup displays correctly in browsers (the invalid token is discarded). If you remove the ', the code runs correctly.

import { parse } from 'node-html-parser';

const html = `
<table id="mytable">
<tr class="myrow">
  <td>1</td>
  <td><a href="#" 2'>x</a></td>
  <td>2</td>
</tr>
<tr class="myrow">
  <td>3</td>
  <td><a href="#" 2'>x</a></td>
  <td>4</td>
</tr>
</table>
`;

const root = parse(html);

// Print the text of each direct <td> child, one array per row.
for (const tr of root.querySelectorAll("#mytable tr.myrow")) {
  console.log(tr.querySelectorAll(":scope > td").map(e => e.innerText));
}
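
As a possible stopgap until the parser handles this case, the markup could be cleaned before parsing. The sketch below is only an illustration: `dropStrayAttributeTokens` is a hypothetical helper, not part of node-html-parser, and it assumes that attribute values never contain a `>` and that the input has no `<` inside scripts or comments that looks like a start tag. It keeps well-formed attributes and drops any other token inside a start tag, such as the stray `2'`.

```typescript
// Hypothetical pre-processing helper; not part of node-html-parser.
// Assumption: attribute values never contain ">".

// A start tag: "<", tag name, attribute area, optional "/", ">".
const START_TAG = /<([A-Za-z][^\s\/>]*)([^>]*?)(\/?)>/g;

// Splits the attribute area into tokens: either a well-formed attribute
// (name, name=value, name="value", name='value') or any other run of
// non-whitespace characters.
const ATTR_TOKEN =
  /[^\s"'=<>\/]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'=<>`]+))?(?=\s|$)|\S+/g;

// Anchored check used to decide whether a token is a well-formed attribute.
const WELL_FORMED_ATTR =
  /^[^\s"'=<>\/]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'=<>`]+))?$/;

function dropStrayAttributeTokens(html: string): string {
  return html.replace(
    START_TAG,
    (_tag: string, name: string, attrs: string, slash: string): string => {
      const tokens: string[] = attrs.match(ATTR_TOKEN) || [];
      // Keep only tokens that look like real attributes; drop the rest.
      const kept = tokens.filter((token) => WELL_FORMED_ATTR.test(token));
      const attrText = kept.map((token) => ` ${token}`).join("");
      return `<${name}${attrText}${slash}>`;
    }
  );
}
```

With this in place, the reproduction above would call `parse(dropStrayAttributeTokens(html))` instead of `parse(html)`. Since removing the `'` by hand makes the example run correctly, this should have the same effect, but it is a workaround for the caller and not a fix for the parser itself.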