taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License

Regression: Versions >= v5.3.2 are unable to parse specific link #280

Open stalgiag opened 1 month ago

stalgiag commented 1 month ago

I work for a project that validates its links using this library. One link that is frequently validated is the HTML spec at https://html.spec.whatwg.org/. This page is one of the larger HTML files on the web, but node-html-parser parsed it in approximately 23 seconds on my local machine up to and including release 5.3.1.

Consider this example:

const HTMLParser = require('node-html-parser');
const nFetch = require('node-fetch');

async function parseHTMLSpec() {
  try {
    const response = await nFetch('https://html.spec.whatwg.org/');
    const html = await response.text();

    console.log('Fetched HTML. Attempting to parse...');
    console.time('parseHTMLSpec');
    const parsedHTML = HTMLParser.parse(html);
    console.timeEnd('parseHTMLSpec');

    console.log('HTML parsed successfully.');
    console.log('Title:', parsedHTML.querySelector('title').text);
  } catch (error) {
    console.error('Error occurred:', error);
  }
}

parseHTMLSpec();

With node-html-parser 5.3.1, this outputs the following:

Fetched HTML. Attempting to parse...
parseHTMLSpec: 23.415s
HTML parsed successfully.
Title: HTML Standard

With node-html-parser 5.3.2, this hangs indefinitely, printing only the following even after running for hours:

Fetched HTML. Attempting to parse...

taoqf commented 1 day ago

Sorry for the bad experience. I released a beta version, node-html-parser@6.1.15-0, but I could not test it because of its large memory usage. Could you test it for me? Thank you.
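
For anyone testing a candidate release against a large document, one way to avoid waiting hours on a hang is to run the parse in a worker thread and give up after a deadline. The following is a minimal sketch, not part of node-html-parser: the helper names `runWithDeadline` and `parseWithDeadline` are hypothetical, and it assumes only Node's built-in `worker_threads` module plus the `parse` and `querySelector` calls already used in the script above.

```javascript
const { Worker } = require('node:worker_threads');

// Hypothetical helper: run `source` in a worker thread and reject if it has
// not posted a message within `timeoutMs`. worker.terminate() stops the
// worker even when it is stuck in a synchronous loop.
function runWithDeadline(source, workerData, timeoutMs) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(source, { eval: true, workerData });
    const timer = setTimeout(() => {
      worker.terminate();
      reject(new Error(`worker did not finish within ${timeoutMs} ms`));
    }, timeoutMs);
    worker.once('message', (result) => {
      clearTimeout(timer);
      resolve(result);
    });
    worker.once('error', (err) => {
      clearTimeout(timer);
      reject(err);
    });
  });
}

// Worker body: parse the HTML it was given and report timing and the title.
// Assumes node-html-parser is installed at the version under test.
const parseSource = `
  const { parentPort, workerData } = require('node:worker_threads');
  const { parse } = require('node-html-parser');
  const start = Date.now();
  const root = parse(workerData.html);
  const title = root.querySelector('title');
  parentPort.postMessage({
    ms: Date.now() - start,
    title: title ? title.text : null,
  });
`;

// Hypothetical wrapper: resolves with { ms, title } or rejects on timeout.
function parseWithDeadline(html, timeoutMs) {
  return runWithDeadline(parseSource, { html }, timeoutMs);
}
```

With the HTML already fetched as in the reproduction script, `await parseWithDeadline(html, 120000)` would either resolve with the elapsed time and title or reject after two minutes. The 120000 ms budget is an arbitrary choice, picked to sit well above the roughly 23 seconds reported for 5.3.1.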