taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License

100% cpu while parsing document #260

Open papirosko opened 9 months ago

papirosko commented 9 months ago

This simple code hangs, causing Kubernetes to kill the pod:

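import { parse } from 'node-html-parser';

// Load https://www.a1supplements.com/ (global fetch needs Node 18+)
const resp = await fetch('https://www.a1supplements.com/');
const html = await resp.text();
const root = parse(html);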

I use:

    "node-html-parser": "^6.1.10",

Host: https://www.a1supplements.com/, HTML size: 4,620,497 characters

This is what I get from node inspect:

break in node_modules/node-html-parser/dist/nodes/html.js:1192
 1190                     oneBefore.removeChild(last);
 1191                     last.childNodes.forEach(function (child) {
>1192                         oneBefore.appendChild(child);
 1193                     });
 1194                 }

The page contents (in case the website is updated):

a1supplements.com.html.txt

papirosko commented 9 months ago

Actually, it finished parsing after 7 minutes on my MacBook Pro with an i7. Is that considered correct?

papirosko commented 9 months ago

As a workaround I use this (I mostly need data only from ):

    const root = parse(html, {
        parseNoneClosedTags: false,
        fixNestedATags: false,
        blockTextElements: {
            'div': true,
            'p': true,
            'pre': true
        }
    });
    const title = option(root.querySelector('title'))
        .map(x => x.text)
        .filter(x => !!x && x.trim().length > 0);
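
The option(...) wrapper in the snippet above is not defined anywhere in this thread and is not part of node-html-parser; it looks like an Option-style helper from some other library. A dependency-free sketch of the same lookup, assuming only that root exposes querySelector() and that elements carry a text property, might look like this (extractTitle is a hypothetical name):

```javascript
// Hypothetical helper, not part of node-html-parser.
// Returns the trimmed <title> text, or null when the element is missing
// or contains only whitespace. `root` is any object exposing
// querySelector(), such as the value returned by parse().
function extractTitle(root) {
    const el = root.querySelector('title');
    if (!el) return null;
    const text = typeof el.text === 'string' ? el.text.trim() : '';
    return text.length > 0 ? text : null;
}
```

Unlike the original chain, this sketch returns the trimmed text rather than the raw text.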

taoqf commented 9 months ago

I'm sorry, I could not find any clue about your use case. I could not even find a title element. I don't have a MacBook either, but I parsed the file you uploaded and it finished parsing immediately.

papirosko commented 9 months ago

Using:

  ...
  "dependencies": {
    "axios": "^1.6.2",
    "node-html-parser": "^6.1.10"
  }
  ...

This is how to reproduce it (using default options in parse):

import parse from 'node-html-parser';
import axios from 'axios';

async function runImpl() {
    const url = 'https://www.a1supplements.com/';
    const resp = await axios.get(url);
    const html = resp.data;

    const start = Date.now();
    parse(html);
    const duration = Date.now() - start;
    console.log(`Parsing took: ${duration.toLocaleString()}ms, document size: ${html.length.toLocaleString()} chars`)
}

runImpl().then(() => console.log('done'))

result:

Parsing took: 383,073ms, document size: 4,705,196 chars
done

Using custom options in parse:

    ...
    const start = Date.now();
    parse(html, {
        parseNoneClosedTags: false,
        fixNestedATags: false,
        blockTextElements: {
            'div': true,
            'p': true,
            'pre': true,
            script: true,
            noscript: true,
            style: true,
        }
    });
    const duration = Date.now() - start;
    ...

results:

Parsing took: 284ms, document size: 4,705,196 chars
done
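
To make the two measurements above easier to repeat side by side, here is a minimal sketch of a timing helper. The names timeParse, compareParseOptions and optionSets are hypothetical and not part of node-html-parser; the parse function is passed in as an argument, so the helper itself assumes nothing about the library beyond the parse(html, options) call shape used in this thread.

```javascript
// Hypothetical helper: times one synchronous parse call and returns
// the elapsed milliseconds. `parseFn` is called as parseFn(html, options).
function timeParse(parseFn, html, options) {
    const start = Date.now();
    parseFn(html, options);
    return Date.now() - start;
}

// Runs the same document through several named option sets and collects
// the elapsed milliseconds for each label.
function compareParseOptions(parseFn, html, optionSets) {
    const results = {};
    for (const [label, options] of Object.entries(optionSets)) {
        results[label] = timeParse(parseFn, html, options);
    }
    return results;
}

// The two configurations compared in this thread.
const optionSets = {
    defaults: undefined, // no options object, i.e. the library defaults
    custom: {
        parseNoneClosedTags: false,
        fixNestedATags: false,
        blockTextElements: {
            div: true,
            p: true,
            pre: true,
            script: true,
            noscript: true,
            style: true,
        },
    },
};

// Usage, assuming `parse` is imported from 'node-html-parser' and `html`
// holds the downloaded page:
//   console.log(compareParseOptions(parse, html, optionSets));
```

Because each run is synchronous and blocks the event loop, a slow option set delays everything after it; running the fast configuration first gives quicker feedback.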

I believe that adding div to blockTextElements is what generally fixes the issue.

taoqf commented 9 months ago

I don't think we should treat div elements as block text elements. Maybe the HTML is broken; the option parseNoneClosedTags: true will speed up parsing.

sidpremkumar commented 5 months ago

parseNoneClosedTags: true

Not sure if it is related, but I had a big HTML page and adding this fixed it.

Curious what this option means, @taoqf. I couldn't find any documentation or issues about it.