taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.

Memory leak #64

Open AnatoliiVorobiov opened 4 years ago

AnatoliiVorobiov commented 4 years ago

I parsed many different pages and tried to extract all of the links from them. I collect the acquired URLs into an array, but over time I get a 'heap out of memory' error. I made a memory dump and discovered that your library returns sliced strings in some cases: I end up storing small strings that stay linked to the large source strings, which eventually causes the out-of-memory error. I used the 'getAttribute' function. I recommend adding a note to the docs so that users can avoid this situation in the future, or providing an additional function that makes a deep copy.

Code that reproduces the issue (any HTML with links will do):

const fs = require('fs');
const HTMLParser = require('node-html-parser');

(async () => {
  const array = [];
  setInterval(() => {
    fs.readFile('tmp.html', (err, buf) => {
      let lines = buf.toString();
      const root = HTMLParser.parse(lines);
      for (const elem of root.querySelectorAll('a')) {
        array.push(elem.getAttribute('href'));
      }
    })
  }, 2000);
})();
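
If the cause really is what is described above, small attribute strings keeping a reference to the full page source, one possible workaround (a sketch, not an API of this library) is to copy each value into a new, independent string before storing it, for example via Buffer:

// Variant of the loop above that stores an independent copy of each href,
// so the short strings no longer keep the whole page source alive
// (assuming the retention comes from substrings referencing the source).
for (const elem of root.querySelectorAll('a')) {
  const href = elem.getAttribute('href');
  if (href) {
    array.push(Buffer.from(href, 'utf8').toString('utf8'));
  }
}
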
taoqf commented 4 years ago

Sorry, I could not reproduce this.

Dj-Polyester commented 3 years ago

(quotes @AnatoliiVorobiov's original report and reproduction code in full, see above)

@AnatoliiVorobiov, could you provide your HTML file so that the source of the error can be found more easily? I am also using this module in my project, and there are other people like me who are likely to run into the same issue. Thanks.

taoqf commented 3 years ago

This should be resolved after v3.1.1.

taoqf commented 3 years ago

This should be resolved as of v3.1.1.

mpcabete commented 2 years ago

I had a problem parsing 1000 HTML documents one at a time (removing the reference to the previous one before parsing the next); Node.js threw the error: Allocation failed - JavaScript heap out of memory. I switched to jsdom and that solved the issue.

taoqf commented 2 years ago

In what case?

nonara commented 2 years ago

@mpcabete Sorry to hear you're having issues. I can't think of anywhere in the codebase that such a leak would be happening, but I'd be happy to take a look. Can you provide a snippet of code or repo that will reproduce the issue?

ThibaudLopez commented 1 year ago

I also get JavaScript heap out of memory.

I don't know whether the problem is in my code, node-html-parser, V8, or Node.js, but it doesn't seem normal for node-html-parser to consume this much memory.

For instance, one million '<foo></foo>' strings take only 11 MB of memory; node-html-parser can parse all of them successfully, and I can store the results in memory.

But add just one child element, so that one million '<foo><bar/></foo>' strings come to only 17 MB, and it crashes. That seems dramatic.

If I increase the V8 heap size with --max-old-space-size=2500 then it works, but I shouldn't need 2.5 GB of memory to parse and store 17 MB of strings.

Steps to reproduce:

// Linux 5.15.0-60-generic #66-Ubuntu SMP Fri Jan 20 14:29:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
// node v16.15.0
// node v18.12.1
// npm 8.5.5
// node-html-parser@6.1.4
var { parse } = require('node-html-parser');

This succeeds:

var a = []; for (var i = 0; i < 10**6; i++) a.push(parse('<foo></foo>'));

This crashes:

var a = []; for (var i = 0; i < 10**6; i++) a.push(parse('<foo><bar/></foo>'));
[89570:0x5dfe540]    82635 ms: Scavenge (reduce) 2044.6 (2082.1) -> 2044.2 (2082.3) MB, 4.2 / 0.0 ms  (average mu = 0.159, current mu = 0.002) allocation failure 
[89570:0x5dfe540]    82641 ms: Scavenge (reduce) 2045.0 (2082.3) -> 2044.6 (2082.6) MB, 3.7 / 0.0 ms  (average mu = 0.159, current mu = 0.002) allocation failure 
[89570:0x5dfe540]    82648 ms: Scavenge (reduce) 2045.3 (2082.6) -> 2044.8 (2082.8) MB, 5.5 / 0.0 ms  (average mu = 0.159, current mu = 0.002) allocation failure 

<--- JS stacktrace --->

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
 1: 0xb09c10 node::Abort() [node]
 2: 0xa1c193 node::FatalError(char const*, char const*) [node]
 3: 0xcf8dbe v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
 4: 0xcf9137 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
 5: 0xeb09d5  [node]
 6: 0xeb14b6  [node]
 7: 0xebf9de  [node]
 8: 0xec0420 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
 9: 0xec339e v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
10: 0xe848da v8::internal::Factory::NewFillerObject(int, bool, v8::internal::AllocationType, v8::internal::AllocationOrigin) [node]
11: 0x11fd626 v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, v8::internal::Isolate*) [node]
12: 0x15f2099  [node]
Aborted (core dumped)
nonara commented 1 year ago

A leak is when something that should be garbage collected is retained. In your case, you add each result to an array, so that memory cannot be freed.

In terms of memory usage, you're effectively parsing one million HTML documents, which creates a full AST for each. Each AST consists of multiple nodes that are class instances, and that will certainly take more memory than a single string.

Generally speaking, when you need to perform actions on millions of records, you're better off processing one item (or a reasonably sized chunk) at a time and moving on to the next item/chunk once it is finished, so you're not filling up memory needlessly.
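
A minimal sketch of that approach (the file list, chunk size, and extractHrefs helper here are hypothetical, not part of this library):

const fs = require('fs/promises');
const { parse } = require('node-html-parser');

// Extract only the primitive data we need, so the parsed AST
// can be garbage collected once the function returns.
function extractHrefs(html) {
  return parse(html)
    .querySelectorAll('a')
    .map((a) => a.getAttribute('href'))
    .filter(Boolean);
}

async function run(files, chunkSize = 50) {
  const hrefs = [];
  for (let i = 0; i < files.length; i += chunkSize) {
    // Process one chunk of documents, keep only the href strings,
    // and let the document ASTs go out of scope before the next chunk.
    for (const file of files.slice(i, i + chunkSize)) {
      const html = await fs.readFile(file, 'utf8');
      hrefs.push(...extractHrefs(html));
    }
  }
  return hrefs;
}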

It could be worth running a profiler to investigate memory leaks, but leaks are rare, and I haven't seen any indication of them.

As for bloat, that could be worth investigating. I'd need to see some numbers showing the memory size per node plus the base document AST size. If it seems off, I can check it out. Unfortunately, I'm insanely backlogged and I only volunteer on this when I'm able, so I'd need specific and compelling data.
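
For anyone who wants to gather those numbers, a rough sketch of one way to do it (run with node --expose-gc so that global.gc() is available; the sample markup and iteration count are arbitrary):

const { parse } = require('node-html-parser');

// Force GC, snapshot heapUsed, parse n documents while keeping them
// referenced, then snapshot again and report the average retained
// bytes per parsed document.
function measure(html, n = 100000) {
  global.gc(); // requires --expose-gc
  const before = process.memoryUsage().heapUsed;
  const roots = [];
  for (let i = 0; i < n; i++) roots.push(parse(html));
  global.gc();
  const after = process.memoryUsage().heapUsed;
  console.log(html, '~' + Math.round((after - before) / n) + ' bytes per document');
  return roots.length; // keep roots referenced through the measurement
}

measure('<foo></foo>');
measure('<foo><bar/></foo>');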