taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License

Memory leak HTMLElement#text and HTMLElement#rawText #106

Closed ergcode closed 3 years ago

ergcode commented 3 years ago

Hey! I ran into a strange problem (video: https://youtu.be/TJSj7k4pyHo).

I get the text of an element using "node-html-parser" and then store it in an object.

const root = HTMLParser.parse(content);
let name = root.querySelector(`element`);

If I read the text directly, memory leaks (as the video shows, memory usage keeps growing until the process fails with an error).

name = name.text;

But if I "clone" the text, RAM consumption stays normal.

name = JSON.parse(JSON.stringify(name.text));

taoqf commented 3 years ago

Sorry, I cannot access youtu.be because my government blocks it. This may have the same cause as https://github.com/taoqf/node-html-parser/issues/64. Could you please provide some HTML so that I can help find the cause?

ergcode commented 3 years ago

Example file: view-source_https___epicentrk.ua_shopkirpich-ogneupornyy.zip
Link for example: https://epicentrk.ua/shop/kirpich-ogneupornyy/
Video with memory leak (8.4 MB): https://user-images.githubusercontent.com/23062770/111903066-6962d180-8a51-11eb-8d31-dfc2cc84be15.mp4
Video in which the variable is cloned (7.9 MB): https://user-images.githubusercontent.com/23062770/111903133-c5c5f100-8a51-11eb-8214-8052a15f4295.mp4

I create an object that is visible to all functions in this file.

const cat = {};

After requesting the page, I pass the text from "content" to "node-html-parser" and try to get the text for the "name" variable.

        let root = HTMLParser.parse(content);
        let name;
        try {
            name = root.querySelector(`.shop-categories__title`);
            if (!name) name = root.querySelector(`.headList h1`);
            name = JSON.parse(JSON.stringify(name.text)).trim();
            // name = name.text.trim();
        } catch (error) { console.error(error); }

        if (!name) {
            console.error(`not found name`);
        }

I get the "parent" and "lang" keys and, if necessary, create a new entry in "cat". Then I store the "name" value there under the key for the desired language.

        if (!cat[parent]) cat[parent] = {
            name: {},
        };
        cat[parent][`name`][lang] = name;

I may be wrong, but maybe "node-html-parser" binds some functions to the text variable, and for this reason the garbage collector cannot free the memory?
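
One possible explanation, not confirmed anywhere in this thread: V8 can represent the result of a substring operation as a view onto the original string, so a short retained string may keep a much larger source string alive. The sketch below illustrates that idea without using node-html-parser at all; `makePage`, `extractTitle`, and `detach` are hypothetical names, and the sizes are arbitrary.

```javascript
// Sketch only: how a short string might keep a much larger one alive in V8.
// Nothing here uses node-html-parser; all names and sizes are made up.

function makePage(i) {
  // A fresh ~1 MB string per iteration stands in for a large HTML document.
  return `<h1>Category number ${i}, title text</h1>` + 'x'.repeat(1 << 20);
}

function extractTitle(page) {
  // slice() on a long string may return a view that references `page`
  // instead of copying the characters (an engine implementation detail).
  return page.slice(page.indexOf('>') + 1, page.indexOf('</h1>'));
}

function detach(str) {
  // Round-tripping through JSON builds a new string with no link to the source.
  return JSON.parse(JSON.stringify(str));
}

const retained = [];
for (let i = 0; i < 50; i++) {
  retained.push(detach(extractTitle(makePage(i))));
}

console.log(retained[0]);
console.log(`${Math.round(process.memoryUsage().heapUsed / 1e6)} MB heap used`);
```

To compare, replace `detach(extractTitle(...))` with the bare `extractTitle(...)` call and watch the reported heap size. Whether the two variants differ depends on the engine version, so treat this as an experiment rather than a diagnosis.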

taoqf commented 3 years ago

This is really strange. I ran some tests with your file and could not find any clue at all.

taoqf commented 3 years ago

https://github.com/taoqf/node-html-parser/commit/8096749b4277dd67ed9d1353a3a3f7f8cd649506

ergcode commented 3 years ago

I reworked your example to reproduce the error: I moved "const cat = {};" to the outer scope, so the data in this variable now accumulates. Previously, it was discarded on every iteration.

const fs = require(`fs`);
const HTMLParser = require(`node-html-parser`);

const cat = {};
const lang = 'en';

(async () => {
    const content = fs.readFileSync(`./view-source_https___epicentrk.ua_shop_kirpich-ogneupornyy_.html`, `utf-8`);
    let i = 0;
    while (++i < 10000) {
        let root = HTMLParser.parse(content);
        let name;
        const parent = Math.random().toString();
        try {
            name = root.querySelector(`.shop-categories__title`);
            if (!name) name = root.querySelector(`.headList h1`);
            // name = JSON.parse(JSON.stringify(name.text)).trim();
            // name = name.text.trim();
            name = name.text;
            if (!cat[parent]) {
                cat[parent] = {
                    name: {},
                };
            }
            cat[parent][`name`][lang] = name;
        } catch (error) { console.error(error); }

        if (!name) {
            console.error(`not found name`);
        }
    }
})();

As a result, cloning with name = JSON.parse(JSON.stringify(name.text)).trim(); showed stable memory consumption over 10,000 cycles.

At the same time, writing data without cloning very quickly led to a memory overflow:

<--- Last few GCs --->

[4060:0x6199760]    48338 ms: Mark-sweep (reduce) 2046.5 (2066.4) -> 2046.2 (2065.6) MB, 13.0 / 0.0 ms  (+ 2.8 ms in 2 steps since start of marking, biggest step 2.7 ms, walltime since start of marking 17 ms) (average mu = 0.288, current mu = 0.226) final
[4060:0x6199760]    48354 ms: Mark-sweep (reduce) 2046.5 (2065.6) -> 2046.2 (2065.6) MB, 11.0 / 0.0 ms  (+ 5.2 ms in 3 steps since start of marking, biggest step 2.8 ms, walltime since start of marking 17 ms) (average mu = 0.242, current mu = 0.182) final

<--- JS stacktrace --->

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory

 1: 0xa877f0 node::Abort() [node]
 2: 0x9abe29 node::FatalError(char const*, char const*) [node]
 3: 0xc6ea6e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
 4: 0xc6ede7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
 5: 0xe38865  [node]
 6: 0xe3940c  [node]
 7: 0xe46d9b v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
 8: 0xe49485 v8::internal::Heap::HandleGCRequest() [node]
 9: 0xdec3e7 v8::internal::StackGuard::HandleInterrupts() [node]
10: 0x1129283 v8::internal::NativeRegExpMacroAssembler::CheckStackGuardState(v8::internal::Isolate*, int, v8::internal::RegExp::CallOrigin, unsigned long*, v8::internal::Code, unsigned long*, unsigned char const**, unsigned char const**) [node]
11: 0x141bd6c v8::internal::RegExpMacroAssemblerX64::CheckStackGuardState(unsigned long*, unsigned long, unsigned long) [node]
12: 0x38212b60768e 

So far, the quickest way to solve the problem is to clone the received text data. That way, the variable loses its bound method references. I'll try to read your code this weekend.
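
For comparing the cloned and non-cloned variants with numbers rather than video, a small heap sampler can help. This is a minimal sketch: `sampleHeap` and its parameters are illustrative names, not part of any library, and the workload shown is a stand-in for the loop body of the reproduction.

```javascript
// Sketch: sample heap usage while repeatedly running a step function.
function sampleHeap(step, iterations, every) {
  const samples = [];
  for (let i = 1; i <= iterations; i++) {
    step(i);
    if (i % every === 0) {
      // global.gc only exists when node is started with --expose-gc.
      if (typeof global.gc === 'function') global.gc();
      samples.push({
        iteration: i,
        heapMB: process.memoryUsage().heapUsed / (1024 * 1024),
      });
    }
  }
  return samples;
}

// Stand-in workload; in the reproduction, `step` would parse the page
// and store the extracted name.
const store = {};
const samples = sampleHeap((i) => { store[i] = `name-${i}`; }, 1000, 250);
for (const s of samples) {
  console.log(`${s.iteration}: ${s.heapMB.toFixed(1)} MB`);
}
```

Running it with `node --expose-gc` forces a collection before each sample, which makes steady growth easier to tell apart from garbage that has not been collected yet.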

taoqf commented 3 years ago

OK, I will do more tests about this.

taoqf commented 3 years ago

It is really strange. I tried several versions of Node.js and the behavior seems to be the same. I think something is wrong in the module "he", but I could not find the cause; that library just does some string replacement. Anyway, I have added the cloning step to the encoding code. Try the newest version; it may resolve this case.

taoqf commented 3 years ago

This issue may be resolved after v3.1.1. Sorry for replying late.

nonara commented 3 years ago

Hi all. Do we know for certain that this is resolved? If not, I'd like to take a look into it.

ergcode commented 3 years ago

> Hi all. Do we know for certain that this is resolved? If not, I'd like to take a look into it.

"dependencies": { "node-html-parser": "^4.1.5" }

Sorry, I forgot to answer. The issue is fixed; I just ran the tests. This issue can be closed.

nonara commented 3 years ago

Great! Thanks for the diligence!