ulixee / hero

The web browser built for scraping
MIT License

JavaScript heap out of memory #262

Open mrxdev-git opened 1 month ago

mrxdev-git commented 1 month ago

Hi, I'm trying to scrape around 5000 links from a JS-rendered site, and after every few hundred requests I get this error:

<--- Last few GCs --->

[67989:0x120008000]  2914635 ms: Scavenge 1966.5 (2080.3) -> 1962.5 (2080.3) MB, 3.17 / 0.00 ms  (average mu = 0.938, current mu = 0.960) task; 
[67989:0x120008000]  2914702 ms: Scavenge 1970.3 (2082.7) -> 1964.0 (2080.4) MB, 2.71 / 0.04 ms  (average mu = 0.938, current mu = 0.960) task; 
[67989:0x120008000]  2914939 ms: Scavenge 1969.7 (2080.4) -> 1965.1 (2096.4) MB, 10.29 / 0.00 ms  (average mu = 0.938, current mu = 0.960) task; 

<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
 1: 0x102f6cbf4 node::Abort() [/usr/local/bin/node]
 2: 0x102f6cddc node::ModifyCodeGenerationFromStrings(v8::Local<v8::Context>, v8::Local<v8::Value>, bool) [/usr/local/bin/node]
 3: 0x1030f0da8 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [/usr/local/bin/node]
 4: 0x1032c56e8 v8::internal::Heap::GarbageCollectionReasonToString(v8::internal::GarbageCollectionReason) [/usr/local/bin/node]
 5: 0x1032c41c4 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/usr/local/bin/node]
 6: 0x10332afd0 v8::internal::MinorGCJob::Task::RunInternal() [/usr/local/bin/node]
 7: 0x102fcdfc4 node::PerIsolatePlatformData::RunForegroundTask(std::__1::unique_ptr<v8::Task, std::__1::default_delete<v8::Task>>) [/usr/local/bin/node]
 8: 0x102fccce0 node::PerIsolatePlatformData::FlushForegroundTasksInternal() [/usr/local/bin/node]
 9: 0x10393c5b4 uv__async_io [/usr/local/bin/node]
10: 0x10394e68c uv__io_poll [/usr/local/bin/node]
11: 0x10393cb78 uv_run [/usr/local/bin/node]
12: 0x102e9d754 node::SpinEventLoopInternal(node::Environment*) [/usr/local/bin/node]
13: 0x102fac8d8 node::NodeMainInstance::Run(node::ExitCode*, node::Environment*) [/usr/local/bin/node]
14: 0x102fac674 node::NodeMainInstance::Run() [/usr/local/bin/node]
15: 0x102f37030 node::Start(int, char**) [/usr/local/bin/node]
16: 0x18642bf28 start [/usr/lib/dyld]

Any ideas what could be causing this error and how to solve it? I tested on macOS Ventura 13.5.2 (M2) and Debian 11.

blakebyrnes commented 1 month ago

It seems like either you are not closing Hero sessions, or there's some kind of leak. Can you share a simple, reproducible example?

mrxdev-git commented 1 month ago

@blakebyrnes Here is a simplified version of my code. It processes around 600 links and then crashes with the error above. It starts out fast, but gets progressively slower as it goes, until it crashes.

import fs from 'fs';
import HeroCore from '@ulixee/hero-core';
import { TransportBridge } from '@ulixee/net';
import Hero, { ConnectionToHeroCore } from '@ulixee/hero';

function readUrlsFromFile(filePath) {
    try {
        const fileContent = fs.readFileSync(filePath, 'utf-8');
        return fileContent
            .split('\n')
            .map(line => line.trim())
            .filter(line => line.length > 0);
    } catch (error) {
        console.error('Error reading the file:', error);
        throw error;
    }
}

(async () => {
    const links = readUrlsFromFile('links.txt');

    // Wire a Hero client directly to an in-process Core.
    const bridge = new TransportBridge();
    const connectionToCore = new ConnectionToHeroCore(bridge.transportToCore);

    const heroCore = new HeroCore();
    heroCore.addConnection(bridge.transportToClient);

    const options = {
        connectionToCore,
        blockedResourceTypes: [
            'BlockImages',
            'BlockCssAssets',
            'BlockFonts',
            'BlockMedia',
            'BlockIcons',
        ],
        viewport: {
            width: 1280,
            height: 1024,
        },
        showChromeInteractions: false,
        showChrome: false,
        sessionPersistence: false,
    };

    // One Hero instance (one session) handles every link.
    const browser = new Hero(options);

    try {
        for (const link of links) {
            try {
                await browser.goto(link);
                // await browser.waitForPaintingStable();

                const price_tag = await browser.waitForElement(
                    browser.xpathSelector("//span[text()[contains(.,'Some text')]]"),
                    { timeoutMs: 10e3 },
                );

                if (price_tag) {
                    const price = await price_tag.parentNode.querySelector('div > span').textContent;
                    console.log(price);
                } else {
                    throw new Error('No card price tag');
                }
            } catch (er) {
                console.log(er.message);
            }

            await browser.waitForMillis(2e3);
        }
    } catch (err) {
        console.log(err.message);
    } finally {
        await browser.close();
    }
})();
blakebyrnes commented 1 month ago

Got it. Thanks.

This approach won't work well with Hero. You're unintentionally creating a single Hero session for all of your activity. Hero is built to handle each of your links in its own session (or some small subset that might be considered a single "action" by a user). You'll have better luck working with its design if you break things up into smaller chunks (batches of 100, for example).

Every time you close a session, Hero can clean up all the resources, navigations, etc. that it has collected. Until then, Hero keeps that information around in case you still want to act on it, because it's built on the assumption that you're reacting to items created during the session.
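A hedged sketch of the batching approach described above: the `chunk` helper is illustrative, and the commented loop reuses the `options` object and Hero lifecycle from the script above, opening a fresh Hero instance per batch and closing it in `finally` so Core can release that session's state.

```javascript
// Split an array of links into fixed-size batches (e.g. 100 per batch).
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Sketch of the per-batch session lifecycle (assumes the same `Hero`
// import and `options` object as the script above):
//
// for (const batch of chunk(links, 100)) {
//   const hero = new Hero(options);   // fresh session for this batch
//   try {
//     for (const link of batch) {
//       await hero.goto(link);
//       // ... extract data ...
//     }
//   } finally {
//     await hero.close();             // lets Core free the session's resources
//   }
// }
```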

mrxdev-git commented 1 month ago

@blakebyrnes Thank you so much for the explanation, everything is working fine now.

mehrdad-shokri commented 1 day ago

Apparently this fixes the issue (disconnecting the client connection and shutting down Core once you're done):

  await connectionToCore.disconnect();
  await heroCore.close();
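A hedged sketch of where those calls fit relative to the earlier script: the `shutdown` helper is illustrative (not a Hero API) and just makes the teardown order explicit, assuming the `browser`, `connectionToCore`, and `heroCore` variables from the script above.

```javascript
// Tear down in order: close the Hero session first, then drop the
// in-process transport, then shut down Core itself.
async function shutdown(browser, connectionToCore, heroCore) {
  await browser.close();               // release the session's collected state
  await connectionToCore.disconnect(); // drop the client-to-Core transport
  await heroCore.close();              // stop Core (and its browser processes)
}
```

In the original script, these calls would replace the lone `browser.close()` in the `finally` block, so the process can exit cleanly.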