ulixee / hero

The web browser built for scraping

[Heap out of Memory] - Session Db not cleaned up and locked by process #312

Open SamuelD28 opened 2 weeks ago

SamuelD28 commented 2 weeks ago

Hi,

Love the project so far. I'm just running into this issue, which is a real problem when running on limited resources. I am still investigating and going through the Ulixee code base, but for the sake of speed I am posting the issue here as well.

The .db file is not removed at the end of a session, even with sessionPersistence set to false. The file is also locked by the process, which makes it impossible to remove. Eventually this leads to a heap out of memory error, which I confirmed using the V8 profiler.

Tested on Windows 11 and on Ubuntu 20.04 running inside Docker, with the same result.

Here is a minimal setup to reproduce:

1 - Copy both files listed below to a directory
2 - cd into the directory
3 - Open a terminal and run "node client.js"
4 - Open a terminal and run "node main.js"

As you can see, the heap usage keeps increasing and the .db files in /tmpDirectory/hero-sessions/ are not deleted at the end of a session. Attempting to delete those files manually results in an EBUSY error.

// client.js
import { CloudNode } from '@ulixee/cloud';
import { Session } from '@ulixee/hero-core';
import { memoryUsage } from 'process';

async function log(type, message) {
  const now = new Date().toUTCString();
  console.log(
    `\x1b[35m[${type}] \x1b[0m\x1b[33m[${now}]\x1b[0m -> ${message}`
  );
}

function getMemoryUsage() {
  const memory = memoryUsage();
  return `rss: ${memory.rss / 1000000}mb, arrayBuffers: ${memory.arrayBuffers / 1000000}mb, external: ${memory.external / 1000000}mb, heapTotal: ${memory.heapTotal / 1000000}mb, heapUsed: ${memory.heapUsed / 1000000}mb`;
}

(async () => {
  const cloudNode = new CloudNode({
    port: process.env.SCRAPER_CORE_PORT,
    rejectUnauthorized: true,
  });

  Session.events.on("new", function (data) {
    log('Perf', getMemoryUsage());
    log("Event", `Opening new connection: ${data.session.id}`);
    log("Info", `Current listener count: ${cloudNode.heroCore.connections.size}`);
  });

  await cloudNode.listen();
  log('Event', `CloudNode started on port: ${await cloudNode.port}`);
})().catch(error => {
  // log() only takes (type, message), so the error is folded into the message.
  log('Event', `Error starting Ulixee CloudNode: ${error}`);
  process.exit(1);
});

// main.js
import Hero, { ConnectionToHeroCore } from "@ulixee/hero";

function wait() {
  return new Promise((resolve) => {
    setTimeout(() => resolve(), 5000);
  })
}

async function main() {
  for (; ;) {
    console.log("fetching")
    const connectionToCore = ConnectionToHeroCore.remote("ws://localhost:1818");
    const browser = new Hero({
      connectionToCore,
      showChromeInteractions: false,
      showChrome: false,
      sessionPersistence: false,
      sessionKeepAlive: false,
    });

    const tab = await browser.newTab();
    await tab.goto("https://google.com");

    await wait();

    await browser.close();
    await connectionToCore.disconnect();
  }
}

main().then(() => {
  console.log("started");
})

StuartFuller commented 3 days ago

I have also had this issue. The SQLite DB files just continue to grow and eventually use up all disk space. Not ideal for a production process.

I've created a simple script using the recommended code from the docs:

https://ulixee.org/docs/hero/advanced-concepts/sessions#sessions
https://ulixee.org/docs/hero/advanced-concepts/deployment

import { CloudNode } from '@ulixee/cloud';
import { Session } from '@ulixee/hero-core';
import * as Fs from 'fs';

(async () => {
  Session.events.on('closed', async ({ id, databasePath }) => {

    Fs.unlink(databasePath, (err) => {
      if (err) throw err; //FAILS due to file being busy
      console.log(databasePath);
    });
  });

  const cloudNode = new CloudNode();
  await cloudNode.listen({ port: 1818 });
})();

The session 'closed' event fires, but when the handler calls Fs.unlink I get 'EBUSY: resource busy or locked'.
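
One side effect of the script above is that `throw err` inside the unlink callback is an uncaught exception, so a locked file can take the whole CloudNode process down. A minimal sketch of a gentler cleanup helper follows. `unlinkWithRetry` and its options are hypothetical names, not part of Hero or Ulixee, and retrying only helps if the lock is released later; it will not help while the same process keeps the database handle open, which is what this issue describes.

```javascript
// Hypothetical helper (not a Hero/Ulixee API): try to delete a file,
// retrying while the OS reports it as busy or locked.
// `unlinkFn` is injectable so the helper can be exercised without real files.
async function unlinkWithRetry(path, { retries = 5, delayMs = 200, unlinkFn } = {}) {
  // Default to Node's promise-based unlink when no stand-in is supplied.
  const unlink = unlinkFn ?? (await import('node:fs/promises')).unlink;

  for (let attempt = 0; ; attempt += 1) {
    try {
      await unlink(path);
      return true; // deleted
    } catch (err) {
      if (err.code === 'ENOENT') return true; // already gone
      const retryable = err.code === 'EBUSY' || err.code === 'EPERM';
      if (!retryable) throw err; // unexpected failure: let the caller decide
      if (attempt >= retries) return false; // still locked: give up, do not crash
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

Inside the 'closed' handler this could be called as `const removed = await unlinkWithRetry(databasePath);`, logging a warning when `removed` is false instead of throwing.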

SamuelD28 commented 3 days ago

@StuartFuller

I tried to delve into the code base to see if there was a quick fix. It turned into a bit of a rat's nest, and I was short on time and not very knowledgeable about the current project architecture. Here's what I found regarding the .db files not being cleaned up.

Two files are important:

DefaultSessionRegistry
HeroSessionsSearch

In HeroSessionsSearch, the sessions are accessed with retain(), which increments the connection count. When we close the session, it checks whether the connection count is < 1 (makes sense), but in practice the count is always 2. So the session dbs are never closed, since there still appear to be active connections. I tried using get() instead of retain() in HeroSessionsSearch, which fixed the issue: the .db files are removed.
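
To illustrate the retain() versus get() behaviour described above, here is a simplified, hypothetical model of a reference-counted registry. It is not Hero's actual DefaultSessionRegistry; the class, its fields, and the `release` method are assumptions made purely for illustration. It only shows why a holder that calls retain() and never releases keeps the database open (and its file locked), while a get() caller does not.

```javascript
// Hypothetical, simplified model of a ref-counted session registry.
// This is NOT Hero's implementation; all names here are illustrative.
class RefCountedRegistry {
  constructor() {
    this.entries = new Map(); // sessionId -> { db, refs }
  }

  // Look up (or lazily open) a db without registering as a holder.
  get(sessionId) {
    let entry = this.entries.get(sessionId);
    if (!entry) {
      entry = { db: { isOpen: true }, refs: 0 };
      this.entries.set(sessionId, entry);
    }
    return entry.db;
  }

  // Look up a db and register the caller as a holder.
  retain(sessionId) {
    const db = this.get(sessionId);
    this.entries.get(sessionId).refs += 1;
    return db;
  }

  // Drop one holder; close the db only when no holders remain.
  release(sessionId) {
    const entry = this.entries.get(sessionId);
    if (!entry) return false;
    entry.refs = Math.max(0, entry.refs - 1);
    if (entry.refs > 0) return false; // still held: stays open and locked
    entry.db.isOpen = false;
    this.entries.delete(sessionId);
    return true;
  }
}

const registry = new RefCountedRegistry();

// Case 1: a second holder retains the db and never releases it.
registry.retain('session-1'); // the session itself
registry.retain('session-1'); // e.g. a search index that forgets to release
const closedWithExtraHolder = registry.release('session-1');

// Case 2: the second consumer only reads through get().
registry.retain('session-2'); // the session itself
registry.get('session-2'); // reader that takes no ownership
const closedWithReaderOnly = registry.release('session-2');

console.log({ closedWithExtraHolder, closedWithReaderOnly });
```

In this model the first session's db never closes because one holder is left over, which matches the "count is always 2" observation; the second closes as soon as the session releases it.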

But the heap out of memory error is still there. I profiled the memory and noticed that session data (in my case, around 5-10 MB per session) is never removed from memory because there are active references in a couple of places. This is where I left off. My knowledge of this project is very limited, and I could not remove those references easily without breaking many things. I am in a crunch right now and can't afford to spend more time on the issue, so I went back to Puppeteer with stealth plugins. When the time is right I'll try to fix it, since I liked my experience so far and the bots were reliably undetected.
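
For anyone who wants to confirm the per-session growth described above before reaching for a full profiler, a rough sketch is to sample heapUsed after each session and look at the average increase. `createHeapTracker` is a hypothetical helper, not part of Hero; it relies only on Node's built-in process.memoryUsage(). Garbage collection timing makes individual samples noisy, so only a sustained upward trend over many sessions is meaningful.

```javascript
// Hypothetical helper (not a Hero API): record heapUsed over time and
// report the average growth between consecutive samples.
// `readHeapUsed` is injectable so the arithmetic can be checked in isolation.
function createHeapTracker(readHeapUsed = () => process.memoryUsage().heapUsed) {
  const samples = [];
  return {
    // Take one sample and return how many samples exist so far.
    sample() {
      samples.push(readHeapUsed());
      return samples.length;
    },
    // Average bytes gained per interval (0 until two samples exist).
    averageGrowth() {
      if (samples.length < 2) return 0;
      return (samples[samples.length - 1] - samples[0]) / (samples.length - 1);
    },
  };
}

// Usage idea: call tracker.sample() from a Session.events.on('closed', ...)
// handler (the event used earlier in this thread) and log averageGrowth().
const tracker = createHeapTracker();
tracker.sample();
```

If the average stays close to the per-session data size, Node's built-in v8.writeHeapSnapshot() can then be used to capture two snapshots and compare what is retaining the session objects.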

Cheers!

StuartFuller commented 3 days ago

@blakebyrnes Is this anything you could shed some light on? Seems like there is a memory leak of some sort which could also be preventing the db clean-ups.

blakebyrnes commented 3 days ago

This looks like it's happening because of the Ulixee Desktop APIs. There's an env var to turn off those APIs that might help, but I haven't tested it: ULX_DISABLE_DESKTOP_APIS=true