microlinkhq / metascraper

Get unified metadata from websites using Open Graph, Microdata, RDFa, Twitter Cards, JSON-LD, HTML, and more.
https://metascraper.js.org
MIT License

High memory usage #694

Closed: DanDubinsky closed this issue 8 months ago

DanDubinsky commented 9 months ago

Subject of the issue

High memory usage with some files, particularly rss memory.

Steps to reproduce

package.json

{
  "name": "metascraperTest",
  "version": "1.0.0",
  "description": "Testing metascraper memory leak",
  "main": "index.js",
  "license": "MIT",
  "dependencies": {
    "got": "^11.8.6",
    "metascraper": "5.44.0",
    "metascraper-description": "5.44.0",
    "metascraper-image": "5.44.0",
    "metascraper-title": "5.44.0",
    "metascraper-url": "5.44.0"
  }
}

index.js

const got = require('got');
const { setFlagsFromString } = require('v8');
const { runInNewContext } = require('vm');
const metascraper = require('metascraper')([
   require('metascraper-description')(),
   require('metascraper-image')(),
   require('metascraper-title')(),
   require('metascraper-url')(),
]);

const logMemory = (message) => {
   const mu = process.memoryUsage();
   const mbNow = mu.heapUsed / 1024 / 1024;
   const mbRounded = Math.round(mbNow);
   const mbTotal = mu.heapTotal / 1024 / 1024;
   const mbTotalRounded = Math.round(mbTotal);
   const mbRss = mu.rss / 1024 / 1024;
   const mbRssRounded = Math.round(mbRss);
   console.log(`${message} --- ${mbRounded}mb heap used,  ${mbTotalRounded}mb heap total, ${mbRssRounded}mb rss`);
};

const runTest = async () => {
   setFlagsFromString('--expose_gc');
   const gc = runInNewContext('gc');
   logMemory('Test starting');
   const url = 'https://docs.google.com/document/d/1bGgGlc1YXSiR3cwbGDhuZUF7bz9djnTlV1qMP0-xhMM?usp=sharing';
   let req = await got(url);
   let html = req.body;
   logMemory(`After fetching. HTML size is ${html.length} bytes`);
   for (let i = 0; i < 7; i++) {
      await metascraper({ url, html });
      logMemory(`After getting metadata`);
   }
   req=null;
   html=null;
   logMemory('Test finished, about to run garbage collection');
   gc();
   logMemory('After garbage collection');
};

runTest();

Expected behaviour

I would expect the memory usage to come down after the test finished and the garbage collector ran.

Actual behaviour

When scraping the URL 7 times in a loop, the process used over 1 GB of RSS memory. Even after garbage collection, most of it was not freed. Heap usage was also a little on the high side after garbage collection.

Test starting --- 31mb heap used,  57mb heap total, 99mb rss
After fetching. HTML size is 5084204 bytes --- 40mb heap used,  71mb heap total, 129mb rss
After getting metadata --- 211mb heap used,  247mb heap total, 312mb rss
After getting metadata --- 375mb heap used,  414mb heap total, 476mb rss
After getting metadata --- 559mb heap used,  601mb heap total, 667mb rss
After getting metadata --- 733mb heap used,  777mb heap total, 823mb rss
After getting metadata --- 907mb heap used,  953mb heap total, 1000mb rss
After getting metadata --- 1081mb heap used,  1130mb heap total, 1177mb rss
After getting metadata --- 1255mb heap used,  1307mb heap total, 1356mb rss
Test finished, about to run garbage collection --- 1255mb heap used,  1307mb heap total, 1356mb rss
After garbage collection --- 191mb heap used,  233mb heap total, 1268mb rss

Kikobeats commented 9 months ago

As I suggested in the previous issue, can you use https://github.com/davidmarkclements/0x?

Just console.log is not enough for profiling memory usage; we need to understand what is happening inside the Node internals.

DanDubinsky commented 9 months ago

Sure, here are two flame graphs. Looks like parsing.

[Screenshot: flame graph, 2024-02-14 8:57:17 PM]
[Screenshot: flame graph, 2024-02-14 8:57:56 PM]

DanDubinsky commented 9 months ago

Here's another hot section on the flame graph. If you would like to see the condition for yourself, I think you can just run that index.js file: save one file as package.json and the other as index.js, then run yarn and 0x -o index.js.

[Screenshot: flame graph, 2024-02-14 9:08:42 PM]

Thanks, Dan

DanDubinsky commented 9 months ago

I have a question about the package. For these rules, how much of the HTML do they need to examine? Is it just meta tags?

require('metascraper-description')(), require('metascraper-image')(), require('metascraper-title')(), require('metascraper-url')(),

I'm thinking maybe I can write a simple preprocessor to strip out all but meta tags. With smaller HTML, I'm hoping the parser will need a lot less memory. I'm starting to get a little desperate here: it seems that our end users have been entering a lot of links into our app that are triggering memory spikes and firing OOM errors, maybe 30 or 40 out-of-memory errors in just the last day or so.
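
Something like the sketch below is what I have in mind. It's untested, and it assumes the four rules above only need head-level tags (which would lose any body-based fallbacks):

// rough sketch of a preprocessor: keep only <title>, <meta> and <link> tags and hand
// metascraper a tiny document. stripToMetaTags is just a name made up for illustration.
const stripToMetaTags = html => {
   const tags = html.match(/<(?:title\b[^>]*>[\s\S]*?<\/title|meta\b[^>]*|link\b[^>]*)>/gi) || [];
   return `<html><head>${tags.join('')}</head><body></body></html>`;
};

// inside runTest, instead of passing the full HTML:
await metascraper({ url, html: stripToMetaTags(html) });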

[Screenshot: 2024-02-16 9:01:36 AM]

Kikobeats commented 9 months ago

Interesting proposal. It depends on the package, e.g.: https://github.com/microlinkhq/metascraper/blob/master/packages/metascraper-title/src/index.js
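
Roughly speaking, a rule bundle is an ordered list of cheerio queries per property, and for title most of them only read head-level tags, with a few body fallbacks near the end. This is a simplified sketch of that shape, not the actual source (the real package wraps the queries with shared helper functions):

// simplified sketch of the shape of metascraper-title rules (not the real implementation)
module.exports = () => ({
  title: [
    ({ htmlDom: $ }) => $('meta[property="og:title"]').attr('content'),
    ({ htmlDom: $ }) => $('meta[name="twitter:title"]').attr('content'),
    ({ htmlDom: $ }) => $('title').text(),
    // later fallbacks reach into the body, e.g. a heading
    ({ htmlDom: $ }) => $('h1').first().text()
  ]
})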

Removing DOM elements is clearly going to help, but it could produce less accurate results.

Maybe you can test if minifying HTML has any impact? This seems promising: https://github.com/wilsonzlin/minify-html

DanDubinsky commented 9 months ago

Here's an update. I wasn't able to use that minify-html lib; it gets errors on yarn add for some reason. I tried https://www.npmjs.com/package/html-minifier instead, and it fixes the memory issue, but the description, image, and title properties all come back null, so somehow the minified HTML is confusing metascraper.

Now I'm trying something at the OS level, and it seems promising so far. The containers run Debian Linux, and I was able to replace the default memory allocator with jemalloc. It seems better so far: I made the switch at 15:00, and there were lots of spikes and OOMs before, but no spikes for over an hour after. The RSS memory also seems to be going up and down now instead of only up. I'll let it sit like this for a few days, see how it behaves, and then report the findings back here in case anyone else has this issue.

[Screenshot: 2024-02-16 4:12:29 PM]

Kikobeats commented 9 months ago

Glad to see you're mastering it!

I'm still interested in exploring how we can make metascraper consume less memory, but it seems the memory issue should be fixed upstream in cheerio first: https://github.com/search?q=repo%3Acheeriojs%2Fcheerio+memory&type=issues

I'm going to see whether we can do something effective there. Thanks for digging deep into this.

Kikobeats commented 9 months ago

@DanDubinsky can you test if this helps to reduce memory consumption?


const { minify } = require('html-minifier-terser')

html = await minify(html.toString(), {
  collapseWhitespace: true,
  conservativeCollapse: true,
  continueOnParseError: true,
  removeComments: true,
  collapseBooleanAttributes: true,
  collapseInlineTagWhitespace: true,
  includeAutoGeneratedTags: true,
  keepClosingSlash: true,
  minifyCSS: true,
  minifyJS: true,
  noNewlinesBeforeTagClose: true,
  preserveLineBreaks: true,
})

await metascraper({ url, html });

DanDubinsky commented 9 months ago

Hey @Kikobeats,

The html-minifier-terser package helped with memory consumption to some degree (the heap especially, the RSS to a lesser degree), but it seems to have broken metascraper, because the metadata output is all null values:

Meta data {
  description: null,
  image: null,
  title: null,
  url: 'https://docs.google.com/document/d/1bGgGlc1YXSiR3cwbGDhuZUF7bz9djnTlV1qMP0-xhMM?usp=sharing'
}

Also, it looks like using the jemalloc allocator only partially fixed the issue. It seems to have helped a lot with the slow memory leak that was causing our servers to crash after about 3 days, even when users don't paste links to large files into our app; memory is now holding steady between 180 MB and 300 MB. But it hasn't helped with the massive memory spikes we get from individual large files, which cause the containers to crash at random times. We're still getting between 1 and 4 of those per day per container, depending on what the users post.

[Screenshot: 2024-02-18 9:58:33 AM]

We seem to be getting fewer of them, but that is most likely because our users aren't as active over the weekend.

[Screenshot: 2024-02-18 10:03:25 AM]

Next week I'm going to see if I can identify whether the files that cause the spikes have anything in common besides their size. We had this issue in the past with metascraper version 5.0.3, but we were able to work around it by skipping the scraping for HTML over 3 MB. The issue seems to be worse in the latest metascraper version, though: here we are skipping everything over 2 MB and it's still crashing. Maybe if the files spiking the memory have some other common attributes besides size, I can filter those out as well.
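
For reference, the guard I'm describing is just a small wrapper of our own around metascraper, roughly like this (a sketch; MAX_HTML_BYTES and safeScrape are names made up for illustration, not part of the library):

// our own wrapper around metascraper, skipping oversized documents to avoid the memory spikes
const MAX_HTML_BYTES = 2 * 1024 * 1024; // currently 2 MB

const safeScrape = async ({ url, html }) => {
   if (Buffer.byteLength(html, 'utf8') > MAX_HTML_BYTES) {
      return { url }; // skip scraping, return the bare URL
   }
   return metascraper({ url, html });
};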

Thanks, Dan