Closed by DanDubinsky 8 months ago.
As I suggested in the previous issue, can you use https://github.com/davidmarkclements/0x? Just `console.log` is not enough for profiling memory usage; we need to understand what is happening inside Node internals.
Sure, here are two flame graphs. Looks like parsing.
Here's another hot section on the flame graph. If you would like to see the condition for yourself, I think you can just run that index.js file: save one file as `package.json` and the other as `index.js`, then run `yarn` and `0x -o index.js`.
Thanks, Dan
I have a question about the package. For these rules, how much of the HTML do they need to examine? Is it just the meta tags?
```js
require('metascraper-description')(),
require('metascraper-image')(),
require('metascraper-title')(),
require('metascraper-url')(),
```
I'm thinking maybe I can write a simple preprocessor to strip out everything but the meta tags. With smaller HTML I'm hoping the parser will need a lot less memory. I'm starting to get a little desperate here: it seems that our end users have been entering a lot of links into our app that trigger memory spikes and fire OOM errors, maybe 30 or 40 out-of-memory errors in just the last day or so.
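Roughly the kind of thing I have in mind (just a sketch; the helper name, the regex, and the rebuilt skeleton are illustrative, and the regex approach is deliberate so the full document never has to be parsed):

```js
// Rough sketch of the preprocessor idea: keep only <meta> and <title> tags and
// discard everything else before metascraper ever sees the HTML.
const stripToMetaTags = html => {
  const kept = html.match(/<(?:meta|title)\b[^>]*>(?:[^<]*<\/title>)?/gi) || []
  return `<html><head>${kept.join('')}</head><body></body></html>`
}

// Same rule bundle as listed above, composed the usual way
const metascraper = require('metascraper')([
  require('metascraper-description')(),
  require('metascraper-image')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])

const getMetadata = ({ url, html }) => metascraper({ url, html: stripToMetaTags(html) })
```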
Interesting proposal. It depends on the package, e.g.: https://github.com/microlinkhq/metascraper/blob/master/packages/metascraper-title/src/index.js
Removing DOM elements is clearly going to help, but it could produce uncurated results.
Maybe you can test if minifying HTML has any impact? This seems promising: https://github.com/wilsonzlin/minify-html
Here's an update. I wasn't able to use that minify-html lib; it gets errors on `yarn add` for some reason. I tried this one instead, https://www.npmjs.com/package/html-minifier, and it fixes the memory issue, but it also returns all nulls for the description, image and title properties, so somehow the minified HTML is confusing the metascraper.
Now I'm trying something at the OS level and it seems promising so far. The containers run Debian Linux, and I was able to replace the default memory allocator with jemalloc. It seems better so far. I made the change at 15:00: there were lots of spikes and OOMs before, and no spikes for over an hour after. Also, the RSS memory now goes up and down instead of only up. I'll let it sit like this for a few days, see how it behaves, and then report the findings back here in case anyone else hits this issue.
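For reference, on Debian this is typically done by installing the libjemalloc2 package and pointing LD_PRELOAD at the jemalloc shared library (e.g. /usr/lib/x86_64-linux-gnu/libjemalloc.so.2 on amd64; the exact path depends on the architecture). A quick, illustrative way to confirm from inside Node that the preload actually took effect:

```js
// Sanity check (Linux only): scan the process memory maps to confirm that
// jemalloc has actually been loaded into the Node process.
const fs = require('fs')

const maps = fs.readFileSync('/proc/self/maps', 'utf8')
console.log(maps.includes('jemalloc') ? 'jemalloc is loaded' : 'jemalloc is NOT loaded')
```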
Glad to see you're mastering it!
I'm still interested in exploring how we can make metascraper consume less memory, but it seems the memory issue should be fixed upstream in cheerio first: https://github.com/search?q=repo%3Acheeriojs%2Fcheerio+memory&type=issues
I'm going to see if we can do something effective there. Thanks for going deep on this.
@DanDubinsky can you test if this helps to reduce memory consumption?
```js
const { minify } = require('html-minifier-terser')

// Minify the HTML before handing it to metascraper, to shrink what the parser has to process
html = await minify(html.toString(), {
  collapseWhitespace: true,
  conservativeCollapse: true,
  continueOnParseError: true,
  removeComments: true,
  collapseBooleanAttributes: true,
  collapseInlineTagWhitespace: true,
  includeAutoGeneratedTags: true,
  keepClosingSlash: true,
  minifyCSS: true,
  minifyJS: true,
  noNewlinesBeforeTagClose: true,
  preserveLineBreaks: true
})

await metascraper({ url, html })
```
Hey @Kikobeats,
The html-minifier-terser package helped with the memory consumption to some degree (heap especially, RSS somewhat), but it seems to have broken metascraper, because the output metadata is all null values:
```
Meta data {
  description: null,
  image: null,
  title: null,
  url: 'https://docs.google.com/document/d/1bGgGlc1YXSiR3cwbGDhuZUF7bz9djnTlV1qMP0-xhMM?usp=sharing'
}
```
Also, it looks like using the jemalloc allocator only partially fixed the issue. It seems to have helped a lot with the slow memory leak that was causing our servers to crash after about 3 days, even when users don't paste links to large files into our app; memory now holds steady between 180 MB and 300 MB. But it hasn't helped with the massive memory spikes we get from individual large files, which cause the containers to crash at random times. We're still getting between 1 and 4 of those per day per container, depending on what the users post.
We seem to be getting fewer of them, but that is most likely because our users aren't as active over the weekend.
Next week I'm going to see if I can work out whether the spike files have anything in common besides their size. We had this issue in the past with version 5.0.3 of metascraper, but we were able to work around it by skipping the scraping for HTML over 3 MB. The issue seems to be worse in the latest metascraper version: here we are skipping all files over 2 MB and it's still crashing. Maybe if the files spiking the memory have some other common attribute besides size, I can filter those out as well.
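For context, the size guard is roughly the following (a sketch; the threshold constant and helper name are illustrative, and `metascraper` is the composed rule bundle from earlier in the thread):

```js
// Rough sketch of the size guard: skip scraping entirely when the fetched HTML
// exceeds a threshold. The 2 MB limit and helper name are illustrative.
const MAX_HTML_BYTES = 2 * 1024 * 1024

const scrapeIfSmallEnough = async ({ url, html }) => {
  if (Buffer.byteLength(html, 'utf8') > MAX_HTML_BYTES) return null // too big, skip it
  return metascraper({ url, html })
}
```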
Thanks, Dan
Prerequisites
package.json (attached). Node 18.14.0.
Subject of the issue
High memory usage with some files, particularly RSS memory.
Steps to reproduce
Save the attached package.json and index.js, install dependencies, and run index.js.
Expected behaviour
I would expect the memory usage to come down after the test finished and the garbage collector ran.
Actual behaviour
When scraping the URL 7 times in a loop, it used over 1 GB of RSS memory. Even after garbage collection, most of it was not freed. Heap usage was also a little on the high side after garbage collection.
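A minimal sketch of the kind of loop the reproduction runs (the real index.js was shared earlier in the thread; the URL, iteration count, and logging below are illustrative):

```js
// Minimal sketch of the reproduction: scrape the same URL several times in a
// loop and log memory before and after a forced GC (run with node --expose-gc).
// The URL and rule bundle mirror the ones discussed above; details are illustrative.
const got = require('got') // got v11-style require; newer versions are ESM-only
const metascraper = require('metascraper')([
  require('metascraper-description')(),
  require('metascraper-image')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])

const url = 'https://docs.google.com/document/d/1bGgGlc1YXSiR3cwbGDhuZUF7bz9djnTlV1qMP0-xhMM?usp=sharing'

const logMemory = label => {
  const { rss, heapUsed } = process.memoryUsage()
  console.log(label, { rssMB: Math.round(rss / 1e6), heapUsedMB: Math.round(heapUsed / 1e6) })
}

const main = async () => {
  for (let i = 0; i < 7; i++) {
    const { body: html } = await got(url)
    await metascraper({ url, html })
    logMemory(`after scrape ${i + 1}`)
  }
  if (global.gc) global.gc() // only available when started with --expose-gc
  logMemory('after gc')
}

main()
```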