oscarotero / Embed

Get info from any web service or page
MIT License

Memory leak #525

Closed. nerg4l closed this issue 10 months ago.

nerg4l commented 10 months ago

I was trying to fetch oEmbed information in bulk, but I ran into multiple memory issues.

\HtmlParser\Parser

First, libxml_use_internal_errors(true) uses an internal buffer which can fill up when libxml_clear_errors is not called. The top-voted "User Contributed Note" on https://www.php.net/manual/en/function.libxml-use-internal-errors.php mentions this:

> When using this function, be sure to clear your internal error buffer. If you don't and you are using this in a long running process, you may find that all your memory is used up.

https://github.com/oscarotero/html-parser/blob/0c5b619bdc7ac061f06a667d913e2af708ee3231/src/Parser.php#L77-L87

I fixed that by parsing in batches and calling libxml_clear_errors between parses.
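For illustration, a minimal sketch of that workaround (the loop and the $htmlPages variable are hypothetical, not Embed's actual code):

libxml_use_internal_errors(true);

foreach ($htmlPages as $html) {
    $document = new \DOMDocument();
    $document->loadHTML($html);

    // ... extract data from $document ...

    // Without this call, libxml keeps appending to its internal
    // error buffer, which exhausts memory in long-running processes.
    libxml_clear_errors();
}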

\Embed\Http\CurlDispatcher

I wasn't able to completely verify the source of this leak. However, the error complains about stream_get_contents, and profiling shows that this method uses the most memory.

I replaced \Embed\Http\CurlDispatcher with \GuzzleHttp\Client and memory usage fell from 300+ MB to only 84 MB.
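For reference, the swap is just passing a different client to the Crawler (this is the same construction used in the benchmarks later in this thread; the Crawler accepts a PSR-18 client):

$embed = new Embed(new \Embed\Http\Crawler(new \GuzzleHttp\Client()));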

oscarotero commented 10 months ago

Hi @nerg4l, thanks for this!

Do you want to work on a PR to fix the Html parser?

And about CurlDispatcher, I'm not sure why this can happen. The $this->body property is a stream stored in memory: https://github.com/oscarotero/Embed/blob/master/src/Http/CurlDispatcher.php#L202

Maybe, after passing the stream to the response, it should be closed and cleared here? https://github.com/oscarotero/Embed/blob/master/src/Http/CurlDispatcher.php#L139
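A hypothetical sketch of that suggestion (not the actual CurlDispatcher code; as noted below, this alone did not fix the leak):

// Hypothetical: once the response owns the stream, drop the
// dispatcher's own reference so it can be garbage collected.
fclose($this->body);
$this->body = null;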

nerg4l commented 10 months ago

Hello,

I will open a PR later today to fix the parser.

About the other leak, I'm not sure if the issue is with the stream function. I tried to use fclose on the body attribute but it did not help. I will run more tests to pin down the exact cause.

nerg4l commented 10 months ago

I'm not sure if #527 solves the problem, but it definitely reduces the occurrence.

nerg4l commented 10 months ago

I just made this change: 51b1aea. Can you test the memory usage?

https://github.com/oscarotero/Embed/pull/527#issuecomment-1833831559

Let's continue the conversation here.

Performance after the recent changes:

$embed = new Embed();
for ($i = 0; $i < 300; $i++) {
    $embed->getMulti('https://oscarotero.com', 'https://oscarotero.com');
}
memory_get_peak_usage(); // 11951856 B
// Time: 01:01.863, Memory: 12.00 MB

With Guzzle client:

$embed = new Embed(new \Embed\Http\Crawler(new \GuzzleHttp\Client()));
for ($i = 0; $i < 300; $i++) {
    $embed->getMulti('https://oscarotero.com', 'https://oscarotero.com');
}
memory_get_peak_usage(); // 13529336 B
// Time: 00:15.833, Memory: 16.00 MB

Calling gc_collect_cycles() once in a while can reduce peak memory usage to about 8.00 MB.

Would you accept a PR that adds a getClientFactory method and falls back to CurlClient only when necessary? That time difference is quite big. In #527 I was testing Embed::get(); the above is for Embed::getMulti().
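A hypothetical sketch of that proposal (the name and logic are illustrative, not an existing Embed API):

// Hypothetical: prefer an installed PSR-18 client and only fall
// back to the bundled CurlClient when nothing else is available.
function getClientFactory(): \Psr\Http\Client\ClientInterface
{
    if (class_exists(\GuzzleHttp\Client::class)) {
        return new \GuzzleHttp\Client();
    }

    return new \Embed\Http\CurlClient();
}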

oscarotero commented 10 months ago

> Would you accept a PR that adds a getClientFactory method and falls back to CurlClient only when necessary? That time difference is quite big.

Okay, but I don't understand this difference. Guzzle uses cURL under the hood, so I can't see why this happens. Could it be that Guzzle has a cache to avoid performing the same query twice? In your tests, you're making all requests to the same URL. The Embed implementation consumes 4 MB less memory; the difference is only in time.

nerg4l commented 10 months ago

Tested with 300 YouTube links:

$embed = new Embed();
memory_get_peak_usage(); // 105359848
// Time: 04:26.202, Memory: 114.00 MB

$embed = new Embed(new \Embed\Http\Crawler(new \GuzzleHttp\Client()));
// PHP Fatal error:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 778240 bytes) in /.../oscarotero/Embed/vendor/oscarotero/html-parser/src/Parser.php on line 42

I had to add gc_collect_cycles() to avoid memory exhaustion with Guzzle. With the forced GC, the memory numbers below are less representative.

$embed = new Embed();
// Time: 04:56.609, Memory: 42.00 MB

$embed = new Embed(new \Embed\Http\Crawler(new \GuzzleHttp\Client()));
// Time: 03:08.485, Memory: 34.00 MB

It looks like you were partially correct, and now Guzzle's memory usage is the problematic one.

Code:

// Process the collected URLs in pairs with getMulti().
$n = count($this->links);
gc_collect_cycles();
for ($i = 0; $i < $n; $i += 2) {
    $embed->getMulti($this->links[$i], $this->links[$i + 1]);
    if ($i % 20 === 0) {
        gc_collect_cycles(); // only for the second (Guzzle) case
    }
}

oscarotero commented 10 months ago

Great, so I guess the default CurlDispatcher is good enough, right? Let me know if you want to do more tests before releasing a new version.

nerg4l commented 10 months ago

I think it's good to go. I will open a new issue if needed.