Closed nerg4l closed 10 months ago
Hi @nerg4l, thanks for this!
Do you want to work on a PR to fix the Html parser?
And about CurlDispatcher, I'm not sure why this can happen. The $this->body
property is a stream stored in memory:
https://github.com/oscarotero/Embed/blob/master/src/Http/CurlDispatcher.php#L202
Maybe, after passing the stream to the response, it should be closed and cleared here? https://github.com/oscarotero/Embed/blob/master/src/Http/CurlDispatcher.php#L139
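For illustration, a minimal sketch of why this could matter (assuming the body is buffered in a php://temp stream as in the linked code; the variable names here are illustrative, not CurlDispatcher's actual internals): an in-memory stream keeps its buffer allocated until it is closed and dereferenced.

```php
<?php
// Sketch: a php://temp stream holds its buffered bytes in memory until
// fclose() is called and the last reference is dropped. Names are
// illustrative, not the actual CurlDispatcher internals.

$before = memory_get_usage();

$body = fopen('php://temp', 'w+');           // in-memory below the 2 MB threshold
fwrite($body, str_repeat('x', 1024 * 1024)); // buffer 1 MB

$afterWrite = memory_get_usage();

// Hand the contents to the "response", then release the buffer.
rewind($body);
$contents = stream_get_contents($body);
fclose($body);   // free the temp stream's buffer
$body = null;    // drop the reference
unset($contents);

$afterClose = memory_get_usage();

printf("held: %d bytes, reclaimed: %s\n",
    $afterWrite - $before,
    ($afterClose < $afterWrite) ? 'yes' : 'no');
```

If the stream is never closed, the buffered body stays in memory for the lifetime of the dispatcher object.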
Hello,
I will open a PR later today to fix the parser.
About the other leak, I'm not sure if the issue is with the stream function. I tried calling fclose
on the body attribute, but it did not help. I will run more tests to pin down the exact cause.
I'm not sure if #527 solves the problem, but it definitely reduces the occurrence.
I just made this change: 51b1aea. Can you test the memory usage?
https://github.com/oscarotero/Embed/pull/527#issuecomment-1833831559
Let's continue the conversation here.
Performance after the recent changes:
$embed = new Embed();
for ($i = 0; $i < 300; $i++) {
$embed->getMulti('https://oscarotero.com', 'https://oscarotero.com');
}
memory_get_peak_usage() // 11951856 B
// Time: 01:01.863, Memory: 12.00 MB
With Guzzle client:
$embed = new Embed(new \Embed\Http\Crawler(new \GuzzleHttp\Client()));
for ($i = 0; $i < 300; $i++) {
$embed->getMulti('https://oscarotero.com', 'https://oscarotero.com');
}
memory_get_peak_usage() // 13529336 B
// Time: 00:15.833, Memory: 16.00 MB
Calling gc_collect_cycles() once in a while can reduce peak memory usage to about 8.00 MB.
Would you accept a PR that adds a getClientFactory method and only falls back to CurlClient when necessary? That time difference is quite huge. In #527 I was testing Embed::get(); the numbers above are for Embed::getMulti().
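A rough sketch of what such a factory could look like (all names here are illustrative, not the actual Embed API): the Crawler asks an optional factory for a client and only constructs the bundled cURL client when none is provided.

```php
<?php
// Illustrative sketch of the proposed extension point -- not the real
// Embed API. A Crawler asks an optional factory for a client and only
// falls back to the bundled cURL client when necessary.

interface ClientFactory
{
    /** Return a client instance, or null to use the default. */
    public function createClient(): ?object;
}

final class CurlClient {} // stand-in for the bundled cURL client

final class Crawler
{
    public function __construct(private ?ClientFactory $factory = null) {}

    public function getClient(): object
    {
        // Fall back to CurlClient only when no factory supplies a client.
        return $this->factory?->createClient() ?? new CurlClient();
    }
}

// Without a factory, the default cURL client is used.
var_dump((new Crawler())->getClient() instanceof CurlClient); // bool(true)

// With a factory, its client is used instead.
$custom = new class implements ClientFactory {
    public function createClient(): ?object { return new \stdClass(); }
};
var_dump((new Crawler($custom))->getClient() instanceof CurlClient); // bool(false)
```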
> Would you accept a PR that adds a getClientFactory method and only falls back to CurlClient when necessary? That time difference is quite huge.
Okay, but I can't understand this difference. Guzzle uses cURL under the hood, so I don't see why this happens. Could it be that Guzzle has a cache to avoid performing the same query twice? In your tests, you're making all requests to the same URL. The Embed implementation consumes 4 MB less memory; the difference is only in the time.
Tested with 300 YouTube links:
$embed = new Embed();
memory_get_peak_usage(); // 105359848
// Time: 04:26.202, Memory: 114.00 MB
$embed = new Embed(new \Embed\Http\Crawler(new \GuzzleHttp\Client()));
// PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 778240 bytes) in /.../oscarotero/Embed/vendor/oscarotero/html-parser/src/Parser.php on line 42
I had to add gc_collect_cycles() to avoid memory exhaustion with Guzzle. The memory figures are less meaningful here because of the forced GC.
$embed = new Embed();
// Time: 04:56.609, Memory: 42.00 MB
$embed = new Embed(new \Embed\Http\Crawler(new \GuzzleHttp\Client()));
// Time: 03:08.485, Memory: 34.00 MB
It looks like you were partially correct, and now Guzzle's memory usage is the problematic one.
Code:
$n = count($this->links);
gc_collect_cycles();
for ($i = 0; $i < $n; $i += 2) {
$embed->getMulti($this->links[$i], $this->links[$i+1]);
if ($i % 20 == 0) {
gc_collect_cycles(); // only for second case
}
}
Great, so I guess the default CurlDispatcher is good enough, right? Let me know if you want to do more tests before releasing a new version.
I think it's good to go. I will open a new issue if needed.
I was trying to fetch oEmbed information in bulk, but I ran into multiple memory issues.
\HtmlParser\Parser
First, libxml_use_internal_errors(true) uses an internal buffer which can fill up when libxml_clear_errors is not called. The most voted "User Contributed Note" on https://www.php.net/manual/en/function.libxml-use-internal-errors.php mentions this.
https://github.com/oscarotero/html-parser/blob/0c5b619bdc7ac061f06a667d913e2af708ee3231/src/Parser.php#L77-L87
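A minimal reproduction of the accumulation (assuming malformed HTML that triggers parser errors):

```php
<?php
// With libxml_use_internal_errors(true), every parse of malformed HTML
// appends to an internal error list that keeps growing until
// libxml_clear_errors() is called.

libxml_use_internal_errors(true);

$broken = '<p></b></p>'; // stray end tag triggers a parser error

for ($i = 0; $i < 3; $i++) {
    $doc = new DOMDocument();
    $doc->loadHTML($broken);
}

$accumulated = count(libxml_get_errors()); // errors from all three parses

libxml_clear_errors(); // releases the buffered errors
$afterClear = count(libxml_get_errors());

printf("accumulated: %d, after clear: %d\n", $accumulated, $afterClear);
```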
I fixed that by parsing in batches and calling libxml_clear_errors between each parse.
\Embed\Http\CurlDispatcher
I wasn't able to completely verify the source of this leak. However, the error complains about stream_get_contents, and profiling shows that this method uses the most memory. I replaced \Embed\Http\CurlDispatcher with \GuzzleHttp\Client and memory usage fell from 300+ MB to only 84 MB.