Lambda execution randomly times out

orkhanahmadov commented 10 months ago

We never had this issue with v1, but since upgrading to v2, the following exception is thrown randomly when trying to generate a PDF:

Lambda Execution Exception for Wnx\SidecarBrowsershot\Functions\BrowsershotFunction: Task timed out after 300.11 seconds.

I saw another reported issue in #100, but that one seems to be related to a protocol timeout, whereas this one is related to Lambda's 30-second timeout. Also, unlike #100, we don't try to generate huge PDFs. It is a single-page PDF, and it doesn't always fail; the failures are completely random. It fails once, and then succeeds when we try again.

Any clues? Is there anything in v2 that could make the Lambda take substantially longer to execute?

stefanzweifel commented 10 months ago

@orkhanahmadov From which version did you upgrade? From v1.13.1? Or from something prior?

In this compare view, you can see that between v1.13.1 and v2.0.0 basically nothing changed besides the new spatie/browsershot:v4.0 requirement and the change to the config file.

I haven't yet had the time to upgrade my own apps to the latest version, so I can't speak from experience. Weird that our test suite in the package, which runs on AWS, is green. 🤔

orkhanahmadov commented 8 months ago

@stefanzweifel ok, turns out it is not related to v2, I guess.

Recently when the timeout happened again, we noticed a CloudWatch log with the following contents:

[Screenshot of the CloudWatch log: CleanShot 2024-02-15 at 17 26 12@2x]

Is this useful or related?

stefanzweifel commented 8 months ago

Thanks for the update @orkhanahmadov. Seems related to a cleanup bit of code: https://github.com/stefanzweifel/sidecar-browsershot/blob/081c3918ad3634146bc0ab2c565d828dd143b518/resources/lambda/browsershot.js#L81

Will create a PR with a fix soonish.

stefanzweifel commented 8 months ago

@orkhanahmadov Would you be able to share some code snippets for this issue? I'm not able to replicate your timeout issue on my machine or in my production apps.

I assume you don't run any JavaScript in your Blade views? I know this doesn't sound like an ideal solution, but have you tried increasing the timeout of your function? Does the error still occur?

Maybe this is related to the underlying Puppeteer versions. Will work on upgrading the underlying layer to the latest puppeteer-version.

orkhanahmadov commented 8 months ago

@stefanzweifel This is a weird issue that happens completely randomly. We have no clue what exactly causes it or how to reproduce it, because when the timeout happens and we try generating the same PDF one more time, everything works... The last time the timeout happened, the only clue we got was that CloudWatch log.

Initially, the timeout was 30 seconds. We tried increasing it to 300 seconds, but it still didn't help. When this timeout happens, the Lambda gets "stuck", and no timeout value is long enough.

What we did as a workaround:

- Lowered the browsershot timeout to 30 seconds. When everything works, we usually get the PDF back in < 5 seconds.
- Added a retry loop of up to 5 attempts, triggered whenever LambdaExecutionException happens.

use Hammerstone\Sidecar\Exceptions\LambdaExecutionException;

private int $retry = 0;

public function render(): string
{
    try {
        return $this->browser
            ->setHtml($this->html)
            ->format($this->format)
            ->margins(...$this->margins)
            ->pdf();
    } catch (LambdaExecutionException $exception) {
        if ($this->retry < 4) { // 5 times in total, including the first attempt
            $this->retry++;

            return $this->render();
        }

        throw $exception;
    }
}

I suspect it may be related to the underlying library... The CloudWatch log says the deprecation warning is related to fs.rmdir, not fs.rmdirSync, which is what this package is using. Maybe we can try node --trace-deprecation on the layer. There are some reported and closed issues on the puppeteer repository: https://github.com/puppeteer/puppeteer/issues?q=is%3Aissue+rmdir

I also found similar reported issues related to puppeteer-extra, but I believe that library is not being used here.
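If changing the Node flags on the layer turns out to be awkward, a similar stack trace can probably be captured from inside the handler by listening for Node's 'warning' process event. This is only a rough sketch: describeWarning is a hypothetical helper name, and nothing here is taken from browsershot.js.

```javascript
// Sketch only: log where Node.js deprecation warnings originate.
// `describeWarning` is a hypothetical helper, not part of sidecar-browsershot.
function describeWarning(warning) {
  // Warnings are Error-like objects with `name`, `message` and `stack`;
  // documented deprecations also carry a `code` (for example "DEP0147").
  const code = warning.code ? ` [${warning.code}]` : '';
  return `${warning.name}${code}: ${warning.message}\n${warning.stack}`;
}

// 'warning' is a standard process event, emitted for every process warning.
process.on('warning', (warning) => {
  if (warning.name === 'DeprecationWarning') {
    // console.warn writes to stderr, which Lambda forwards to CloudWatch.
    console.warn(describeWarning(warning));
  }
});
```

The stack in the logged output should point at whichever module actually calls the deprecated function, which would tell us whether it is this package or a dependency.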

stefanzweifel commented 8 months ago

@orkhanahmadov Thanks! Lowering the timeout makes total sense here. Better to fail fast than to wait forever and incur unnecessary cost.

Will do some research. 🤓

stefanzweifel commented 8 months ago

As you might have seen, my attempt (#112) at updating our internal code to no longer use fs.rmdir was not successful. I still have to figure out what the root issue is.
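For reference, and assuming the warning in the log is Node's deprecation of the recursive option on fs.rmdir (DEP0147), the replacement Node documents is fs.rm / fs.rmSync. The sketch below is not the code from #112 or from browsershot.js; removeDirectory and the scratch directory prefix are hypothetical names used for illustration.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper, not taken from browsershot.js or #112.
function removeDirectory(dir) {
  // fs.rmSync (Node >= 14.14) is the documented replacement for calling
  // fs.rmdir / fs.rmdirSync with `recursive: true`.
  // `force: true` means a missing path does not throw.
  fs.rmSync(dir, { recursive: true, force: true });
}

// Usage: create a scratch directory containing a file, then remove it.
const scratch = fs.mkdtempSync(path.join(os.tmpdir(), 'cleanup-sketch-'));
fs.writeFileSync(path.join(scratch, 'page.html'), '<p>test</p>');
removeDirectory(scratch);
console.log(fs.existsSync(scratch)); // false
```

Since the warning mentions fs.rmdir rather than fs.rmdirSync, a change like this in our own code may not silence it if the call comes from a dependency.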

In the meantime, I've updated an underlying layer to use the latest puppeteer-core version and updated this package to also use an updated Chromium version: https://github.com/stefanzweifel/sidecar-browsershot/releases/tag/v2.1.0

Can't guarantee that this will solve the issues.

EvertonNeri commented 1 week ago

Hello @orkhanahmadov, did you find a solution to this problem?

This is happening to me too.

For the same site, it sometimes times out after 300s and sometimes fails with net::ERR_TUNNEL_CONNECTION_FAILED.

This happens with all sites. If I take a screenshot several times, many of the attempts fail with this problem.

Note: I set it to always try 5 times, and it often failed all 5 times.

My setup:

- "wnx/sidecar-browsershot": "^2.3"
- sidecar-browsershot-layer: 2
- chrome-aws-lambda: 42
- AWS Lambda region: us-east-1 (N. Virginia)

stefanzweifel / sidecar-browsershot

Lambda execution randomly times out #110