This ticket is the result of testing https://github.com/remotion-dev/remotion/pull/3963
Bug Report 🐛
Issue: Lambdas are marked as stalled too eagerly, which can exhaust all retries and end up spawning zombie renderers.
Lambda invocations may take more than 7 seconds, but are marked as stalled as soon as that window elapses: https://github.com/remotion-dev/remotion/blob/81bed4fa0e81ecd8c9d700ab08caa5825c001ea0/packages/lambda/src/shared/call-lambda.ts#L115
We have 1 opportunity to retry; however, the next invocation is also likely to fall outside the 7-second window, after which it can no longer be retried or stream back to the launcher: https://github.com/remotion-dev/remotion/blob/81bed4fa0e81ecd8c9d700ab08caa5825c001ea0/packages/lambda/src/functions/helpers/stream-renderer.ts#L127
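For illustration, here's a rough TypeScript sketch (not Remotion's actual implementation) of how a fixed stall window combined with a single retry can strand both invocations when the renderer routinely needs longer than the window to emit its first byte:

```ts
// Illustrative sketch only -- not Remotion's actual implementation.
const STALL_TIMEOUT_MS = 7_000; // the eager 7-second window

async function invokeWithStallDetection(
  invoke: () => Promise<ReadableStream>, // stand-in for the real invoker
  retriesRemaining: number,
): Promise<ReadableStream> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const stalled = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('stalled')), STALL_TIMEOUT_MS);
  });
  try {
    // If no stream arrives within the window, we give up on this
    // invocation -- even though it may still be running remotely.
    return await Promise.race([invoke(), stalled]);
  } catch (err) {
    if (retriesRemaining <= 0) {
      // Both invocations are now detached "zombies"; the render is lost.
      throw new Error('Renderer stalled and no retries remain');
    }
    // The retry is just as likely to exceed the same fixed window
    // (e.g. due to a cold start), so the single retry tends to be
    // exhausted the same way.
    return invokeWithStallDetection(invoke, retriesRemaining - 1);
  } finally {
    clearTimeout(timer);
  }
}
```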
As a result, 2 lambda renderers may eventually spawn, but both have become detached from the launcher because they're considered stalled.
This leaves the render irrecoverable, as no further attempts may be made.
Short-term recommendations
- Let's increase the timeout considerably, perhaps 30 seconds?
- Let's increase `retriesRemaining`, or base it off of `payload.maxRetries` to allow for user configuration? (A sketch follows this list.)
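As a hypothetical sketch of the second bullet: derive the stall window and retry budget from the payload instead of hard-coding them. `stallTimeoutInMilliseconds` is an invented field shown only to illustrate user configuration, not an existing option:

```ts
// Hypothetical sketch: make the stall window and retry budget
// configurable instead of hard-coded.
type RenderPayload = {
  maxRetries: number; // the per-chunk retry count users already set
  stallTimeoutInMilliseconds?: number; // hypothetical new option
};

const DEFAULT_STALL_TIMEOUT_MS = 30_000; // the proposed 30-second default

const makeStallConfig = (payload: RenderPayload) => ({
  stallTimeoutMs:
    payload.stallTimeoutInMilliseconds ?? DEFAULT_STALL_TIMEOUT_MS,
  retriesRemaining: payload.maxRetries, // instead of a hard-coded 1
});
```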
Long-term recommendations
We may want to keep the invocations "in-band", where we can attempt a new invocation, but if an earlier one succeeds later, we can reclaim its stream and use it.
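A minimal sketch of this idea, assuming a generic `invokeRenderer` helper (a stand-in, not the real API): keep the first invocation's promise alive, spawn a backup after the stall window, and adopt whichever streams first.

```ts
// Illustrative sketch of keeping invocations "in-band": instead of
// abandoning a slow invocation, race it against a backup and adopt
// whichever one streams first.
async function hedgedInvoke(
  invokeRenderer: () => Promise<ReadableStream>, // stand-in helper
  stallTimeoutMs: number,
): Promise<ReadableStream> {
  const first = invokeRenderer();

  // Give the first invocation a head start equal to the stall window.
  const headStart = new Promise<null>((resolve) =>
    setTimeout(() => resolve(null), stallTimeoutMs),
  );
  const winner = await Promise.race([first.catch(() => null), headStart]);
  if (winner !== null) {
    return winner; // first invocation streamed within the window
  }

  // The first invocation looks stalled (or failed). Spawn a backup, but
  // keep the original promise so its stream can still be reclaimed if it
  // ends up succeeding before the backup does.
  const second = invokeRenderer();
  return Promise.any([first, second]);
}
```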
It may also be worth considering that this solution can result in detached/zombie rendering lambdas that may still be writing to `progress.json` in the S3 bucket, leading to race conditions & corrupted information (unless this is all handled by the launcher lambda & response streams -- less knowledgeable here)
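Not authoritative (the launcher/response-stream internals may already cover this), but one possible mitigation sketch: have the launcher issue a unique attempt id per invocation and scope progress writes to it, so a zombie renderer can't clobber the live attempt's `progress.json`. The key layout and wiring below are assumptions, not Remotion's actual bucket structure:

```ts
import { PutObjectCommand, S3Client } from '@aws-sdk/client-s3';

const s3 = new S3Client({});

// `attemptId` is a hypothetical unique id issued by the launcher per
// invocation.
async function writeProgress(
  bucket: string,
  renderId: string,
  attemptId: string,
  progress: Record<string, unknown>,
): Promise<void> {
  await s3.send(
    new PutObjectCommand({
      Bucket: bucket,
      // The launcher polls only the attempt it currently considers live,
      // so a zombie renderer's writes land under a different key.
      Key: `renders/${renderId}/attempts/${attemptId}/progress.json`,
      Body: JSON.stringify(progress),
      ContentType: 'application/json',
    }),
  );
}
```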