This ticket is the result of testing https://github.com/remotion-dev/remotion/pull/3963
Bug Report 🐛
Issue: Lambdas are marked as stalled too eagerly, which can exhaust all retries and end up spawning zombie renderers.
Lambda invocations may take more than 7 seconds, but are marked as stalled as soon as that window elapses: https://github.com/remotion-dev/remotion/blob/81bed4fa0e81ecd8c9d700ab08caa5825c001ea0/packages/lambda/src/shared/call-lambda.ts#L115
We have 1 opportunity to retry; however, the next invocation is also likely to fall outside the 7-second window, after which it can no longer be retried or stream back to the launcher: https://github.com/remotion-dev/remotion/blob/81bed4fa0e81ecd8c9d700ab08caa5825c001ea0/packages/lambda/src/functions/helpers/stream-renderer.ts#L127
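For illustration, here's a rough TypeScript sketch (not Remotion's actual implementation) of how a fixed stall window combined with a single retry can strand both invocations when the renderer routinely needs longer than the window to emit its first byte:

```ts
// Illustrative sketch only -- not Remotion's actual implementation.
const STALL_TIMEOUT_MS = 7_000; // the eager 7-second window

async function invokeWithStallDetection(
  invoke: () => Promise<ReadableStream>, // stand-in for the real invoker
  retriesRemaining: number,
): Promise<ReadableStream> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const stalled = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('stalled')), STALL_TIMEOUT_MS);
  });
  try {
    // If no stream arrives within the window, we give up on this
    // invocation -- even though it may still be running remotely.
    return await Promise.race([invoke(), stalled]);
  } catch (err) {
    if (retriesRemaining <= 0) {
      // Both invocations are now detached "zombies"; the render is lost.
      throw new Error('Renderer stalled and no retries remain');
    }
    // The retry is just as likely to exceed the same fixed window
    // (e.g. due to a cold start), so the single retry tends to be
    // exhausted the same way.
    return invokeWithStallDetection(invoke, retriesRemaining - 1);
  } finally {
    clearTimeout(timer);
  }
}
```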
As a result, 2 lambda renderers may eventually spawn, but both have become detached from the launcher because they're considered stalled.
This leaves the render irrecoverable, as no further attempts may be made.
Short-term recommendations
- Let's increase the timeout considerably, perhaps 30 seconds?
- Let's increase `retriesRemaining`, or base it off of `payload.maxRetries` to allow for user configuration? (A sketch follows this list.)
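As a hypothetical sketch of the second bullet: derive the stall window and retry budget from the payload instead of hard-coding them. `stallTimeoutInMilliseconds` is an invented field shown only to illustrate user configuration, not an existing option:

```ts
// Hypothetical sketch: make the stall window and retry budget
// configurable instead of hard-coded.
type RenderPayload = {
  maxRetries: number; // the per-chunk retry count users already set
  stallTimeoutInMilliseconds?: number; // hypothetical new option
};

const DEFAULT_STALL_TIMEOUT_MS = 30_000; // the proposed 30-second default

const makeStallConfig = (payload: RenderPayload) => ({
  stallTimeoutMs:
    payload.stallTimeoutInMilliseconds ?? DEFAULT_STALL_TIMEOUT_MS,
  retriesRemaining: payload.maxRetries, // instead of a hard-coded 1
});
```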
Long-term recommendations
We may want to keep the invocations "in-band", where we can attempt a new invocation, but if an earlier one succeeds later, we can reclaim its stream and use it.
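A minimal sketch of this idea, assuming a generic `invokeRenderer` helper (a stand-in, not the real API): keep the first invocation's promise alive, spawn a backup after the stall window, and adopt whichever streams first.

```ts
// Illustrative sketch of keeping invocations "in-band": instead of
// abandoning a slow invocation, race it against a backup and adopt
// whichever one streams first.
async function hedgedInvoke(
  invokeRenderer: () => Promise<ReadableStream>, // stand-in helper
  stallTimeoutMs: number,
): Promise<ReadableStream> {
  const first = invokeRenderer();

  // Give the first invocation a head start equal to the stall window.
  const headStart = new Promise<null>((resolve) =>
    setTimeout(() => resolve(null), stallTimeoutMs),
  );
  const winner = await Promise.race([first.catch(() => null), headStart]);
  if (winner !== null) {
    return winner; // first invocation streamed within the window
  }

  // The first invocation looks stalled (or failed). Spawn a backup, but
  // keep the original promise so its stream can still be reclaimed if it
  // ends up succeeding before the backup does.
  const second = invokeRenderer();
  return Promise.any([first, second]);
}
```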
It may also be worth considering that this solution can result in detached/zombie rendering lambdas that may still be writing to `progress.json` in the S3 bucket, leading to race conditions & corrupted information (unless this is all handled by the launcher lambda & response streams -- less knowledgeable here)
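Not authoritative (the launcher/response-stream internals may already cover this), but one possible mitigation sketch: have the launcher issue a unique attempt id per invocation and scope progress writes to it, so a zombie renderer can't clobber the live attempt's `progress.json`. The key layout and wiring below are assumptions, not Remotion's actual bucket structure:

```ts
import { PutObjectCommand, S3Client } from '@aws-sdk/client-s3';

const s3 = new S3Client({});

// `attemptId` is a hypothetical unique id issued by the launcher per
// invocation.
async function writeProgress(
  bucket: string,
  renderId: string,
  attemptId: string,
  progress: Record<string, unknown>,
): Promise<void> {
  await s3.send(
    new PutObjectCommand({
      Bucket: bucket,
      // The launcher polls only the attempt it currently considers live,
      // so a zombie renderer's writes land under a different key.
      Key: `renders/${renderId}/attempts/${attemptId}/progress.json`,
      Body: JSON.stringify(progress),
      ContentType: 'application/json',
    }),
  );
}
```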