remotion-dev / remotion

Lambda invocation stall detection too early, can exhaust retries #3966

Closed atticoos closed 3 weeks ago

atticoos commented 3 weeks ago

This ticket is the result of testing https://github.com/remotion-dev/remotion/pull/3963

Bug Report 🐛

Issue: Lambdas are marked as stalled too eagerly, which can result in exhausting all retries, and can end up spawning zombie renderers.

As a result, two Lambda renderers may eventually spawn, but by then they have become detached from the launcher because they were already considered stalled.

This results in an irrecoverable render, as no more attempts may be made.

Short term recommendations

Long term recommendations

We may want to keep the invocations "in-band": we can attempt a new invocation, but if an earlier one succeeds later, we can reclaim its stream and use it.
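To make the difference concrete, here is a rough, deterministic model of the two policies. It is only a sketch, not Remotion code: `detachOnStall`, `inBand`, `startLatenciesMs` and `stallTimeoutMs` are hypothetical names, and it assumes attempt `i` is invoked at `i * stallTimeoutMs` and that a renderer "starts" once its stream begins responding.

```typescript
// Hypothetical model of stall handling; none of these names are Remotion APIs.
type Outcome =
  | { status: "started"; attempt: number; atMs: number }
  | { status: "failed"; zombies: number };

// Current behavior as described in this issue: an attempt that has not started
// within the stall timeout is abandoned, and a fresh attempt is invoked.
function detachOnStall(startLatenciesMs: number[], stallTimeoutMs: number): Outcome {
  for (let i = 0; i < startLatenciesMs.length; i++) {
    if (startLatenciesMs[i] <= stallTimeoutMs) {
      return { status: "started", attempt: i, atMs: i * stallTimeoutMs + startLatenciesMs[i] };
    }
  }
  // All retries are exhausted. Attempts that do start later are detached zombies.
  const zombies = startLatenciesMs.filter((ms) => isFinite(ms)).length;
  return { status: "failed", zombies };
}

// "In-band" alternative: earlier attempts stay attached, and whichever attempt
// starts first is used. An overall deadline is omitted to keep the sketch short.
function inBand(startLatenciesMs: number[], stallTimeoutMs: number): Outcome {
  let best: { attempt: number; atMs: number } | null = null;
  for (let i = 0; i < startLatenciesMs.length; i++) {
    const invokedAtMs = i * stallTimeoutMs;
    // An earlier attempt already started, so no further invocation is needed.
    if (best !== null && best.atMs <= invokedAtMs) break;
    const atMs = invokedAtMs + startLatenciesMs[i];
    if (best === null || atMs < best.atMs) best = { attempt: i, atMs };
  }
  if (best === null || !isFinite(best.atMs)) return { status: "failed", zombies: 0 };
  return { status: "started", attempt: best.attempt, atMs: best.atMs };
}

// Two attempts that each need more than the 5s stall timeout to start.
console.log(detachOnStall([7000, 8000], 5000)); // { status: 'failed', zombies: 2 }
console.log(inBand([7000, 8000], 5000)); // { status: 'started', attempt: 0, atMs: 7000 }
```

In this model the in-band policy reclaims the first attempt once it starts at 7s. The second invocation is then redundant and would still need to be cancelled or ignored, which the sketch does not cover.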

It may also be worth considering that this solution can result in detached/zombie rendering Lambdas that may still be writing to progress.json in the S3 bucket, leading to race conditions and corrupted information (unless this is all handled by the launcher Lambda and response streams; I'm less knowledgeable here).