What feature do you want to see added?
Currently, a user job hangs when its ECS task is stopped mid-flight, e.g. when a Fargate task is killed due to an OOM error. In a durable pipeline, for example, the controller waits for the offline agent to recover, which is impossible once the ECS task has stopped, and the wait only ends when the step times out.
A more efficient approach is to terminate the node when the ECS task is "dead", i.e. stopped or being stopped. The pipeline can then determine that the agent is not recoverable and abort the execution before the timeout.
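
To make the proposal concrete, here is a minimal sketch of the "dead task" check. It is not the plugin's code: the class `EcsTaskLiveness` and the methods `isDead` and `deadNodes` are hypothetical names. The sketch takes the task's `lastStatus` string as input and does not show how the status is fetched or how the node is terminated. It treats every ECS lifecycle state at or after the start of shutdown as dead, and treats a missing status as unknown rather than dead, which is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the plugin: decides which agents are unrecoverable.
public class EcsTaskLiveness {
    // ECS task lifecycle states from which a task does not return to RUNNING.
    private static final Set<String> DEAD_STATUSES =
            Set.of("DEACTIVATING", "STOPPING", "DEPROVISIONING", "STOPPED", "DELETED");

    // True when the task is stopped or being stopped.
    // Assumption: a null (unknown) status is not treated as dead.
    public static boolean isDead(String lastStatus) {
        return lastStatus != null && DEAD_STATUSES.contains(lastStatus);
    }

    // Given node name -> last known task status, return the nodes to terminate,
    // sorted by name so the result is deterministic.
    public static List<String> deadNodes(Map<String, String> statusByNode) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, String> entry : statusByNode.entrySet()) {
            if (isDead(entry.getValue())) {
                dead.add(entry.getKey());
            }
        }
        Collections.sort(dead);
        return dead;
    }
}
```

A periodic check could call `deadNodes` with the latest statuses and terminate each returned node, so that the pipeline fails fast instead of waiting for the step timeout.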
A similar feature has been implemented in the Kubernetes plugin: https://issues.jenkins.io/browse/JENKINS-59340
Upstream changes
No response