Terminate node if the ECS task has stopped to fast fail execution

wenduwan commented 1 year ago

What feature do you want to see added?

Currently, user job execution hangs when the ECS task is stopped mid-flight, e.g. when a Fargate task is killed due to an OOM error. In a durable pipeline, for example, the controller waits for the offline agent to recover until the step times out, even though recovery is impossible once the ECS task has stopped.

A more efficient approach is to terminate the node if the ECS task is "dead", i.e. it has stopped or is being stopped. This lets the pipeline determine that the agent is not recoverable and abort the execution before the timeout.
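To illustrate the idea, here is a minimal sketch of the liveness check, not the plugin's actual code. The class `TaskLiveness`, its methods, and the `terminateNode` callback are hypothetical names; the status strings are ECS task lifecycle states, and treating an unknown (null) status as "not dead" is an assumption made to avoid terminating a node on missing data.

```java
import java.util.Set;

// Hypothetical helper: decides whether an ECS task is "dead" from its lastStatus.
public class TaskLiveness {
    // ECS lifecycle states a task passes through once it is being stopped or has stopped.
    private static final Set<String> DEAD_STATUSES =
            Set.of("DEACTIVATING", "STOPPING", "DEPROVISIONING", "STOPPED", "DELETED");

    // Returns true only when the status is known and indicates a stopping/stopped task.
    // Assumption: a null status (e.g. the lookup failed) is treated as "unknown", not dead.
    static boolean isDead(String lastStatus) {
        return lastStatus != null && DEAD_STATUSES.contains(lastStatus);
    }

    // Runs the (hypothetical) terminateNode callback if the task is dead.
    // Returns true if termination was triggered.
    static boolean terminateIfDead(String lastStatus, Runnable terminateNode) {
        if (isDead(lastStatus)) {
            terminateNode.run();
            return true;
        }
        return false;
    }
}
```

A periodic check could call `terminateIfDead` with the task's last reported status and a callback that removes the agent, so the pipeline step fails instead of waiting for the timeout.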

A similar feature has been implemented in the kubernetes plugin: https://issues.jenkins.io/browse/JENKINS-59340

Upstream changes

No response

wenduwan commented 1 year ago

I am currently preparing a patch for review.

Stericson commented 1 year ago

Merged. Thanks for the contribution!

jenkinsci / amazon-ecs-plugin

Terminate node if the ECS task has stopped to fast fail execution #312