jenkinsci / amazon-ecs-plugin

Amazon EC2 Container Service Plugin for Jenkins
https://plugins.jenkins.io/amazon-ecs
MIT License
192 stars 227 forks source link

Very often get two Fargate tasks when only one agent is needed #311

Open helloPiers opened 1 year ago

helloPiers commented 1 year ago

Jenkins and plugins versions report

Environment ```text Jenkins: 2.394 OS: Linux - 5.10.167-147.601.amzn2.x86_64 Java: 17.0.6 - Eclipse Adoptium (OpenJDK 64-Bit Server VM) --- amazon-ecs:1.47 apache-httpcomponents-client-4-api:4.5.14-150.v7a_b_9d17134a_5 aws-credentials:191.vcb_f183ce58b_9 aws-java-sdk-ec2:1.12.406-370.v8f993c987059 aws-java-sdk-ecs:1.12.406-370.v8f993c987059 aws-java-sdk-efs:1.12.406-370.v8f993c987059 aws-java-sdk-minimal:1.12.406-370.v8f993c987059 bootstrap5-api:5.2.1-3 bouncycastle-api:2.27 branch-api:2.1071.v1a_188a_562481 caffeine-api:2.9.3-65.v6a_47d0f4d1fe checks-api:1.8.1 cloudbees-folder:6.815.v0dd5a_cb_40e0e command-launcher:90.v669d7ccb_7c31 commons-lang3-api:3.12.0-36.vd97de6465d5b_ commons-text-api:1.10.0-36.vc008c8fcda_7b_ configuration-as-code:1569.vb_72405b_80249 configurationslicing:430.v966357576543 credentials:1214.v1de940103927 credentials-binding:523.vd859a_4b_122e6 data-tables-api:1.12.1-4 display-url-api:2.3.7 durable-task:504.vb10d1ae5ba2f echarts-api:5.4.0-1 font-awesome-api:6.2.1-1 git:5.0.0 git-client:4.1.0 instance-identity:142.v04572ca_5b_265 ionicons-api:31.v4757b_6987003 jackson2-api:2.14.2-319.v37853346a_229 jakarta-activation-api:2.0.1-3 jakarta-mail-api:2.0.1-3 javax-activation-api:1.2.0-6 javax-mail-api:1.6.2-9 jaxb:2.3.8-1 jdk-tool:63.v62d2fd4b_4793 jquery3-api:3.6.1-2 junit:1177.v90374a_ef4d09 lockable-resources:1131.vb_7c3d377e723 mailer:448.v5b_97805e3767 matrix-project:785.v06b_7f47b_c631 metrics:4.2.13-420.vea_2f17932dd6 mina-sshd-api-common:2.9.2-50.va_0e1f42659a_a mina-sshd-api-core:2.9.2-50.va_0e1f42659a_a next-build-number:1.8 pipeline-build-step:486.vd08f550cceee pipeline-graph-analysis:202.va_d268e64deb_3 pipeline-graph-view:175.vb_1a_b_b_cd0fb_86 pipeline-groovy-lib:629.vb_5627b_ee2104 pipeline-input-step:466.v6d0a_5df34f81 pipeline-milestone-step:111.v449306f708b_7 pipeline-model-api:2.2118.v31fd5b_9944b_5 pipeline-model-definition:2.2118.v31fd5b_9944b_5 pipeline-model-extensions:2.2118.v31fd5b_9944b_5 pipeline-rest-api:2.31 pipeline-stage-step:305.ve96d0205c1c6 pipeline-stage-tags-metadata:2.2118.v31fd5b_9944b_5 pipeline-stage-view:2.31 plain-credentials:143.v1b_df8b_d3b_e48 plugin-util-api:2.20.0 popper2-api:2.11.6-2 scm-api:631.v9143df5b_e4a_a script-security:1229.v4880b_b_e905a_6 snakeyaml-api:1.33-95.va_b_a_e3e47b_fa_4 ssh-credentials:305.v8f4381501156 sshd:3.275.v9e17c10f2571 structs:324.va_f5d6774f3a_d trilead-api:2.84.v72119de229b_7 variant:59.vf075fe829ccb workflow-aggregator:596.v8c21c963d92d workflow-api:1208.v0cc7c6e0da_9e workflow-basic-steps:1010.vf7a_b_98e847c1 workflow-cps:3641.vf58904a_b_b_5d8 workflow-durable-task-step:1234.v019404b_3832a workflow-job:1282.ve6d865025906 workflow-multibranch:733.v109046189126 workflow-scm-step:400.v6b_89a_1317c9a_ workflow-step-api:639.v6eca_cd8c04a_a_ workflow-support:839.v35e2736cfd5c ```

What Operating System are you using (both controller, and any agents involved in the problem)?

Linux in Fargate containers, using images based on jenkins/jenkins:alpine-jdk17 and jenkins/inbound-agent:jdk17.

Reproduction steps

  1. Wait until no ECS Cloud agents are running
  2. Start a job that needs a agent (main node does not run jobs)

Expected Results

A new Fargate Task for a new Agent node started up, does the Job.

Actual Results

A new Fargate Task is started by the ECS Cloud and runs the Job. Meanwhile a second, identical Fargate Task starts and does nothing as there are no other jobs to run.

After the idle period both are stopped.

Anything else?

I've tried adjusting various timeouts and other values such as poll times for Tasks, to no avail.

It seems like the Task is not considered started quickly enough to avoid Jenkins asking again for a provision?

There are two c.c.j.plugins.amazonecs.ECSCloud#provision: Asked to provision 1 agent(s) for: generic entries - here are what look like the important logs extracted from the full log included below:

2023-03-13 17:13:52.645+0000 [id=41]    INFO    c.c.j.plugins.amazonecs.ECSCloud#provision: Asked to provision 1 agent(s) for: generic
2023-03-13 17:14:43.191+0000 [id=6870]  INFO    h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #16 from /10.28.137.77:48718

2023-03-13 17:14:52.645+0000 [id=29]    INFO    c.c.j.plugins.amazonecs.ECSCloud#provision: Asked to provision 1 agent(s) for: generic

2023-03-13 17:15:18.407+0000 [id=6855]  INFO    c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [fargate-agent-cloud-generic-211x7]: Task started, waiting for agent to become online
2023-03-13 17:15:18.407+0000 [id=6855]  INFO    c.c.j.p.amazonecs.ECSLauncher#waitForAgent: [fargate-agent-cloud-generic-211x7]: Agent connected
2023-03-13 17:15:18.459+0000 [id=6856]  INFO    c.c.j.p.amazonecs.ECSComputer#taskAccepted: [fargate-agent-cloud-generic-211x7]: JobName: FreeStyle1

2023-03-13 17:16:15.305+0000 [id=6901]  INFO    h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #18 from /10.28.107.11:55180
2023-03-13 17:16:18.565+0000 [id=6874]  INFO    c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [fargate-agent-cloud-generic-zfqhk]: Task started, waiting for agent to become online
2023-03-13 17:16:18.565+0000 [id=6874]  INFO    c.c.j.p.amazonecs.ECSLauncher#waitForAgent: [fargate-agent-cloud-generic-zfqhk]: Agent connected

Node fargate-agent-cloud-generic-211x7 (first to launch) carried out the job.

Full controller node log ```text 2023-03-13 17:13:52.645+0000 [id=41] INFO c.c.j.plugins.amazonecs.ECSCloud#provision: Asked to provision 1 agent(s) for: generic 2023-03-13 17:13:52.645+0000 [id=41] INFO c.c.j.plugins.amazonecs.ECSCloud#provision: Will provision fargate-agent-cloud-generic-211x7, for label: generic 2023-03-13 17:14:02.682+0000 [id=40] INFO hudson.slaves.NodeProvisioner#update: fargate-agent-cloud-generic-211x7 provisioning successfully completed. We have now 2 computer(s) 2023-03-13 17:14:02.721+0000 [id=6855] INFO c.c.j.p.amazonecs.ECSLauncher#runECSTask: [fargate-agent-cloud-generic-211x7]: Starting agent with task definition arn:aws:ecs:eu-west-1:014056181913:task-definition/jenkins-fargate-test-agent-generic:16} 2023-03-13 17:14:03.293+0000 [id=6855] INFO c.c.j.p.amazonecs.ECSLauncher#runECSTask: [fargate-agent-cloud-generic-211x7]: Agent started with task arn : arn:aws:ecs:eu-west-1:014056181913:task/jenkins-fargate-test/07834b1193bf45db8ac88095db2fe31b 2023-03-13 17:14:03.293+0000 [id=6855] INFO c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [fargate-agent-cloud-generic-211x7]: TaskArn: arn:aws:ecs:eu-west-1:014056181913:task/jenkins-fargate-test/07834b1193bf45db8ac88095db2fe31b 2023-03-13 17:14:03.293+0000 [id=6855] INFO c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [fargate-agent-cloud-generic-211x7]: TaskDefinitionArn: arn:aws:ecs:eu-west-1:014056181913:task-definition/jenkins-fargate-test-agent-generic:16 2023-03-13 17:14:03.293+0000 [id=6855] INFO c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [fargate-agent-cloud-generic-211x7]: ClusterArn: arn:aws:ecs:eu-west-1:014056181913:cluster/jenkins-fargate-test 2023-03-13 17:14:03.293+0000 [id=6855] INFO c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [fargate-agent-cloud-generic-211x7]: ContainerInstanceArn: null 2023-03-13 17:14:43.110+0000 [id=6869] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #15 from /10.28.137.77:48708 failed: null 2023-03-13 17:14:43.191+0000 [id=6870] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #16 from /10.28.137.77:48718 2023-03-13 17:14:52.645+0000 [id=29] INFO c.c.j.plugins.amazonecs.ECSCloud#provision: Asked to provision 1 agent(s) for: generic 2023-03-13 17:14:52.645+0000 [id=29] INFO c.c.j.plugins.amazonecs.ECSCloud#provision: Will provision fargate-agent-cloud-generic-zfqhk, for label: generic 2023-03-13 17:15:02.679+0000 [id=38] INFO hudson.slaves.NodeProvisioner#update: fargate-agent-cloud-generic-zfqhk provisioning successfully completed. We have now 3 computer(s) 2023-03-13 17:15:02.715+0000 [id=6874] INFO c.c.j.p.amazonecs.ECSLauncher#runECSTask: [fargate-agent-cloud-generic-zfqhk]: Starting agent with task definition arn:aws:ecs:eu-west-1:014056181913:task-definition/jenkins-fargate-test-agent-generic:16} 2023-03-13 17:15:03.448+0000 [id=6874] INFO c.c.j.p.amazonecs.ECSLauncher#runECSTask: [fargate-agent-cloud-generic-zfqhk]: Agent started with task arn : arn:aws:ecs:eu-west-1:014056181913:task/jenkins-fargate-test/ea715dd06a4d4b8389773367bb72e817 2023-03-13 17:15:03.448+0000 [id=6874] INFO c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [fargate-agent-cloud-generic-zfqhk]: TaskArn: arn:aws:ecs:eu-west-1:014056181913:task/jenkins-fargate-test/ea715dd06a4d4b8389773367bb72e817 2023-03-13 17:15:03.448+0000 [id=6874] INFO c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [fargate-agent-cloud-generic-zfqhk]: TaskDefinitionArn: arn:aws:ecs:eu-west-1:014056181913:task-definition/jenkins-fargate-test-agent-generic:16 2023-03-13 17:15:03.448+0000 [id=6874] INFO c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [fargate-agent-cloud-generic-zfqhk]: ClusterArn: arn:aws:ecs:eu-west-1:014056181913:cluster/jenkins-fargate-test 2023-03-13 17:15:03.448+0000 [id=6874] INFO c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [fargate-agent-cloud-generic-zfqhk]: ContainerInstanceArn: null 2023-03-13 17:15:18.407+0000 [id=6855] INFO c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [fargate-agent-cloud-generic-211x7]: Task started, waiting for agent to become online 2023-03-13 17:15:18.407+0000 [id=6855] INFO c.c.j.p.amazonecs.ECSLauncher#waitForAgent: [fargate-agent-cloud-generic-211x7]: Agent connected 2023-03-13 17:15:18.459+0000 [id=6856] INFO c.c.j.p.amazonecs.ECSComputer#taskAccepted: [fargate-agent-cloud-generic-211x7]: JobName: FreeStyle1 2023-03-13 17:15:18.459+0000 [id=6856] INFO c.c.j.p.amazonecs.ECSComputer#taskAccepted: [fargate-agent-cloud-generic-211x7]: JobUrl: job/FreeStyle1/ 2023-03-13 17:16:15.185+0000 [id=6900] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #17 from /10.28.107.11:55170 failed: null 2023-03-13 17:16:15.305+0000 [id=6901] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #18 from /10.28.107.11:55180 2023-03-13 17:16:18.565+0000 [id=6874] INFO c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [fargate-agent-cloud-generic-zfqhk]: Task started, waiting for agent to become online 2023-03-13 17:16:18.565+0000 [id=6874] INFO c.c.j.p.amazonecs.ECSLauncher#waitForAgent: [fargate-agent-cloud-generic-zfqhk]: Agent connected 2023-03-13 17:20:56.862+0000 [id=35] INFO h.slaves.CloudRetentionStrategy#check: Disconnecting fargate-agent-cloud-generic-211x7 2023-03-13 17:20:56.862+0000 [id=35] INFO c.c.j.plugins.amazonecs.ECSSlave#_terminate: [fargate-agent-cloud-generic-211x7]: Stopping: TaskArn arn:aws:ecs:eu-west-1:014056181913:task/jenkins-fargate-test/07834b1193bf45db8ac88095db2fe31b, ClusterArn arn:aws:ecs:eu-west-1:014056181913:cluster/jenkins-fargate-test 2023-03-13 17:20:56.866+0000 [id=35] INFO c.c.j.p.amazonecs.ECSService#stopTask: Delete ECS agent task: arn:aws:ecs:eu-west-1:014056181913:task/jenkins-fargate-test/07834b1193bf45db8ac88095db2fe31b 2023-03-13 17:20:56.928+0000 [id=35] INFO h.slaves.CloudRetentionStrategy#check: Disconnecting fargate-agent-cloud-generic-zfqhk 2023-03-13 17:20:56.928+0000 [id=35] INFO c.c.j.plugins.amazonecs.ECSSlave#_terminate: [fargate-agent-cloud-generic-zfqhk]: Stopping: TaskArn arn:aws:ecs:eu-west-1:014056181913:task/jenkins-fargate-test/ea715dd06a4d4b8389773367bb72e817, ClusterArn arn:aws:ecs:eu-west-1:014056181913:cluster/jenkins-fargate-test 2023-03-13 17:20:56.929+0000 [id=6948] INFO j.s.DefaultJnlpSlaveReceiver#channelClosed: Computer.threadPoolForRemoting [#62] for fargate-agent-cloud-generic-211x7 terminated: java.nio.channels.ClosedChannelException 2023-03-13 17:20:56.930+0000 [id=35] INFO c.c.j.p.amazonecs.ECSService#stopTask: Delete ECS agent task: arn:aws:ecs:eu-west-1:014056181913:task/jenkins-fargate-test/ea715dd06a4d4b8389773367bb72e817 2023-03-13 17:20:56.985+0000 [id=6949] INFO j.s.DefaultJnlpSlaveReceiver#channelClosed: Computer.threadPoolForRemoting [#63] for fargate-agent-cloud-generic-zfqhk terminated: java.nio.channels.ClosedChannelException ```
maiconbaum commented 11 months ago

Same problem here.

maiconbaum commented 11 months ago

@helloPiers

Kinda solve this increasing the initialDelay of NodeProvisioner:

{
          name  = "JAVA_OPTS"
          value = "-Dhudson.slaves.NodeProvisioner.initialDelay=60 -Dhudson.slaves.NodeProvisioner.MARGIN=50 -Dhudson.slaves.NodeProvisioner.MARGIN0=0.85"
}

This setting increases the node provisioner delay typically to 600 seconds. Depending on hudson.model.LoadStatistics.clock.

Refer to this link if it helps. Official "documentation" can be found here.

helloPiers commented 11 months ago

Thanks, that's interesting - I hadn't delved that deep into the internal configuration before.

I ended up writing my own plugin, which of course is much easier to do for a single use case that I define for myself, than for a general tool for a mass audience - I don't underestimate the effort of maintaining a public plugin.

(Aside: even making my own simple plugin for Fargate would have been almost impossible given the state of the plugin development documentation, without using this plugin as inspiration.)