jenkinsci / amazon-ecs-plugin

Amazon EC2 Container Service Plugin for Jenkins
https://plugins.jenkins.io/amazon-ecs
MIT License
192 stars 226 forks source link

make launch method thread safe #280

Closed wuillaum closed 1 year ago

wuillaum commented 1 year ago

This PR adds synchronize to the launch method to prevent multiple threads simultaneously spinning up the same computer.

In a worse case scenario, simultaneous calls to launch will cause containers to spin up in ECS that will attempt to connect to Jenkins forever, but will fail with with the following traceback

INFO: [JNLP4-connect connection to redacted-ip] Local headers refused by remote: Unknown client name: redacted-computer-name
Sep 13, 2022 3:01:05 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Protocol JNLP4-connect encountered an unexpected exception
java.util.concurrent.ExecutionException: org.jenkinsci.remoting.protocol.impl.ConnectionRefusalException: Unknown client name: redacted-computer-name
    at org.jenkinsci.remoting.util.SettableFuture.get(SettableFuture.java:223)
    at hudson.remoting.Engine.innerRun(Engine.java:799)
    at hudson.remoting.Engine.run(Engine.java:544)
Caused by: org.jenkinsci.remoting.protocol.impl.ConnectionRefusalException: Unknown client name: redacted-computer-name
    at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.newAbortCause(ConnectionHeadersFilterLayer.java:380)
    at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.onRecvClosed(ConnectionHeadersFilterLayer.java:435)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:825)
    at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:289)
    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:168)
    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:825)
    at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:155)
    at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$1500(BIONetworkLayer.java:51)
    at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:257)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:126)
    at java.lang.Thread.run(Thread.java:750)
    Suppressed: java.nio.channels.ClosedChannelException
        ... 7 more

Sep 13, 2022 3:01:05 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: reconnect rejected, sleeping 10s: 
java.lang.Exception: The server rejected the connection: None of the protocols were accepted
    at hudson.remoting.Engine.onConnectionRejected(Engine.java:878)
    at hudson.remoting.Engine.innerRun(Engine.java:825)
    at hudson.remoting.Engine.run(Engine.java:544)

If we take a look at the logs from Jenkins, we get a better picture of what's going on.

Sep  8 02:24:09 jenkins jenkins[86776]: 2022-09-08 02:24:09.148+0000 [id=10094300]#011INFO#011c.c.j.p.a.p.ECSDeclarativeAgent#getAsArgs: In getAsArgs. label: aJenkinsLabel-116-t03dc
Sep  8 02:24:09 jenkins jenkins[86776]: 2022-09-08 02:24:09.148+0000 [id=10094300]#011INFO#011c.c.j.p.a.p.ECSDeclarativeAgent#getAsArgs: In getAsArgs. argMap: {assignPublicIp=false, cloud=JenkinsCloudName, cpu=1012, image=1111111.dkr.ecr.us-east-1.amazonaws.com/jenkins-agents:simple-agent, label=aJenkinsLabel-116-t03dc, memory=4050, name=aJenkinsLabel-116-t03dc, overrides=[cloud, cpu, image, memory, portMappings, label], portMappings=[PortMappingEntry{containerPort=22, hostPort=0, protocol='tcp}], privileged=false, uniqueRemoteFSRoot=false}
Sep  8 02:24:09 jenkins jenkins[86776]: 2022-09-08 02:24:09.161+0000 [id=10094300]#011INFO#011c.c.j.p.a.p.ECSTaskTemplateStepExecution#start: Registering task template with name aJenkinsLabel-116-t03dc-7xc42
Sep  8 02:24:09 jenkins jenkins[86776]: 2022-09-08 02:24:09.197+0000 [id=10094300]#011INFO#011c.c.j.p.amazonecs.ECSService#findTaskDefinition: No existing task definition found for family or ARN: aJenkinsJob-116-t03dc-7xc42
Sep  8 02:24:09 jenkins jenkins[86776]: 2022-09-08 02:24:09.268+0000 [id=10094300]#011INFO#011c.c.j.p.amazonecs.ECSService#registerTemplate: Created Task Definition: {TaskDefinitionArn: arn:aws:ecs:us-east-1:1111111:task-definition/aJenkinsJob-116-t03dc-7xc42:1,ContainerDefinitions: [{Name: aJenkinsJob-116-t03dc-7xc42,Image: 1111111.dkr.ecr.us-east-1.amazonaws.com/jenkins-agents:simple-agent,Cpu: 1012,Memory: 4050,Links: [],PortMappings: [{ContainerPort: 22,HostPort: 0,Protocol: tcp}],Essential: true,EntryPoint: [],Command: [],Environment: [],EnvironmentFiles: [],MountPoints: [],VolumesFrom: [],Secrets: [],DependsOn: [],Privileged: false,DnsServers: [],DnsSearchDomains: [],ExtraHosts: [],DockerSecurityOptions: [],Ulimits: [],SystemControls: [],ResourceRequirements: [],}],Family: aJenkinsJob-116-t03dc-7xc42,Revision: 1,Volumes: [],Status: ACTIVE,RequiresAttributes: [{Name: com.amazonaws.ecs.capability.ecr-auth,}],PlacementConstraints: [],Compatibilities: [EXTERNAL, EC2],RequiresCompatibilities: [],InferenceAccelerators: [],RegisteredAt: Thu Sep 08 02:24:09 UTC 2022,RegisteredBy: arn:aws:sts::1111111:assumed-role/jenkins/i-0,}
Sep  8 02:24:09 jenkins jenkins[86776]: 2022-09-08 02:24:09.268+0000 [id=10094300]#011INFO#011c.c.j.plugins.amazonecs.ECSCloud#addDynamicTemplate: Task definition created or found: ARN: arn:aws:ecs:us-east-1:1111111:task-definition/aJenkinsJob-116-t03dc-7xc42:1
Sep  8 02:24:15 jenkins jenkins[86776]: 2022-09-08 02:24:15.984+0000 [id=147]#011INFO#011c.c.j.plugins.amazonecs.ECSCloud#provision: Asked to provision 1 agent(s) for: aJenkinsLabel-116-t03dc
Sep  8 02:24:15 jenkins jenkins[86776]: 2022-09-08 02:24:15.985+0000 [id=147]#011INFO#011c.c.j.plugins.amazonecs.ECSCloud#provision: Will provision aJenkinsJob-116-t03dc-th5x5, for label: aJenkinsLabel-116-t03dc
Sep  8 02:24:26 jenkins jenkins[86776]: 2022-09-08 02:24:26.074+0000 [id=139]#011INFO#011hudson.slaves.NodeProvisioner#update: aJenkinsJob-116-t03dc-th5x5 provisioning successfully completed. We have now 75 computer(s)

everything is normal so far, note "Asked to provision 1 agent(s)"

Sep  8 02:24:26 jenkins jenkins[86776]: 2022-09-08 02:24:26.095+0000 [id=10094155]#011INFO#011c.c.j.p.amazonecs.ECSLauncher#runECSTask: [aJenkinsJob-116-t03dc-th5x5]: Starting agent with task definition arn:aws:ecs:us-east-1:1111111:task-definition/aJenkinsJob-116-t03dc-7xc42:1}
Sep  8 02:24:26 jenkins jenkins[86776]: 2022-09-08 02:24:26.199+0000 [id=10094155]#011INFO#011c.c.j.p.amazonecs.ECSLauncher#runECSTask: [aJenkinsJob-116-t03dc-th5x5]: Agent started with task arn : arn:aws:ecs:us-east-1:1111111:task/ECSClusterName/45bc
Sep  8 02:24:26 jenkins jenkins[86776]: 2022-09-08 02:24:26.199+0000 [id=10094155]#011INFO#011c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [aJenkinsJob-116-t03dc-th5x5]: TaskArn: arn:aws:ecs:us-east-1:1111111:task/ECSClusterName/45bc
Sep  8 02:24:26 jenkins jenkins[86776]: 2022-09-08 02:24:26.199+0000 [id=10094155]#011INFO#011c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [aJenkinsJob-116-t03dc-th5x5]: TaskDefinitionArn: arn:aws:ecs:us-east-1:1111111:task-definition/aJenkinsJob-116-t03dc-7xc42:1
Sep  8 02:24:26 jenkins jenkins[86776]: 2022-09-08 02:24:26.199+0000 [id=10094155]#011INFO#011c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [aJenkinsJob-116-t03dc-th5x5]: ClusterArn: arn:aws:ecs:us-east-1:1111111:cluster/ECSClusterName
Sep  8 02:24:26 jenkins jenkins[86776]: 2022-09-08 02:24:26.199+0000 [id=10094155]#011INFO#011c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [aJenkinsJob-116-t03dc-th5x5]: ContainerInstanceArn: arn:aws:ecs:us-east-1:1111111:container-instance/ECSClusterName/7f6c

here we've started a task `45bc` for the label ending in `th5x5`

Sep  8 02:24:26 jenkins jenkins[86776]: 2022-09-08 02:24:26.228+0000 [id=10094150]#011INFO#011c.c.j.p.amazonecs.ECSLauncher#runECSTask: [aJenkinsJob-116-t03dc-th5x5]: Starting agent with task definition arn:aws:ecs:us-east-1:1111111:task-definition/aJenkinsJob-116-t03dc-7xc42:1}
Sep  8 02:24:26 jenkins jenkins[86776]: 2022-09-08 02:24:26.349+0000 [id=10094150]#011INFO#011c.c.j.p.amazonecs.ECSLauncher#runECSTask: [aJenkinsJob-116-t03dc-th5x5]: Agent started with task arn : arn:aws:ecs:us-east-1:1111111:task/ECSClusterName/f42e
Sep  8 02:24:26 jenkins jenkins[86776]: 2022-09-08 02:24:26.350+0000 [id=10094150]#011INFO#011c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [aJenkinsJob-116-t03dc-th5x5]: TaskArn: arn:aws:ecs:us-east-1:1111111:task/ECSClusterName/f42e
Sep  8 02:24:26 jenkins jenkins[86776]: 2022-09-08 02:24:26.350+0000 [id=10094150]#011INFO#011c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [aJenkinsJob-116-t03dc-th5x5]: TaskDefinitionArn: arn:aws:ecs:us-east-1:1111111:task-definition/aJenkinsJob-116-t03dc-7xc42:1
Sep  8 02:24:26 jenkins jenkins[86776]: 2022-09-08 02:24:26.350+0000 [id=10094150]#011INFO#011c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [aJenkinsJob-116-t03dc-th5x5]: ClusterArn: arn:aws:ecs:us-east-1:1111111:cluster/ECSClusterName
Sep  8 02:24:26 jenkins jenkins[86776]: 2022-09-08 02:24:26.350+0000 [id=10094150]#011INFO#011c.c.j.p.amazonecs.ECSLauncher#launchECSTask: [aJenkinsJob-116-t03dc-th5x5]: ContainerInstanceArn: arn:aws:ecs:us-east-1:1111111:container-instance/ECSClusterName/36c5

uh oh... just a fraction of a second later we've spun up another task, `f42e`, for the same label `th5x5`
what we've actually done is overwrite the task `45bc` with `f42e`, and now when `45bc` attempts to connect it will forever be rejected by Jenkins.

An identical commit has been made to the kubernetes plugin, it seems that they have suffered from similar issues in the past. https://github.com/jenkinsci/kubernetes-plugin/commit/3e0fd5045caa97962ac499e8f0a87776a26ff0c8

Stericson commented 1 year ago

@wuillaum If this is ready to review, can you move it out of Draft?