jenkinsci / azure-container-agents-plugin

Azure Container Agents Plugin for Jenkins
https://plugins.jenkins.io/azure-container-agents/
MIT License
19 stars 26 forks source link

Many jobs based on the same agent template produce many failed deployment #101

Closed sparsick closed 2 years ago

sparsick commented 2 years ago

Jenkins and plugins versions report

Environment ```text Jenkins: 2.319.3 OS: Linux - 5.4.0-1068-azure --- ace-editor:1.1 ant:1.13 antisamy-markup-formatter:2.7 apache-httpcomponents-client-4-api:4.5.13-1.0 authentication-tokens:1.4 azure-container-agents:222.va22f04771da0 azure-credentials:198.vf9c2fdfde55c azure-sdk:85.v4817a_b_8a_7124 blueocean-commons:1.25.2 blueocean-rest:1.25.2 bootstrap4-api:4.6.0-3 bootstrap5-api:5.1.3-4 bouncycastle-api:2.25 branch-api:2.7.0 build-timeout:1.20 caffeine-api:2.9.2-29.v717aac953ff3 checks-api:1.7.2 cloud-stats:0.27 cloudbees-folder:6.17 command-launcher:1.6 copyartifact:1.46.2 credentials:1074.v60e6c29b_b_44b_ credentials-binding:1.27.1 display-url-api:2.3.5 docker-commons:1.19 durable-task:493.v195aefbb0ff2 echarts-api:5.2.2-2 email-ext:2.87 extended-read-permission:3.2 font-awesome-api:5.15.4-5 git:4.10.3 git-client:3.11.0 git-server:1.10 github:1.34.1 github-api:1.301-378.v9807bd746da5 github-branch-source:2.11.4 gradle:1.38 handlebars:3.0.8 jackson2-api:2.13.1-246.va8a9f3eaf46a javax-activation-api:1.2.0-2 javax-mail-api:1.6.2-5 jaxb:2.3.0.1 jdk-tool:1.5 jjwt-api:0.11.2-9.c8b45b8bb173 jnr-posix-api:3.1.7-1 jquery3-api:3.6.0-2 jsch:0.1.55.2 junit:1.54 ldap:2.7 lockable-resources:2.14 mailer:408.vd726a_1130320 matrix-auth:3.0.1 matrix-project:1.20 momentjs:1.1.1 okhttp-api:4.9.3-105.vb96869f8ac3a pam-auth:1.7 pipeline-build-step:2.15 pipeline-github-lib:36.v4c01db_ca_ed16 pipeline-graph-analysis:188.v3a01e7973f2c pipeline-input-step:446.vf27b_0b_83500e pipeline-milestone-step:100.v60a_03cd446e1 pipeline-model-api:1.9.3 pipeline-model-definition:1.9.3 pipeline-model-extensions:1.9.3 pipeline-rest-api:2.21 pipeline-stage-step:291.vf0a8a7aeeb50 pipeline-stage-tags-metadata:1.9.3 pipeline-stage-view:2.21 plain-credentials:1.8 plugin-util-api:2.13.0 popper-api:1.16.1-2 popper2-api:2.11.2-1 resource-disposer:0.17 scm-api:595.vd5a_df5eb_0e39 script-security:1131.v8b_b_5eda_c328e snakeyaml-api:1.29.1 ssh-credentials:1.19 ssh-slaves:1.33.0 sshd:3.1.0 structs:308.v852b473a2b8c timestamper:1.17 token-macro:270.v8c2ce50dc4fc trilead-api:1.0.13 windows-azure-storage:368.v7443dd3deffe workflow-aggregator:2.6 workflow-api:1136.v7f5f1759dc16 workflow-basic-steps:2.24 workflow-cps:2648.va9433432b33c workflow-cps-global-lib:552.vd9cc05b8a2e1 workflow-durable-task-step:1121.va_65b_d2701486 workflow-job:1145.v7f2433caa07f workflow-multibranch:706.vd43c65dec013 workflow-scm-step:2.13 workflow-step-api:622.vb_8e7c15b_c95a_ workflow-support:813.vb_d7c3d2984a_0 ws-cleanup:0.40 ```

What Operating System are you using (both controller, and any agents involved in the problem)?

Contoller based on Ubuntu 18.04 Agents are Container Instances in Azure based on following Container Template

      <templates>
        <com.microsoft.jenkins.containeragents.aci.AciContainerTemplate>
          <name>test-private</name>
          <label>test-private</label>
          <image>jenkins/inbound-agent</image>
          <osType>Linux</osType>
          <command>jenkins-agent -url ${rootUrl} ${secret} ${nodeName}</command>
          <rootFs>/home/jenkins</rootFs>
          <privateIpAddress>
            <vnet>jenkins-vnet</vnet>
            <subnet>jenkins-agent-subnet</subnet>
            <resourceGroup>jenkins-vnet-rg</resourceGroup>
          </privateIpAddress>
          <timeout>5</timeout>
          <ports/>
          <cpu>1</cpu>
          <memory>1.5</memory>
          <retentionStrategy class="com.microsoft.jenkins.containeragents.strategy.ContainerOnceRetentionStrategy">
            <idleMinutes>10</idleMinutes>
          </retentionStrategy>
          <envVars/>
          <privateRegistryCredentials/>
          <volumes/>
          <isAvailable>true</isAvailable>
        </com.microsoft.jenkins.containeragents.aci.AciContainerTemplate>

Reproduction steps

  1. Jenkins VM is installed in Azure
  2. Agent Container using Private IP Address Option.
  3. Start parallel several (I took 5) jobs based on following pipeline script

    pipeline {
    agent { label "test-private" }
    
    stages {
        stage('Hello') {
            steps {
                echo 'Hello World'
            }
        }
    }
    }
  4. Wait 5 minutes and then check the logs

Expected Results

No failing deployments and the whole test should not need 5 minutes.

Actual Results

After modified https://github.com/jenkinsci/azure-container-agents-plugin/blob/a22f04771da02f714abab78022efdb38c8bc1fe3/src/main/java/com/microsoft/jenkins/containeragents/aci/AciCloud.java#L133 to

                                LOGGER.log(Level.WARNING, "AciCloud: Provision agent {0} failed: {1}",
                                        new Object[] {agent.getNodeName(), e});
                                e.printStackTrace();

you can see the following information (only snippet) in /var/log/jenkins/jenkins.log

2022-02-11 08:56:31.513+0000 [id=36]    INFO    c.m.j.c.aci.AciCloud#provision: Start ACI container for label test-private workLoad 1
2022-02-11 08:56:31.515+0000 [id=36]    INFO    c.m.j.c.aci.AciCloud#provision: Using ACI Container template: test-private
2022-02-11 08:56:31.524+0000 [id=74]    INFO    c.m.j.c.aci.AciCloud#lambda$provision$1: Add ACI node: test-private-mswkt
2022-02-11 08:56:31.528+0000 [id=36]    INFO    h.s.NodeProvisioner$StandardStrategyImpl#apply: Started provisioning test-private-mswkt from Aci with 1 executors. Remaining excess workload: 0
2022-02-11 08:56:31.913+0000 [id=74]    INFO    c.m.j.c.a.AciCleanTask$DeploymentRegistrar#registerDeployment: AzureAciCleanUpTask: registerDeployment: Registering deployment test-private-zmv9hzln in jenkins-rg
2022-02-11 08:56:32.286+0000 [id=74]    INFO    c.a.c.util.logging.ClientLogger#info: Package versions: jackson-annotations=2.13.1, jackson-core=2.13.1, jackson-databind=2.13.1, jackson-dataformat-xml=2.13.1, jackson-datatype-jsr310=2.13.1, azure-core=1.24.1, Troubleshooting version conflicts: https://aka.ms/azsdk/java/dependency/troubleshoot
2022-02-11 08:56:39.314+0000 [id=90]    INFO    c.a.c.util.logging.ClientLogger#performLogging: Azure Identity => getToken() result for scopes [https://management.core.windows.net//.default]: SUCCESS
2022-02-11 08:56:39.314+0000 [id=90]    INFO    c.a.c.util.logging.ClientLogger#info: Acquired a new access token.
2022-02-11 08:56:39.944+0000 [id=74]    INFO    c.m.j.c.aci.AciService#createDeployment: Waiting for deployment test-private-zmv9hzln
2022-02-11 08:57:11.510+0000 [id=32]    INFO    c.m.j.c.aci.AciCloud#provision: Start ACI container for label test-private workLoad 1
2022-02-11 08:57:11.511+0000 [id=32]    INFO    c.m.j.c.aci.AciCloud#provision: Using ACI Container template: test-private
2022-02-11 08:57:11.511+0000 [id=73]    INFO    c.m.j.c.aci.AciCloud#lambda$provision$1: Add ACI node: test-private-pcsmp
2022-02-11 08:57:11.512+0000 [id=32]    INFO    h.s.NodeProvisioner$StandardStrategyImpl#apply: Started provisioning test-private-pcsmp from Aci with 1 executors. Remaining excess workload: -0.28
2022-02-11 08:57:11.544+0000 [id=73]    INFO    c.m.j.c.a.AciCleanTask$DeploymentRegistrar#registerDeployment: AzureAciCleanUpTask: registerDeployment: Registering deployment test-private-m4cwglm5 in jenkins-rg
2022-02-11 08:57:11.786+0000 [id=73]    INFO    c.m.j.c.aci.AciService#createDeployment: Waiting for deployment test-private-m4cwglm5
2022-02-11 08:57:39.993+0000 [id=95]    INFO    h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #1 failed: java.io.EOFException
2022-02-11 08:57:40.060+0000 [id=96]    INFO    h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #2 from /20.123.8.94:9025
2022-02-11 08:57:41.510+0000 [id=41]    INFO    c.m.j.c.aci.AciCloud#provision: Start ACI container for label test-private workLoad 1
2022-02-11 08:57:41.510+0000 [id=41]    INFO    c.m.j.c.aci.AciCloud#provision: Using ACI Container template: test-private
2022-02-11 08:57:41.511+0000 [id=109]   INFO    c.m.j.c.aci.AciCloud#lambda$provision$1: Add ACI node: test-private-267hx
2022-02-11 08:57:41.511+0000 [id=41]    INFO    h.s.NodeProvisioner$StandardStrategyImpl#apply: Started provisioning test-private-267hx from Aci with 1 executors. Remaining excess workload: -0.391
2022-02-11 08:57:41.534+0000 [id=109]   INFO    c.m.j.c.a.AciCleanTask$DeploymentRegistrar#registerDeployment: AzureAciCleanUpTask: registerDeployment: Registering deployment test-private-fkx4v9fw in jenkins-rg
2022-02-11 08:57:41.764+0000 [id=109]   INFO    c.m.j.c.aci.AciService#createDeployment: Waiting for deployment test-private-fkx4v9fw
2022-02-11 08:57:47.007+0000 [id=174]   INFO    o.j.p.workflow.job.WorkflowRun#finish: test-private #49 completed: SUCCESS
2022-02-11 08:57:56.939+0000 [id=36]    INFO    c.m.j.c.s.ContainerOnceRetentionStrategy#done: terminating test-private-mswkt since PlaceholderExecutable:ExecutorStepExecution.PlaceholderTask{runId=test-private#49,label=test-private-mswkt,context=CpsStepContext[3:node]:Owner[test-private/49:test-private #49],cookie=c1b12013-ba81-4716-8597-771889790e9e,auth=null} seems to be finished
2022-02-11 08:57:56.956+0000 [id=140]   INFO    j.s.DefaultJnlpSlaveReceiver#channelClosed: Computer.threadPoolForRemoting [#52] for test-private-mswkt terminated: java.nio.channels.ClosedChannelException
2022-02-11 08:57:57.134+0000 [id=199]   INFO    c.a.c.util.logging.ClientLogger#performLogging: Azure Identity => getToken() result for scopes [https://management.core.windows.net//.default]: SUCCESS
2022-02-11 08:57:57.134+0000 [id=199]   INFO    c.a.c.util.logging.ClientLogger#info: Acquired a new access token.
2022-02-11 08:57:57.841+0000 [id=145]   INFO    c.m.j.c.aci.AciService#deleteAciContainerGroup: Delete ACI Container Group: test-private-mswkt successfully
2022-02-11 08:58:00.342+0000 [id=74]    WARNING c.m.j.c.aci.AciCloud#lambda$provision$1: AciCloud: Provision agent test-private-mswkt failed: java.lang.Exception: Deployment test-private-zmv9hzln status: Failed
java.lang.Exception: Deployment test-private-zmv9hzln status: Failed
        at com.microsoft.jenkins.containeragents.aci.AciService.createDeployment(AciService.java:65)
        at com.microsoft.jenkins.containeragents.aci.AciContainerTemplate.provisionAgents(AciContainerTemplate.java:130)
        at com.microsoft.jenkins.containeragents.aci.AciCloud.lambda$provision$1(AciCloud.java:113)
        at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
        at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
2022-02-11 08:58:01.510+0000 [id=35]    WARNING hudson.slaves.NodeProvisioner#update: Unexpected exception encountered while provisioning agent test-private-mswkt
java.lang.Exception: Deployment test-private-zmv9hzln status: Failed
        at com.microsoft.jenkins.containeragents.aci.AciService.createDeployment(AciService.java:65)
        at com.microsoft.jenkins.containeragents.aci.AciContainerTemplate.provisionAgents(AciContainerTemplate.java:130)
        at com.microsoft.jenkins.containeragents.aci.AciCloud.lambda$provision$1(AciCloud.java:113)
        at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
        at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
.....
2022-02-11 09:00:21.512+0000 [id=32]    INFO    h.s.NodeProvisioner$StandardStrategyImpl#apply: Started provisioning test-private-n4s69 from Aci with 1 executors. Remaining excess workload: 0
2022-02-11 09:00:21.528+0000 [id=74]    INFO    c.m.j.c.a.AciCleanTask$DeploymentRegistrar#registerDeployment: AzureAciCleanUpTask: registerDeployment: Registering deployment test-private-hcqp2z3p in jenkins-rg
2022-02-11 09:00:21.846+0000 [id=74]    INFO    c.m.j.c.aci.AciService#createDeployment: Waiting for deployment test-private-hcqp2z3p
2022-02-11 09:01:28.549+0000 [id=264]   INFO    h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #9 failed: java.io.EOFException
2022-02-11 09:01:28.657+0000 [id=265]   INFO    h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #10 from /13.79.74.203:22465
2022-02-11 09:01:36.318+0000 [id=303]   INFO    o.j.p.workflow.job.WorkflowRun#finish: test-private #50 completed: SUCCESS
2022-02-11 09:01:42.578+0000 [id=74]    INFO    c.m.j.c.aci.AciService#createDeployment: Deployment test-private-hcqp2z3p succeed
2022-02-11 09:01:42.579+0000 [id=74]    INFO    c.m.j.c.aci.AciCloud#waitToOnline: Waiting agent test-private-n4s69 to online
2022-02-11 09:01:42.674+0000 [id=74]    WARNING c.m.j.c.aci.AciCloud#lambda$provision$1: AciCloud: Provision agent test-private-n4s69 failed: java.lang.NullPointerException
java.lang.NullPointerException
        at com.azure.resourcemanager.containerinstance.implementation.ContainerGroupImpl.initializeChildrenFromInner(ContainerGroupImpl.java:217)
        at com.azure.resourcemanager.resources.fluentcore.arm.models.implementation.GroupableParentResourceImpl.<init>(GroupableParentResourceImpl.java:32)
        at com.azure.resourcemanager.containerinstance.implementation.ContainerGroupImpl.<init>(ContainerGroupImpl.java:73)
        at com.azure.resourcemanager.containerinstance.implementation.ContainerGroupsImpl.wrapModel(ContainerGroupsImpl.java:43)
        at com.azure.resourcemanager.containerinstance.implementation.ContainerGroupsImpl.wrapModel(ContainerGroupsImpl.java:24)
        at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:113)
        at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1816)
        at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:151)
        at reactor.core.publisher.FluxContextWrite$ContextWriteSubscriber.onNext(FluxContextWrite.java:107)
        at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1816)
        at reactor.core.publisher.MonoFlatMap$FlatMapInner.onNext(MonoFlatMap.java:249)
        at reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber.onNext(FluxSwitchIfEmpty.java:74)
        at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1816)
        at reactor.core.publisher.MonoFlatMap$FlatMapInner.onNext(MonoFlatMap.java:249)
        at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1816)
        at reactor.core.publisher.MonoFlatMap$FlatMapInner.onNext(MonoFlatMap.java:249)
        at reactor.core.publisher.Operators$ScalarSubscription.request(Operators.java:2398)
        at reactor.core.publisher.MonoFlatMap$FlatMapInner.onSubscribe(MonoFlatMap.java:238)
        at reactor.core.publisher.MonoJust.subscribe(MonoJust.java:55)

On Azure Monitor page, you can see that the affected deployments were canceled.

Anything else?

No response

sparsick commented 2 years ago

When I use public ip address in the agent template, I can't see this problem.

sparsick commented 2 years ago

The NullPointerException results that the plugin terminates fully connected Azure container instances. I can't find why the Azure Resource Manager lib throw a NullPointerException. At the initial draft, I would add a workaround to ignoring this NullPointerException to check whether it solve the bug.

timja commented 2 years ago

Can you report it to azure sdk for java https://github.com/Azure/azure-sdk-for-java/ and let me know the issue?

They are quite responsive generally

sparsick commented 2 years ago

@timja Done. I will prepare a workaround for that issue.

sparsick commented 2 years ago

I found two places where the NullPointerException is thrown: