jenkinsci / azure-container-agents-plugin

Azure Container Agents Plugin for Jenkins
https://plugins.jenkins.io/azure-container-agents/
MIT License
19 stars 26 forks source link

Container instances never end up running #18

Open atrauzzi opened 6 years ago

atrauzzi commented 6 years ago

I'm trying out the ACI based cloud but am encountering an issue with both Windows and Linux based containers. While I do end up seeing that the plugin attempts to create instances, my tasks never seem to run. Instead, the instances linger for a while and then disappear, having done nothing.

Feedback on the containers both in the portal and via az seems sparse:

image

Log of container 'linux-w4gcn' in container group 'linux-w4gcn' is not available yet. Please check container 'InstanceView' for more info
rmation or retry later.

Looking at the Jenkins logs, I do see one exception per type if I leave things to run long enough:

WARNING: AciCloud: Provision agent linux-kbt38 failed: ACI container terminated
[pool-266-thread-1] INFO com.microsoft.aad.adal4j.AuthenticationAuthority - [Correlation ID: REDACTED] Instance discovery was successful
Dec 06, 2017 2:15:57 PM okhttp3.internal.platform.Platform log
INFO: ALPN callback dropped: HTTP/2 is disabled. Is alpn-boot on the boot class path?
Dec 06, 2017 2:15:57 PM hudson.slaves.NodeProvisioner$2 run
WARNING: Unexpected exception encountered while provisioning agent Linux
java.lang.Exception: java.lang.IllegalStateException: ACI container terminated
        at com.microsoft.jenkins.containeragents.aci.AciCloud$1.call(AciCloud.java:131)
        at com.microsoft.jenkins.containeragents.aci.AciCloud$1.call(AciCloud.java:80)
        at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: ACI container terminated
        at com.microsoft.jenkins.containeragents.aci.AciCloud.waitToOnline(AciCloud.java:201)
        at com.microsoft.jenkins.containeragents.aci.AciCloud.access$200(AciCloud.java:42)
        at com.microsoft.jenkins.containeragents.aci.AciCloud$1.call(AciCloud.java:107)
        ... 6 more
WARNING: Unexpected exception encountered while provisioning agent NET Framework
java.lang.Exception: java.lang.Exception: Deployment timeout
        at com.microsoft.jenkins.containeragents.aci.AciCloud$1.call(AciCloud.java:131)
        at com.microsoft.jenkins.containeragents.aci.AciCloud$1.call(AciCloud.java:80)
        at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Deployment timeout
        at com.microsoft.jenkins.containeragents.aci.AciService.createDeployment(AciService.java:122)
        at com.microsoft.jenkins.containeragents.aci.AciContainerTemplate.provisionAgents(AciContainerTemplate.java:107)
        at com.microsoft.jenkins.containeragents.aci.AciCloud$1.call(AciCloud.java:104)
        ... 6 more

Nothing fancy about my templates, just trying the defaults for now to make sure everything works:

image

image

zackliu commented 6 years ago

you run az container show combined with az container logs to get more details. The initialization failed will produce this problem, but Azure will have two different behaviors to initialization failed. One is deployment shows successful and container shows terminated, the other is endless deployment.

raphaelquati commented 6 years ago

I'm tryiing to use my docker image as a JNLP agent (windows). Same problem. But the problem is not the plugin. Is Azure itself

In Linux it is working.

zackliu commented 6 years ago

Seems it's a connection issue, but I can't tell why connection timed out. In the new version(0.4.0), we support SSH connection. You can have a try.

jamesabbottsmith commented 6 years ago

I've tried this with an SSH connection. I see similar behaviour. The container is started and appears to run, but exits before doing anything useful.

$ az container logs --name ...
Linux caas-82e05980240a429795153b976151f58a-2347572569-kqlfb 4.4.0-101-generic #124-Ubuntu SMP Fri Nov 10 18:29:59 UTC 2017 x86_64 GNU/Linux

I also see the ALPN callback dropped message in the logs

ALPN callback dropped: SPDY and HTTP/2 are disabled. Is alpn-boot on the boot class path?
zackliu commented 6 years ago

@jamesabbottsmith Can you provide more logs about SSHLauncher in Jenkins logs.

Such as:

Jan 16, 2018 6:16:32 AM INFO com.microsoft.jenkins.containeragents.remote.SSHLauncher launch SSHLauncher: Start to connect node acissh-g0w3t : 52.224.69.226 via SSH Jan 16, 2018 6:16:44 AM INFO com.microsoft.jenkins.containeragents.remote.SSHLauncher launch SSHLauncher: Copy slave.jar to remote host successfully

atrauzzi commented 6 years ago

I've noticed a lot of issues with ACI even in just manually creating containers. It seems like the service has a lot of performance and consistency issues. Definitely to the extent that a plugin like this wouldn't make sense until the start time of any Windows or Linux container can be brought to under ~15 seconds.

arjanschaaf commented 6 years ago

I'm trying to setup an Jenkins deployment using the Solution Template together with an ACI based build environment. I'm running into very similar problems with the this setup and it is impossible to debug. az container show and az container logs are impossible to use because the containers are gone before you know it.

jenkins.log shows many entries like these:

Jan 26, 2018 1:01:10 PM com.microsoft.jenkins.containeragents.aci.AciCloud provision
INFO: Start ACI container for label ACI-container workLoad 1
Jan 26, 2018 1:01:10 PM com.microsoft.jenkins.containeragents.aci.AciCloud provision
INFO: Using ACI Container template: ACI-container
Jan 26, 2018 1:01:10 PM hudson.slaves.NodeProvisioner$StandardStrategyImpl apply
INFO: Started provisioning ACI-container from AciAgents with 1 executors. Remaining excess workload: 0
Jan 26, 2018 1:01:10 PM com.microsoft.jenkins.containeragents.aci.AciCloud$1 call
INFO: Add ACI node: aci-container-8q2gl
Jan 26, 2018 1:01:10 PM okhttp3.internal.platform.Platform log
INFO: ALPN callback dropped: HTTP/2 is disabled. Is alpn-boot on the boot class path?
Jan 26, 2018 1:01:12 PM com.microsoft.jenkins.containeragents.aci.AciService createDeployment
INFO: Waiting for deployment aci-container-q2m0kn9b
Jan 26, 2018 1:01:22 PM com.microsoft.jenkins.containeragents.aci.AciCloud$1 call
WARNING: AciCloud: Provision agent aci-container-8q2gl failed: Deployment aci-container-q2m0kn9b status: Failed
Jan 26, 2018 1:01:22 PM okhttp3.internal.platform.Platform log
INFO: ALPN callback dropped: HTTP/2 is disabled. Is alpn-boot on the boot class path?
Jan 26, 2018 1:01:22 PM com.microsoft.jenkins.containeragents.aci.AciService deleteAciContainerGroup
INFO: Delete ACI Container Group: aci-container-8q2gl successfully
Jan 26, 2018 1:01:30 PM hudson.slaves.NodeProvisioner$2 run
WARNING: Unexpected exception encountered while provisioning agent ACI-container
java.lang.Exception: java.lang.Exception: Deployment aci-container-q2m0kn9b status: Failed
    at com.microsoft.jenkins.containeragents.aci.AciCloud$1.call(AciCloud.java:136)
    at com.microsoft.jenkins.containeragents.aci.AciCloud$1.call(AciCloud.java:80)
    at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Deployment aci-container-q2m0kn9b status: Failed
    at com.microsoft.jenkins.containeragents.aci.AciService.createDeployment(AciService.java:137)
    at com.microsoft.jenkins.containeragents.aci.AciContainerTemplate.provisionAgents(AciContainerTemplate.java:128)
    at com.microsoft.jenkins.containeragents.aci.AciCloud$1.call(AciCloud.java:104)
    ... 6 more
arjanschaaf commented 6 years ago

So I'm finding my way around the Azure Portal and I finally found some relevant error feedback in the Deployments view of my ResourceGroup. Turned out I needed to issue this command and then it started working!! az provider register --namespace Microsoft.ContainerInstance

chenkennt commented 6 years ago

@atrauzzi We recommend to use SSH to launch container as it's much more stable than JNLP. Please have a try and let us know if there is any issue.

atrauzzi commented 6 years ago

I don't have SSH access to launch nodes on certain services. So this is still an issue.

Why did you close this issue before I could reply?

zackliu commented 6 years ago

@atrauzzi Please provide more logs and configurations about using SSH to launch ACI as you haven't succeeded. From our test, SSH is more stable than JNLP. Please keep in contact or we don't know what going on now.

ravee-rachakonda-by commented 4 years ago

I am facing similar issue with jenkins/ssh-slave setup with this plugin. I see the agent gets deployed in ACI but gets deleted and another instance keeps coming up.

2020-03-29T17:18:15.716749905Z: [ERROR] 2020-03-29 17:18:15.716+0000 [id=112036] INFO c.m.j.c.aci.AciService#createDeployment: Deployment ztest-r3tkt376 succeed 2020-03-29T17:18:15.971375792Z: [ERROR] 2020-03-29 17:18:15.971+0000 [id=112036] INFO o.internal.platform.Platform#log: ALPN callback dropped: HTTP/2 is disabled. Is alpn-boot on the boot class path? 2020-03-29T17:18:16.119031025Z: [ERROR] 2020-03-29 17:18:16.114+0000 [id=112209] INFO c.m.j.c.remote.SSHLauncher#launch: SSHLauncher: Start to connect node ztest-kctsj : 52.152.198.206 via SSH

2020-03-29T17:18:16.422975930Z: [ERROR] 2020-03-29 17:18:16.422+0000 [id=112209] INFO c.m.j.c.remote.SSHLauncher#launch: SSHLauncher: Copy slave.jar to remote host successfully 2020-03-29T17:18:16.605629524Z: [ERROR] 2020-03-29 17:18:16.605+0000 [id=112209] INFO c.m.j.c.remote.SSHLauncher#launch: SSHLauncher: launching agent failed 2020-03-29T17:18:16.606118343Z: [ERROR] 2020-03-29 17:18:16.605+0000 [id=112036] WARNING c.m.j.c.aci.AciCloud$1#call: AciCloud: Provision agent ztest-kctsj failed: java.lang.InterruptedException: java.io.EOFException: unexpected stream termination 2020-03-29T17:18:17.050554205Z: [ERROR] 2020-03-29 17:18:17.050+0000 [id=112036] INFO o.j.p.cloudstats.CloudStatistics#getIdFor: No support for cloud-stats-plugin by class com.microsoft.jenkins.containeragents.aci.AciAgent

2020-03-29T17:18:17.241691129Z: [ERROR] 2020-03-29 17:18:17.241+0000 [id=112215] INFO o.internal.platform.Platform#log: ALPN callback dropped: HTTP/2 is disabled. Is alpn-boot on the boot class path? 2020-03-29T17:18:18.162858608Z: [ERROR] 2020-03-29 17:18:18.161+0000 [id=112209] INFO c.m.j.c.aci.AciService#deleteAciContainerGroup: Delete ACI Container Group: ztest-kctsj successfully

2020-03-29T17:18:24.527343702Z: [ERROR] 2020-03-29 17:18:24.526+0000 [id=88] WARNING hudson.slaves.NodeProvisioner#lambda$update$6: Unexpected exception encountered while provisioning agent ztest 2020-03-29T17:18:24.527390804Z: [ERROR] java.lang.InterruptedException: java.io.EOFException: unexpected stream termination 2020-03-29T17:18:24.527398804Z: [ERROR] at com.microsoft.jenkins.containeragents.remote.SSHLauncher.launch(SSHLauncher.java:121) 2020-03-29T17:18:24.527403904Z: [ERROR] at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:292) 2020-03-29T17:18:24.527409204Z: [ERROR] Caused: java.util.concurrent.ExecutionException 2020-03-29T17:18:24.527414405Z: [ERROR] at java.util.concurrent.FutureTask.report(FutureTask.java:122) 2020-03-29T17:18:24.527419405Z: [ERROR] at java.util.concurrent.FutureTask.get(FutureTask.java:192) 2020-03-29T17:18:24.527424205Z: [ERROR] at com.microsoft.jenkins.containeragents.aci.AciCloud$1.call(AciCloud.java:115) 2020-03-29T17:18:24.527429005Z: [ERROR] Caused: java.lang.Exception 2020-03-29T17:18:24.527433705Z: [ERROR] at com.microsoft.jenkins.containeragents.aci.AciCloud$1.call(AciCloud.java:140) 2020-03-29T17:18:24.527438606Z: [ERROR] at com.microsoft.jenkins.containeragents.aci.AciCloud$1.call(AciCloud.java:80) 2020-03-29T17:18:24.527443306Z: [ERROR] at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46) 2020-03-29T17:18:24.527448206Z: [ERROR] at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71) 2020-03-29T17:18:24.527452906Z: [ERROR] at java.util.concurrent.FutureTask.run(FutureTask.java:266) 2020-03-29T17:18:24.527457606Z: [ERROR] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 2020-03-29T17:18:24.527462407Z: [ERROR] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 2020-03-29T17:18:24.527467207Z: [ERROR] at java.lang.Thread.run(Thread.java:748) 2020-03-29T17:18:24.527471907Z: [ERROR] 2020-03-29 17:18:24.527+0000 [id=88] WARNING c.m.j.c.aci.AciCloud#canProvision: Cannot provision: template for label ztest is not available now, because it failed to provision last time.

2020-03-29T17:18:28.893307318Z: [ERROR] 2020-03-29 17:18:28.893+0000 [id=112245] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$0: Started DockerContainerWatchdog Asynchronous Periodic Work 2020-03-29T17:18:28.893558628Z: [ERROR] 2020-03-29 17:18:28.893+0000 [id=112245] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered 2020-03-29T17:18:28.893581829Z: [ERROR] 2020-03-29 17:18:28.893+0000 [id=112245] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 571, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms 2020-03-29T17:18:28.893631631Z: [ERROR] 2020-03-29 17:18:28.893+0000 [id=112245] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 2 nodes assigned to this Jenkins instance, which we will check 2020-03-29T17:18:28.893644931Z: [ERROR] 2020-03-29 17:18:28.893+0000 [id=112245] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed 2020-03-29T17:18:28.893813838Z: [ERROR] 2020-03-29 17:18:28.893+0000 [id=112245] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$0: Finished DockerContainerWatchdog Asynchronous Periodic Work. 0 ms

2020-03-29T17:18:34.526268101Z: [ERROR] 2020-03-29 17:18:34.525+0000 [id=96] INFO c.m.j.c.aci.AciCloud#provision: Start ACI container for label ztest workLoad 1 2020-03-29T17:18:34.526304302Z: [ERROR] 2020-03-29 17:18:34.526+0000 [id=96] INFO c.m.j.c.aci.AciCloud#provision: Using ACI Container template: ztest 2020-03-29T17:18:34.526324603Z: [ERROR] 2020-03-29 17:18:34.526+0000 [id=96] INFO h.s.NodeProvisioner$StandardStrategyImpl#apply: Started provisioning ztest from SaaSBld with 1 executors. Remaining excess workload: 0 2020-03-29T17:18:34.526693417Z: [ERROR] 2020-03-29 17:18:34.526+0000 [id=112036] INFO c.m.j.c.aci.AciCloud$1#call: Add ACI node: ztest-3qs59 2020-03-29T17:18:34.538040658Z: [ERROR] 2020-03-29 17:18:34.537+0000 [id=112036] INFO c.m.j.c.a.AciCleanTask$DeploymentRegistrar#registerDeployment: AzureAciCleanUpTask: registerDeployment: Registering deployment ztest-6rmxcrfl in saas-*****-01 2020-03-29T17:18:34.853962320Z: [ERROR] 2020-03-29 17:18:34.853+0000 [id=112036] INFO o.internal.platform.Platform#log: ALPN callback dropped: HTTP/2 is disabled. Is alpn-boot on the boot class path?

2020-03-29T17:18:35.578729453Z: [ERROR] 2020-03-29 17:18:35.578+0000 [id=112036] INFO c.m.j.c.aci.AciService#createDeployment: Waiting for deployment ztest-6rmxcrfl

ravee-rachakonda-by commented 4 years ago

Logs from the container before it dies out

TX1-XX:aci$ az container logs --name ztest-23n25 --resource-group XX-01 --subscription XX