jenkinsci / azure-vm-agents-plugin

This repo is for azure vm agents plugin for jenkins. Azure devops CICD is the team which owns it for now
https://plugins.jenkins.io/azure-vm-agents/
45 stars 101 forks source link

Agents sometimes left as suspended after starting from shutdown state #223

Open timja opened 5 years ago

timja commented 5 years ago

When using idle retention strategy and setting shutting down only option, sometimes the agents may not be able to connect successfully. And it will stay as suspended all the time.

 

Some useful log from @Sander Bel  ,

 

2019-05-09 09:20:05.878+0000 [id=1060]  FINE    hudson.model.AsyncPeriodicWork$1#run: Started Azure VM Maintainer Pool Size
2019-05-09 09:20:05.879+0000 [id=1060]  FINE    hudson.model.AsyncPeriodicWork$1#run: Finished Azure VM Maintainer Pool Size. 1 ms
2019-05-09 09:22:05.560+0000 [id=1165]  FINE    hudson.model.AsyncPeriodicWork$1#run: Started Azure VM Agents Clean Task
2019-05-09 09:22:05.560+0000 [id=1165]  FINE    c.m.a.v.AzureVMAgentCleanUpTask#execute: AzureVMAgentCleanUpTask: execute: start
2019-05-09 09:22:05.561+0000 [id=1165]  FINE    c.m.a.v.AzureVMAgentCleanUpTask#execute: AzureVMAgentCleanUpTask: execute: Running clean with 5 minute timeout
2019-05-09 09:22:05.561+0000 [id=1166]  FINE    c.m.a.v.AzureVMAgentCleanUpTask#cleanVMs: AzureVMAgentCleanUpTask: cleanVMs: node ubuntu-test3c8d10 blocked to cleanup
2019-05-09 09:22:05.561+0000 [id=1166]  FINE    c.m.a.v.AzureVMAgentCleanUpTask#cleanDeployments: AzureVMAgentCleanUpTask: cleanDeployments: Cleaning deployments
2019-05-09 09:22:05.561+0000 [id=1166]  FINE    c.m.a.v.AzureVMAgentCleanUpTask#cleanDeployments: AzureVMAgentCleanUpTask: cleanDeployments: Checking deployment ubuntu-test0509110640988
2019-05-09 09:22:07.340+0000 [id=1166]  FINE    c.m.a.v.AzureVMAgentCleanUpTask#cleanDeployments: AzureVMAgentCleanUpTask: cleanDeployments: Deployment created on 5/9/19 11:08 AM
2019-05-09 09:22:07.340+0000 [id=1166]  FINE    c.m.a.v.AzureVMAgentCleanUpTask#cleanDeployments: AzureVMAgentCleanUpTask: cleanDeployments: Deployment newer than timeout, keeping
2019-05-09 09:22:07.340+0000 [id=1166]  FINE    c.m.a.v.AzureVMAgentCleanUpTask#cleanDeployments: AzureVMAgentCleanUpTask: cleanDeployments: Done cleaning deployments
2019-05-09 09:22:07.439+0000 [id=1165]  FINE    c.m.a.v.AzureVMAgentCleanUpTask#execute: AzureVMAgentCleanUpTask: execute: end
2019-05-09 09:22:07.439+0000 [id=1165]  FINE    hudson.model.AsyncPeriodicWork$1#run: Finished Azure VM Agents Clean Task. 1,879 ms
2019-05-09 09:22:54.407+0000 [id=104]   FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: Retrieving agent template with label mdb-sandbox-acs
2019-05-09 09:22:54.407+0000 [id=104]   FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: Found agent template ubuntu-test
2019-05-09 09:22:54.407+0000 [id=104]   FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: Found agent template test-jenkins-ubuntu
2019-05-09 09:22:54.407+0000 [id=104]   FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: Found agent template acostest
2019-05-09 09:23:13.239+0000 [id=103]   FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: Retrieving agent template with label chef
2019-05-09 09:23:13.239+0000 [id=103]   FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: Found agent template ubuntu-test
2019-05-09 09:23:13.239+0000 [id=103]   FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: ubuntu-test matches!
2019-05-09 09:23:13.239+0000 [id=103]   INFO    c.m.azure.vmagent.AzureVMCloud#provision: AzureVMCloud: provision: start for label chef workLoad 1
2019-05-09 09:23:13.240+0000 [id=103]   FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: Retrieving agent template with label chef
2019-05-09 09:23:13.240+0000 [id=103]   FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: Found agent template ubuntu-test
2019-05-09 09:23:13.240+0000 [id=103]   FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: ubuntu-test matches!
2019-05-09 09:23:13.240+0000 [id=103]   INFO    c.m.azure.vmagent.AzureVMCloud#provision: AzureVMCloud: provision: checking for node reuse options
2019-05-09 09:23:13.240+0000 [id=103]   INFO    c.m.azure.vmagent.AzureVMCloud#provision: AzureVMCloud: provision: agent computer eligible for reuse ubuntu-test3c8d10
2019-05-09 09:23:13.240+0000 [id=103]   INFO    c.m.a.v.AzureVMManagementServiceDelegate#virtualMachineExists: AzureVMManagementServiceDelegate: virtualMachineExists: check for ubuntu-test3c8d10
2019-05-09 09:23:14.168+0000 [id=103]   INFO    c.m.a.v.AzureVMManagementServiceDelegate#virtualMachineExists: AzureVMManagementServiceDelegate: virtualMachineExists: ubuntu-test3c8d10 exists
2019-05-09 09:23:14.169+0000 [id=103]   INFO    c.m.azure.vmagent.AzureVMCloud#provision: AzureVMCloud: provision: asynchronous provision finished, returning 1 planned node(s)
2019-05-09 09:23:14.169+0000 [id=930]   INFO    c.m.azure.vmagent.AzureVMCloud#lambda$provision$0: Found existing node, starting VM ubuntu-test3c8d10
2019-05-09 09:23:14.170+0000 [id=930]   INFO    c.m.a.v.AzureVMManagementServiceDelegate#startVirtualMachine: AzureVMManagementServiceDelegate: startVirtualMachine: ubuntu-test3c8d10
2019-05-09 09:24:03.370+0000 [id=930]   INFO    c.m.a.v.AzureVMManagementServiceDelegate#setVirtualMachineDetails: The Azure agent doesn't have a public IP. Will use the private IP
2019-05-09 09:24:03.370+0000 [id=930]   INFO    c.m.a.v.AzureVMManagementServiceDelegate#setVirtualMachineDetails: Azure agent details:
nodeNameubuntu-test3c8d10
adminUserName=jenkins_node
shutdownOnIdle=true
retentionTimeInMin=0
labels=chef
2019-05-09 09:24:03.370+0000 [id=1348]  INFO    c.m.a.v.r.AzureVMAgentSSHLauncher#launch: AzureVMAgentSSHLauncher: launch: launch method called for agent ubuntu-test3c8d10
2019-05-09 09:24:05.920+0000 [id=1348]  INFO    c.m.a.v.AzureVMManagementServiceDelegate#isVMAliveOrHealthy: AzureVMManagementServiceDelegate: isVMAliveOrHealthy: status PowerState/running
2019-05-09 09:24:05.921+0000 [id=1348]  INFO    c.m.a.v.r.AzureVMAgentSSHLauncher#connectToSsh: AzureVMAgentSSHLauncher: connectToSsh: start
2019-05-09 09:24:05.921+0000 [id=1348]  INFO    c.m.a.v.r.AzureVMAgentSSHLauncher#getRemoteSession: AzureVMAgentSSHLauncher: getRemoteSession: getting remote session for user jenkins to host 10.76.96.77:22
2019-05-09 09:24:37.558+0000 [id=1348]  INFO    c.m.a.v.r.AzureVMAgentSSHLauncher#getRemoteSession: AzureVMAgentSSHLauncher: getRemoteSession: Got remote session for user jenkins to host 10.76.96.77:22
2019-05-09 09:24:37.559+0000 [id=1348]  INFO    c.m.a.v.r.AzureVMAgentSSHLauncher#connectToSsh: AzureVMAgentSSHLauncher: connectToSsh: Got remote connection
2019-05-09 09:24:37.559+0000 [id=1348]  INFO    c.m.a.v.r.AzureVMAgentSSHLauncher#executeRemoteCommand: AzureVMAgentSSHLauncher: executeRemoteCommand: starting test -e ~/.azure-agent-init
2019-05-09 09:24:38.811+0000 [id=1348]  INFO    c.m.a.v.r.AzureVMAgentSSHLauncher#executeRemoteCommand: AzureVMAgentSSHLauncher: executeRemoteCommand: executed successfully
2019-05-09 09:24:38.811+0000 [id=1348]  INFO    c.m.a.v.r.AzureVMAgentSSHLauncher#launch: AzureVMAgentSSHLauncher: launch: checking for java runtime
2019-05-09 09:24:38.811+0000 [id=1348]  INFO    c.m.a.v.r.AzureVMAgentSSHLauncher#executeRemoteCommand: AzureVMAgentSSHLauncher: executeRemoteCommand: starting java -fullversion
2019-05-09 09:24:40.152+0000 [id=1348]  INFO    c.m.a.v.r.AzureVMAgentSSHLauncher#executeRemoteCommand: AzureVMAgentSSHLauncher: executeRemoteCommand: executed successfully
2019-05-09 09:24:40.152+0000 [id=1348]  INFO    c.m.a.v.r.AzureVMAgentSSHLauncher#launch: AzureVMAgentSSHLauncher: launch: java runtime present, copying slave.jar to remote
2019-05-09 09:24:40.154+0000 [id=1348]  INFO    c.m.a.v.r.AzureVMAgentSSHLauncher#copyFileToRemote: AzureVMAgentSSHLauncher: copyFileToRemote: Initiating file transfer to slave.jar
2019-05-09 09:24:51.720+0000 [id=1348]  INFO    c.m.a.v.r.AzureVMAgentSSHLauncher#copyFileToRemote: AzureVMAgentSSHLauncher: copyFileToRemote: copied file Successfully to slave.jar
2019-05-09 09:24:51.721+0000 [id=1348]  INFO    c.m.a.v.r.AzureVMAgentSSHLauncher#launch: AzureVMAgentSSHLauncher: launch: launching agent: java  -jar slave.jar
2019-05-09 09:24:51.742+0000 [id=1348]  INFO    c.m.a.v.r.AzureVMAgentSSHLauncher#launch: AzureVMAgentSSHLauncher: launch: Connected successfully
2019-05-09 09:24:56.983+0000 [id=97]    FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: Retrieving agent template with label chef2
2019-05-09 09:24:56.983+0000 [id=97]    FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: Found agent template ubuntu-test
2019-05-09 09:24:56.983+0000 [id=97]    FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: Found agent template test-jenkins-ubuntu
2019-05-09 09:24:56.983+0000 [id=97]    FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: test-jenkins-ubuntu matches!
2019-05-09 09:24:56.983+0000 [id=97]    FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: Retrieving agent template with label azure-acos-atlasco
2019-05-09 09:24:56.983+0000 [id=97]    FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: Found agent template ubuntu-test
2019-05-09 09:24:56.983+0000 [id=97]    FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: Found agent template test-jenkins-ubuntu
2019-05-09 09:24:56.983+0000 [id=97]    FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: Found agent template acostest
2019-05-09 09:24:56.983+0000 [id=97]    FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: acostest matches!
2019-05-09 09:24:56.983+0000 [id=97]    FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: Retrieving agent template with label mdb-sandbox-acs
2019-05-09 09:24:56.983+0000 [id=97]    FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: Found agent template ubuntu-test
2019-05-09 09:24:56.983+0000 [id=97]    FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: Found agent template test-jenkins-ubuntu
2019-05-09 09:24:56.983+0000 [id=97]    FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: Found agent template acostest
2019-05-09 09:25:05.878+0000 [id=1474]  FINE    hudson.model.AsyncPeriodicWork$1#run: Started Azure VM Maintainer Pool Size
2019-05-09 09:25:05.879+0000 [id=1474]  FINE    hudson.model.AsyncPeriodicWork$1#run: Finished Azure VM Maintainer Pool Size. 1 ms
2019-05-09 09:25:13.239+0000 [id=101]   FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: Retrieving agent template with label mdb-sandbox-acs
2019-05-09 09:25:13.239+0000 [id=101]   FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: Found agent template ubuntu-test
2019-05-09 09:25:13.239+0000 [id=101]   FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: Found agent template test-jenkins-ubuntu
2019-05-09 09:25:13.239+0000 [id=101]   FINE    c.m.azure.vmagent.AzureVMCloud#getAzureAgentTemplate: AzureVMCloud: getAzureAgentTemplate: Found agent template acostest
2019-05-09 09:27:05.560+0000 [id=1812]  FINE    hudson.model.AsyncPeriodicWork$1#run: Started Azure VM Agents Clean Task
2019-05-09 09:27:05.561+0000 [id=1812]  FINE    c.m.a.v.AzureVMAgentCleanUpTask#execute: AzureVMAgentCleanUpTask: execute: start
2019-05-09 09:27:05.561+0000 [id=1812]  FINE    c.m.a.v.AzureVMAgentCleanUpTask#execute: AzureVMAgentCleanUpTask: execute: Running clean with 5 minute timeout
2019-05-09 09:27:05.561+0000 [id=1813]  FINE    c.m.a.v.AzureVMAgentCleanUpTask#cleanDeployments: AzureVMAgentCleanUpTask: cleanDeployments: Cleaning deployments
2019-05-09 09:27:05.561+0000 [id=1813]  FINE    c.m.a.v.AzureVMAgentCleanUpTask#cleanDeployments: AzureVMAgentCleanUpTask: cleanDeployments: Checking deployment ubuntu-test0509110640988
2019-05-09 09:27:07.711+0000 [id=1813]  FINE    c.m.a.v.AzureVMAgentCleanUpTask#cleanDeployments: AzureVMAgentCleanUpTask: cleanDeployments: Deployment created on 5/9/19 11:08 AM
2019-05-09 09:27:07.711+0000 [id=1813]  FINE    c.m.a.v.AzureVMAgentCleanUpTask#cleanDeployments: AzureVMAgentCleanUpTask: cleanDeployments: Deployment newer than timeout, keeping
2019-05-09 09:27:07.712+0000 [id=1813]  FINE    c.m.a.v.AzureVMAgentCleanUpTask#cleanDeployments: AzureVMAgentCleanUpTask: cleanDeployments: Done cleaning deployments
2019-05-09 09:27:07.800+0000 [id=1812]  FINE    c.m.a.v.AzureVMAgentCleanUpTask#execute: AzureVMAgentCleanUpTask: execute: end
2019-05-09 09:27:07.800+0000 [id=1812]  FINE    hudson.model.AsyncPeriodicWork$1#run: Finished Azure VM Agents Clean Task. 2,240 ms
2019-05-09 09:28:46.078+0000 [id=101]   INFO    c.m.a.v.AzureVMCloudRetensionStrategy#check: AzureVMCloudRetensionStrategy: check: Idle timeout reached for agent: ubuntu-test3c8d10, action: shutdown
2019-05-09 09:28:46.079+0000 [id=1967]  INFO    c.m.a.v.AzureVMCloudRetensionStrategy$1#call: AzureVMCloudRetensionStrategy: going to idleTimeout

 

[JENKINS-57447] created by jieshe

timja commented 5 years ago

This issue seems to be caused by https://github.com/jenkinsci/azure-vm-agents-plugin/blob/d83f4153627fc6f64c166217bac59e27876ee984/src/main/java/com/microsoft/azure/vmagent/remote/AzureVMAgentSSHLauncher.java#L250 . This method seems to be hanged forever and the below logs do not be triggered. Need more investigation on this because this method will call Jenkins core code.

by jieshe

timja commented 5 years ago

Hi Jie, do you have an update?

by bel_sander

timja commented 5 years ago

My assumption here seems to be wrong. This issue seems not be hung on the setChannel method. Any help or suggestion here will be welcome.

by jieshe

timja commented 5 years ago

Jie Shen I have the issue you describe.

We have agents configured to shutdown after some idle time. When agent need to be reused, it come back online but never goes out suspended state. It appears from thread dump that it's blocked in https://github.com/jenkinsci/azure-vm-agents-plugin/blob/d83f4153627fc6f64c166217bac59e27876ee984/src/main/java/com/microsoft/azure/vmagent/remote/AzureVMAgentSSHLauncher.java#L250 calling https://github.com/jenkinsci/branch-api-plugin/blob/branch-api-2.5.4/src/main/java/jenkins/branch/WorkspaceLocatorImpl.java#L534:

 Thread 31073: (state = BLOCKED) 
 - jenkins.branch.WorkspaceLocatorImpl$Collector.onOnline(hudson.model.Computer, hudson.model.TaskListener) @bci=27, line=534 (Interpreted frame) 
 - hudson.slaves.SlaveComputer.setChannel(hudson.remoting.Channel, java.io.OutputStream, hudson.remoting.Channel$Listener) @bci=662, line=698 (Interpreted frame) 
 - hudson.slaves.SlaveComputer.setChannel(java.io.InputStream, java.io.OutputStream, java.io.OutputStream, hudson.remoting.Channel$Listener) @bci=82, line=432 (Interpreted frame) 
 - com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher.launch(hudson.slaves.SlaveComputer, hudson.model.TaskListener) @bci=977, line=250 (Interpreted frame) 
 - hudson.slaves.SlaveComputer$1.call() @bci=88, line=294 (Interpreted frame) 
 - jenkins.util.ContextResettingExecutorService$2.call() @bci=18, line=46 (Compiled frame) 
 - jenkins.security.ImpersonatingExecutorService$2.call() @bci=17, line=71 (Interpreted frame) 
 - java.util.concurrent.FutureTask.run() @bci=42, line=266 (Compiled frame) 
 - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1149 (Interpreted frame) 
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=624 (Interpreted frame) 
 - java.lang.Thread.run() @bci=11, line=748 (Interpreted frame) 

 

by michaelburtin