jenkinsci / azure-vm-agents-plugin

This repo is for azure vm agents plugin for jenkins. Azure devops CICD is the team which owns it for now
https://plugins.jenkins.io/azure-vm-agents/
43 stars 96 forks source link

VM failed to provision when use Pool Retention Strategy #225

Open timja opened 5 years ago

timja commented 5 years ago

When use pool retention strategy, most times VMs are failed to provision.

If use the same template settings, only change retention strategy to Idle retention strategy, then all vms can provisioned successfully(100%). 

Retention time in hour: 24

Pool Size: 1

Only a few times, VMs can be provisioned successfully with pool retention strategy.

Node ProvisioningActivity for Azure-Test/jenkins-…

java.lang.Exception: Node ProvisioningActivity for Azure-Test/jenkins-test/null (229415501) has lost. Mark as failure at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.cleanCloudStatistics(AzureVMAgentCleanUpTask.java:604) at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.clean(AzureVMAgentCleanUpTask.java:622) at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.access$200(AzureVMAgentCleanUpTask.java:73) at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask$3.call(AzureVMAgentCleanUpTask.java:632) at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask$3.call(AzureVMAgentCleanUpTask.java:629) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)

[JENKINS-56981] created by diwang

timja commented 5 years ago

Hi di wang, do you mean no VM could be created successfully with your settings? I just check it in my environment, it seems to work well for me. Your log is just for clean up action, could you please provide the reason for its failed deployment?

by jieshe

timja commented 5 years ago

The VM can not be created when use pool retention strategy as you can see in pic1. All of them failed with the same error as posted in the ticket above.

 

I also attached my settings pic2 to pic5. With this setting("idle retention strategy" in pic3), all VMs can created successfully triggered by label "m1".

 

However, if I change retention strategy to "pool retention strategy" (pic6), then it starts failed to create VMs. 

 

Please let me know if you need any additional data, or please suggest any instructions on how to troubleshoot. 

 

by diwang

timja commented 5 years ago

attached the errors that I just tried today.

Observe it for an hour now. It looks like it created VM first, but shut it down after 1 or 2 minutes. I don't know why it shut the new VM down as it suppose to keep the new VM to maintain the pool. Maybe because of above error? 

Anyway, eventually no VMs can be created after switch to pool retention strategy. 

 

by diwang

timja commented 5 years ago

Even though I can use pool retention strategy to provision an agent and successfully run a job. But its behavior is not what I excepted. It cannot maintain a healthy pool of agents for my further jobs. Need more investigation on this strategy. 

by jieshe

timja commented 5 years ago

Jie Shen do you have any update on this issue?

by diwang

timja commented 5 years ago

I have made a  PR  to fix one bug for pool retention strategy. But the your issue seems not related with it. I am afraid the advanced configuration parameters caused the provisioned agents into improper status. And this leads the clean up task to delete them.

by jieshe

timja commented 5 years ago

Are you able to reproduce the issue? Do you need any other information?

by diwang

timja commented 5 years ago

di wang are you able to build a SNAPSHOT version with the above PR checked in? If so, please verify it in your Jenkins. If not, you can install the 0.7.5 (https://repo.jenkins-ci.org/releases/org/jenkins-ci/plugins/azure-vm-agents/0.7.5/) version which does not contain the bug. Instruction at https://jenkins.io/doc/book/managing/plugins/#advanced-installation . The ProvisioningActivity error log seems to only affect cloud statics, so it will not prevent provisioning VMs.

by jieshe

timja commented 5 years ago

Built a snapshot plugin based on https://github.com/jenkinsci/azure-vm-agents-plugin/commit/82829b83854244c8e203e67fff7689056a8fd515

Upload the azure-vm-agents.hpi, then restart Jenkins. Check the new pic I uploaded. 

It is the same error. See new attached pic. 

From jenkins log, it looks like it thinks it provisioned successfully, but it actually didn't. The latest provision failed but no more attempts after it.  Check attached new pic.

 

Apr 30, 2019 4:17:09 PM com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate createDeployment

Apr 30, 2019 4:17:09 PM com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate createDeploymentINFO: AzureVMManagementServiceDelegate: createDeployment: Initializing deployment for agentTemplate test-vm

Apr 30, 2019 4:17:09 PM com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate createDeploymentINFO: AzureVMManagementServiceDelegate: createDeployment: Creating a new deployment test-vm0430161709407 with VM base name test-vm40794

Apr 30, 2019 4:17:10 PM com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate createDeploymentINFO: AzureVMManagementServiceDelegate: createDeployment: Use embedded deployment template (with managed) /referenceImageIDTemplateWithManagedDisk.json

Apr 30, 2019 4:17:10 PM com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask$DeploymentRegistrar registerDeploymentINFO: AzureVMAgentCleanUpTask: registerDeployment: Registering deployment test-vm0430161709407 in cardinal-jenkins-test

Apr 30, 2019 4:17:13 PM com.microsoft.azure.vmagent.AzureVMCloud createProvisionedAgentINFO: AzureVMCloud: createProvisionedAgent: Waiting for deployment test-vm0430161709407 to be completed

Apr 30, 2019 4:17:43 PM com.microsoft.aad.adal4j.AuthenticationAuthority doInstanceDiscoveryINFO: [Correlation ID: 90fe5df5-b8e3-403c-b4c8-5ad1e8b43a32] Instance discovery was successful

Apr 30, 2019 4:17:43 PM okhttp3.internal.platform.Platform logINFO: ALPN callback dropped: HTTP/2 is disabled. Is alpn-boot on the boot class path?

Apr 30, 2019 4:18:07 PM com.microsoft.azure.vmagent.AzureVMCloud createProvisionedAgentINFO: AzureVMCloud: createProvisionedAgent: Deployment test-vm0430161709407 not yet finished (Running): Microsoft.Compute/virtualMachines:test-vm407940 - waited 30 seconds

Apr 30, 2019 4:18:37 PM com.microsoft.aad.adal4j.AuthenticationAuthority doInstanceDiscoveryINFO: [Correlation ID: 22087788-162d-4386-b453-f33292ef2516] Instance discovery was successful

Apr 30, 2019 4:18:38 PM okhttp3.internal.platform.Platform logINFO: ALPN callback dropped: HTTP/2 is disabled. Is alpn-boot on the boot class path?

Apr 30, 2019 4:18:48 PM com.microsoft.azure.vmagent.AzureVMCloud createProvisionedAgentINFO: AzureVMCloud: createProvisionedAgent: VM available: test-vm407940

Apr 30, 2019 4:18:49 PM com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate parseResponseINFO: AzureVMManagementServiceDelegate: parseDeploymentResponse: found agent test-vm407940 OS type Linux number of executors 1

Apr 30, 2019 4:18:51 PM com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate setVirtualMachineDetailsINFO: The Azure agent doesn't have a public IP. Will use the private IP

Apr 30, 2019 4:18:51 PM com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate setVirtualMachineDetailsINFO: Azure agent details:nodeNametest-vm407940adminUserName=azure-jenkinsshutdownOnIdle=falseretentionTimeInMin=0labels=packer

Apr 30, 2019 4:18:51 PM com.microsoft.azure.vmagent.AzureVMCloud$3 callINFO: Azure Cloud: provision: Adding agent test-vm407940 to Jenkins nodes

Apr 30, 2019 4:18:51 PM com.microsoft.azure.vmagent.AzureVMAgent createComputerINFO: AzureVMAgent: createComputer: start for agent test-vm407940

Apr 30, 2019 4:18:51 PM com.microsoft.azure.vmagent.AzureVMCloudPoolRetentionStrategy startINFO: AzureVMCloudRetensionStrategy: start: azureComputer name test-vm407940

Apr 30, 2019 4:18:51 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher launchINFO: AzureVMAgentSSHLauncher: launch: launch method called for agent test-vm407940

Apr 30, 2019 4:18:51 PM com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate isVMAliveOrHealthyINFO: AzureVMManagementServiceDelegate: isVMAliveOrHealthy: status PowerState/running

Apr 30, 2019 4:18:51 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher connectToSshINFO: AzureVMAgentSSHLauncher: connectToSsh: start

Apr 30, 2019 4:18:51 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher getRemoteSessionINFO: AzureVMAgentSSHLauncher: getRemoteSession: getting remote session for user jenkins to host 10.10.0.42:22

Apr 30, 2019 4:18:52 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher getRemoteSessionINFO: AzureVMAgentSSHLauncher: getRemoteSession: Got remote session for user jenkins to host 10.10.0.42:22

Apr 30, 2019 4:18:52 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher connectToSshINFO: AzureVMAgentSSHLauncher: connectToSsh: Got remote connection

Apr 30, 2019 4:18:52 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher launchINFO: AzureVMAgentSSHLauncher: launch: checking for java runtime

Apr 30, 2019 4:18:52 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher executeRemoteCommandINFO: AzureVMAgentSSHLauncher: executeRemoteCommand: starting java fullversion

Apr 30, 2019 4:18:54 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher executeRemoteCommandINFO: AzureVMAgentSSHLauncher: executeRemoteCommand: executed successfully

Apr 30, 2019 4:18:54 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher launchINFO: AzureVMAgentSSHLauncher: launch: java runtime present, copying slave.jar to remote

Apr 30, 2019 4:18:54 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher copyFileToRemoteINFO: AzureVMAgentSSHLauncher: copyFileToRemote: Initiating file transfer to slave.jar

Apr 30, 2019 4:19:04 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher copyFileToRemoteINFO: AzureVMAgentSSHLauncher: copyFileToRemote: copied file Successfully to slave.jar

Apr 30, 2019 4:19:04 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher launchINFO: AzureVMAgentSSHLauncher: launch: launching agent: java  -jar slave.jar

Apr 30, 2019 4:19:04 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher launchINFO: AzureVMAgentSSHLauncher: launch: Connected successfully

Apr 30, 2019 4:19:13 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher launchINFO: AzureVMAgentSSHLauncher: launch: launched agent successfully

Apr 30, 2019 4:23:36 PM com.microsoft.rest.interceptors.LoggingInterceptor logINFO: <- HTTP FAILED: java.net.SocketException: Connection reset

Apr 30, 2019 4:23:36 PM okhttp3.internal.platform.Platform logINFO: ALPN callback dropped: HTTP/2 is disabled. Is alpn-boot on the boot class path?

 

by diwang

timja commented 5 years ago

Oops, please ignore my last comment. Even the same error appears, but the vm provisioned successfully this time. I saw it in the jenkins agent list.

However, Cloud Statistics report it failed and no further report of it(test-vm407940), like "Operation time" and etc. 

by diwang

timja commented 5 years ago

With more testing, I config the max vm nunmber is 4 which the pool size is 1.

I submit a branch of requests to trigger it. As it thinks test-vm407940 was failed, then it provisioned another 4 vm. So total is 5 now, more than the max vm number I configured.

by diwang

timja commented 5 years ago

Jie, do we have a fix for this? Jie Shen

by diwang

timja commented 5 years ago

di wang I have tested with vnet configuration(delete and provision several times), but I still get no luck to reproduce your error. I use the latest version on the dev branch for the configuration as code work done on the branch and it is should be a stable one with some fixes. And below is my configuration:

- azureVM:
azureCredentialsId: "imds"
cloudName: "jsdev"
configurationStatus: "pass"
deploymentTimeout: 1200
existingResourceGroupName: "Testing"
maxVirtualMachinesLimit: 10
newResourceGroupName: "jieshe-dev-agents"
resourceGroupReferenceType: "new"
vmTemplates:
- agentLaunchMethod: "SSH"
availabilityType:
availabilitySet: "avset"
builtInImage: "Windows Server 2016"
credentialsId: "agent_admin_account"
diskType: "managed"
doNotUseMachineIfInitFails: true
enableMSI: false
executeInitScriptAsRoot: true
existingStorageAccountName: "fsxvergehbetgzsvfg"
imageReference:
offer: "UbuntuServer"
publisher: "Canonical"
sku: "16.04-LTS"
version: "latest"
imageTopLevelType: "advanced"
initScript: "sudo add-apt-repository ppa:openjdk-r/ppa -y\nsudo apt-get -y\
\ update\nsudo apt-get install openjdk-8-jre openjdk-8-jre-headless openjdk-8-jdk\
\ -y"
installDocker: false
installGit: false
installMaven: false
labels: "advanced"
location: "East US"
newStorageAccountName: "asdfwefwef"
noOfParallelJobs: 1
osDiskSize: 0
osType: "Linux"
preInstallSsh: true
retentionStrategy:
azureVMCloudPool:
poolSize: 1
retentionInHours: 24
shutdownOnIdle: false
storageAccountNameReferenceType: "new"
storageAccountType: "Standard_LRS"
subnetName: "jenkins"
templateDisabled: false
templateName: "advanced"
usageMode: "Use this node as much as possible"
usePrivateIP: true
virtualMachineSize: "Standard_F2s"
virtualNetworkName: "jenkins-vnet"
virtualNetworkResourceGroupName: "jieshe-jenkins"

Do I miss anything in the configuration that make me unable to reproduce the error on my side?
 

by jieshe

timja commented 5 years ago

For provisioning more nodes than the limits, it is an existing issue GH-56972 caused by bad concurrency control. I will try to fix this later.

by jieshe