jenkinsci / azure-vm-agents-plugin

This repository hosts the Azure VM Agents plugin for Jenkins. It is currently owned by the Azure DevOps CI/CD team.
https://plugins.jenkins.io/azure-vm-agents/

Windows VMs silently deallocate with pool retention and idle retention time set to 0 #330

Open tchrischan opened 2 years ago

tchrischan commented 2 years ago

Version report

Jenkins and plugins versions report:

Jenkins: 2.319.1
OS: Linux - 4.19.0-16-cloud-amd64
---
ace-editor:1.1
ant:1.13
antisamy-markup-formatter:2.1
apache-httpcomponents-client-4-api:4.5.13-1.0
authentication-tokens:1.4
authorize-project:1.4.0
azure-acs:1.0.4
azure-ad:185.v3b416408dcb1
azure-commons:1.1.3
azure-container-registry-tasks:0.6.5
azure-credentials:198.vf9c2fdfde55c
azure-sdk:70.v63f6a95999a7
azure-vm-agents:799.va4c741108611
basic-branch-build-strategies:1.3.2
blueocean:1.25.2
blueocean-autofavorite:1.2.4
blueocean-bitbucket-pipeline:1.25.2
blueocean-commons:1.25.2
blueocean-config:1.25.2
blueocean-core-js:1.25.2
blueocean-dashboard:1.25.2
blueocean-display-url:2.4.1
blueocean-events:1.25.2
blueocean-git-pipeline:1.25.2
blueocean-github-pipeline:1.25.2
blueocean-i18n:1.25.2
blueocean-jira:1.25.2
blueocean-jwt:1.25.2
blueocean-personalization:1.25.2
blueocean-pipeline-api-impl:1.25.2
blueocean-pipeline-editor:1.25.2
blueocean-pipeline-scm-api:1.25.2
blueocean-rest:1.25.2
blueocean-rest-impl:1.25.2
blueocean-web:1.25.2
bootstrap4-api:4.6.0-3
bootstrap5-api:5.1.3-3
bouncycastle-api:2.25
branch-api:2.7.0
build-timeout:1.20
caffeine-api:2.9.2-29.v717aac953ff3
checks-api:1.7.2
cloud-stats:0.27
cloudbees-bitbucket-branch-source:2.9.11
cloudbees-folder:6.16
cmakebuilder:4.1.1
cobertura:1.17
code-coverage-api:2.0.4
command-launcher:1.6
configuration-as-code:1.55
credentials:2.6.1
credentials-binding:1.27
data-tables-api:1.11.3-4
discard-old-build:1.05
display-url-api:2.3.5
docker-build-step:2.8
docker-commons:1.17
docker-java-api:3.1.5.2
docker-plugin:1.2.3
docker-workflow:1.26
durable-task:493.v195aefbb0ff2
echarts-api:5.2.2-1
email-ext:2.85
extended-read-permission:3.2
favorite:2.3.2
font-awesome-api:5.15.4-3
forensics-api:1.7.0
git:4.10.0
git-client:3.10.0
git-server:1.10
github:1.34.1
github-api:1.301-378.v9807bd746da5
github-branch-source:2.11.3
github-checks:1.0.13
github-oauth:0.35
github-pr-coverage-status:2.1.1
global-slack-notifier:1.5
google-oauth-plugin:1.0.6
gradle:1.36
handlebars:1.1.1
handy-uri-templates-2-api:2.1.8-1.0
htmlpublisher:1.25
influxdb:3.0.2
jackson2-api:2.13.0-230.v59243c64b0a5
jaxb:2.3.0.1
jdk-tool:1.4
jenkins-design-language:1.25.2
jira:3.1.3
jjwt-api:0.11.2-9.c8b45b8bb173
jobConfigHistory:2.28.1
jquery-detached:1.2.1
jquery3-api:3.6.0-2
jsch:0.1.55.2
junit:1.53
kubernetes:1.30.11
kubernetes-cd:2.3.1
kubernetes-client-api:5.4.1
kubernetes-credentials:0.9.0
ldap:2.3
llvm-cov:1.0.0
lockable-resources:2.10
mailer:1.34
mapdb-api:1.0.9.0
matrix-auth:2.6.9
matrix-project:1.19
mercurial:2.12
metrics:4.0.2.8
momentjs:1.1.1
multibranch-build-strategy-extension:1.0.10
oauth-credentials:0.5
okhttp-api:3.14.9
pam-auth:1.6
pipeline-build-step:2.15
pipeline-github-lib:1.0
pipeline-graph-analysis:1.12
pipeline-input-step:2.12
pipeline-milestone-step:1.3.2
pipeline-model-api:1.9.3
pipeline-model-definition:1.9.3
pipeline-model-extensions:1.9.3
pipeline-rest-api:2.19
pipeline-stage-step:2.5
pipeline-stage-tags-metadata:1.9.3
pipeline-stage-view:2.19
plain-credentials:1.7
plugin-usage-plugin:2.1
plugin-util-api:2.5.1
popper-api:1.16.1-2
popper2-api:2.10.2-1
pubsub-light:1.16
resource-disposer:0.16
scm-api:2.6.5
script-security:1.78
slack:2.49
snakeyaml-api:1.29.1
sse-gateway:1.24
ssh-agent:1.23
ssh-credentials:1.19
ssh-slaves:1.33.0
sshd:3.1.0
structs:308.v852b473a2b8c
subversion:2.15.1
timestamper:1.15
token-macro:266.v44a80cf277fd
trilead-api:1.0.13
variant:1.4
windows-slaves:1.7
workflow-aggregator:2.6
workflow-api:2.47
workflow-basic-steps:2.24
workflow-cps:2640.v00e79c8113de
workflow-cps-global-lib:548.v9085a486966a
workflow-durable-task-step:1101.vf832bc1ac745
workflow-job:2.42
workflow-multibranch:2.26
workflow-scm-step:2.13
workflow-step-api:2.24
workflow-support:3.8
ws-cleanup:0.39
Controller: Debian 10
Agents: Windows Server 2019 Datacenter (Azure VMs)

Reproduction steps

Results

Expected result:

At least 1 Windows VM should be available for scheduling at all times (the template limit is 3)

Actual result:

We have to manually "un-suspend" the Windows VM several times a day. This is a problem because there is no notification when the agent is suspended, and most of the team is several time zones away.

timja commented 2 years ago

Could you provide your configuration (redact whatever you need to), ideally as a configuration-as-code plugin export?

tchrischan commented 2 years ago

No, I don't know why the builtInImage for a Windows template is Ubuntu 20.04 LTS (or why it's not correct for any of my templates). You can ignore the init script: we tried to extend the OS disk, but I don't think that worked, and the same VM behavior occurred before that script was added earlier this week.

[...]
jenkins:
[...]
  clouds:
  - azureVM:
      azureCredentialsId: ***redacted***
      cloudName: ***redacted***
      configurationStatus: "pass"
      deploymentTimeout: 1200
      existingResourceGroupName: ***redacted***
      maxVirtualMachinesLimit: 20
      resourceGroupReferenceType: "existing"
      vmTemplates:
[...]
      - agentLaunchMethod: "SSH"
        agentWorkspace: "c:\\jenkins"
        builtInImage: "Ubuntu 20.04 LTS"
        credentialsId: ***redacted***
        diskType: "managed"
        doNotUseMachineIfInitFails: false
        executeInitScriptAsRoot: false
        existingStorageAccountName: ***redacted***
        imageReference:
          galleryImageDefinition: ***redacted***
          galleryImageVersion: ***redacted***
          galleryName: ***redacted***
          galleryResourceGroup: ***redacted***
          gallerySubscriptionId: ***redacted***
        imageTopLevelType: "advanced"
        initScript: "Resize-Partition -DriveLetter C -Size ((Get-PartitionSupportedSize\
          \ -DriveLetter C).SizeMax)"
        javaPath: "java"
        labels: "windows"
        location: "East US"
        maximumDeploymentSize: 3
        newStorageAccountName: ***redacted***
        noOfParallelJobs: 1
        osDiskSize: 300
        osDiskStorageAccountType: "StandardSSD_LRS"
        osType: "Windows"
        retentionStrategy:
          azureVMCloudPool:
            poolSize: 1
            retentionInHours: 0
        shutdownOnIdle: true
        storageAccountNameReferenceType: "existing"
        storageAccountType: "Standard_LRS"
        subnetName: ***redacted***
        templateDesc: "Windows 2019 Datacenter pre-loaded for ***redacted*** builds"
        templateName: ***redacted***
        usageMode: EXCLUSIVE
        usePrivateIP: true
        virtualMachineSize: "Standard_D4ds_v5"
        virtualNetworkName: ***redacted***
        virtualNetworkResourceGroupName: ***redacted***
[...]
timja commented 2 years ago

If you’re managing it with JCasC, you can remove builtInImage as of https://github.com/jenkinsci/azure-vm-agents-plugin/releases/tag/795.vd5903dae1139

It doesn’t cause any harm, though.
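
For illustration, here is a sketch of how the Windows template from the export above might look with builtInImage dropped. It uses only keys that appear in that export. The `"<redacted>"` strings are placeholders standing in for the redacted values, and every key not shown is assumed to stay unchanged.

```yaml
jenkins:
  clouds:
  - azureVM:
      vmTemplates:
      - agentLaunchMethod: "SSH"
        agentWorkspace: "c:\\jenkins"
        # builtInImage: "Ubuntu 20.04 LTS"  (dropped; the gallery image below is what the template references)
        imageReference:
          galleryImageDefinition: "<redacted>"  # placeholder
          galleryImageVersion: "<redacted>"     # placeholder
          galleryName: "<redacted>"             # placeholder
        imageTopLevelType: "advanced"
        osType: "Windows"
        # remaining keys as in the export above
```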

timja commented 2 years ago

Why don’t you disable shutdownOnIdle in the template?
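
As a sketch, that suggestion amounts to flipping a single key in the exported Windows template, with the pool retention settings left as reported. Only keys from the export above are shown. Whether this alone stops the silent deallocation is not established in this thread.

```yaml
jenkins:
  clouds:
  - azureVM:
      vmTemplates:
      - osType: "Windows"
        retentionStrategy:
          azureVMCloudPool:
            poolSize: 1
            retentionInHours: 0
        shutdownOnIdle: false  # was true in the export above
        # remaining keys as in the export above
```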

tchrischan commented 2 years ago

Because deleting the VM on idle instead means the local vcpkg cache is rebuilt on every run, so each run takes a very long time. That's what we're trying to avoid.