jenkinsci / azure-vm-agents-plugin

This repo is for azure vm agents plugin for jenkins. Azure devops CICD is the team which owns it for now
https://plugins.jenkins.io/azure-vm-agents/

MaxDeploymentSize in templates is ignored and maxVirtualMachinesLimit is used as hard limit #344

Closed mkrzywanski closed 2 years ago

mkrzywanski commented 2 years ago

Jenkins and plugins versions report

Environment

```text
Jenkins: 2.322
OS: Linux - 5.4.0-1062-azure
---
ace-editor:1.1
apache-httpcomponents-client-4-api:4.5.13-1.0
azure-credentials:198.vf9c2fdfde55c
azure-keyvault:131.v867845ef6ae9
azure-sdk:70.v63f6a95999a7
azure-vm-agents:799.va4c741108611
bootstrap4-api:4.6.0-3
bootstrap5-api:5.1.3-3
bouncycastle-api:2.25
branch-api:2.7.0
build-timestamp:1.0.3
build-user-vars-plugin:1.8
caffeine-api:2.9.2-29.v717aac953ff3
checks-api:1.7.2
cloud-stats:0.27
cloudbees-folder:6.16
command-launcher:1.6
configuration-as-code:1.54
credentials:2.6.2
credentials-binding:1.27
display-url-api:2.3.5
durable-task:493.v195aefbb0ff2
echarts-api:5.2.2-1
extended-read-permission:3.2
font-awesome-api:5.15.4-3
git:4.10.0
git-client:3.10.0
git-server:1.10
handlebars:3.0.8
jackson2-api:2.13.0-230.v59243c64b0a5
jaxb:2.3.0
jdk-tool:1.5
jquery-detached:1.2.1
jquery3-api:3.6.0-2
jsch:0.1.55.2
junit:1.53
ldap:2.7
lockable-resources:2.12
mailer:1.34
matrix-auth:2.6.8
matrix-project:1.19
momentjs:1.1.1
pipeline-build-step:2.15
pipeline-graph-analysis:1.12
pipeline-input-step:2.12
pipeline-milestone-step:1.3.2
pipeline-model-api:1.9.3
pipeline-model-declarative-agent:1.1.1
pipeline-model-definition:1.9.3
pipeline-model-extensions:1.9.3
pipeline-rest-api:2.19
pipeline-stage-step:2.5
pipeline-stage-tags-metadata:1.9.3
pipeline-stage-view:2.19
plain-credentials:1.7
plugin-util-api:2.5.1
popper-api:1.16.1-2
popper2-api:2.10.2-1
rebuild:1.32
resource-disposer:0.16
role-strategy:3.2.0
scm-api:2.6.5
script-security:1.78
snakeyaml-api:1.29.1
ssh-credentials:1.19
sshd:3.1.0
structs:1.24
throttle-concurrents:2.5
timestamper:1.15
trilead-api:1.0.13
workflow-aggregator:2.6
workflow-api:2.47
workflow-basic-steps:2.24
workflow-cps:2633.v6baeedc13805
workflow-cps-global-lib:548.v9085a486966a
workflow-durable-task-step:1101.vf832bc1ac745
workflow-job:2.42
workflow-multibranch:2.26
workflow-scm-step:2.13
workflow-step-api:2.24
workflow-support:3.8
ws-cleanup:0.39
```

What Operating System are you using (both controller, and any agents involved in the problem)?

Jenkins in docker container

Reproduction steps

I have the plugin configured as follows :

clouds:
  - azureVM:
      azureCredentialsId: "xxx"
      cloudName: "azure"
      cloudTags:
      - name: "deployer"
        value: "xxx"
      configurationStatus: "pass"
      deploymentTimeout: 1200
      existingResourceGroupName: "xxx"
      maxVirtualMachinesLimit: 90
      resourceGroupReferenceType: "existing"
      vmTemplates:
      - agentLaunchMethod: "SSH"
        agentWorkspace: "/var/jenkins"
        builtInImage: "Ubuntu 20.04 LTS"
        credentialsId: "jenkins"
        diskType: "managed"
        doNotUseMachineIfInitFails: true
        executeInitScriptAsRoot: true
        existingStorageAccountName: "xxx"
        imageReference:
          id: "xxx"
        imageTopLevelType: "advanced"
        javaPath: "java"
        labels: "xxx"
        location: "West Europe"
        maximumDeploymentSize: 10
        noOfParallelJobs: 1
        osType: "Linux"
        retentionStrategy:
          azureVMCloudRetentionStrategy:
            idleTerminationMinutes: 10
        storageAccountNameReferenceType: "existing"
        storageAccountType: "Standard_LRS"
        subnetName: "xxx"
        tags:
        - name: "application"
          value: "xxx"
        templateDesc: "xxx"
        templateName: "xxx"
        usageMode: EXCLUSIVE
        usePrivateIP: true
        virtualMachineSize: "Standard_E2s_v3"
        virtualNetworkName: "xxx"
        virtualNetworkResourceGroupName: "xxx"

As we can see, maxVirtualMachinesLimit is set to 90 and maximumDeploymentSize is set to 10. However, maximumDeploymentSize seems to be ignored, and up to 90 machines are spun up with this configuration. I have more images configured this way, all with maximumDeploymentSize set to 10, and the option is ignored for all of them.

In the jenkins FINE logs I can see :

Mar 10, 2022 12:28:44 PM FINE com.microsoft.azure.vmagent.AzureVMCloud
Current estimated VM count: 90, quantity desired 2
Mar 10, 2022 12:28:44 PM INFO com.microsoft.azure.vmagent.AzureVMCloud provision
Not able to create 2 nodes, at or above maximum VM count of 90 and already 90 VM(s)

I checked the source code and found a check that never seems to be reached; its message does not appear in the logs: https://github.com/jenkinsci/azure-vm-agents-plugin/blob/a74208d4f7a1069427145e61b260372d8d6cd50c/src/main/java/com/microsoft/azure/vmagent/AzureVMCloud.java#L678

Expected Results

maximumDeploymentSize is not ignored and correctly limits the virtual machine amount per template.

Actual Results

maximumDeploymentSize is ignored and maxVirtualMachinesLimit is used as hard limit

Anything else?

No response

timja commented 2 years ago

You've set the limit to 90 in maxVirtualMachinesLimit and the log says you have 90 running?

maximumDeploymentSize is the number that will be deployed at a time.

If you check the deployment logs in azure you should see how many were deployed in each deployment
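
One way to read the two settings, as a minimal sketch rather than the plugin's actual code: maxVirtualMachinesLimit caps how many VMs may exist in total, while maximumDeploymentSize caps how many go into a single deployment. The class and method names below are hypothetical and exist only for this illustration.

```java
// Hypothetical illustration of the two settings described above; not the plugin's code.
final class DeploymentSizeSketch {

    /**
     * Returns how many VMs a single deployment would contain under this reading:
     * never more than the per-deployment batch size, and never more than the
     * headroom left under the overall VM limit.
     */
    static int nextDeploymentSize(int requested, int currentVmCount,
                                  int maxVirtualMachinesLimit, int maximumDeploymentSize) {
        int headroom = Math.max(0, maxVirtualMachinesLimit - currentVmCount);
        int batch = Math.min(headroom, maximumDeploymentSize);
        return Math.max(0, Math.min(requested, batch));
    }

    public static void main(String[] args) {
        // 25 requested, nothing running, limit 90, batch size 10: one deployment of 10.
        System.out.println(nextDeploymentSize(25, 0, 90, 10)); // prints 10
        // 2 requested with 90 already running and a limit of 90, as in the log above.
        System.out.println(nextDeploymentSize(2, 90, 90, 10)); // prints 0
    }
}
```

Under this reading, a maximumDeploymentSize of 10 never stops the cloud from reaching 90 VMs; it only means they arrive in deployments of at most 10.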

mkrzywanski commented 2 years ago

@timja Actually I thought that maximumDeploymentSize is used to control maximum number of template instances at given time. Now I have checked that you actually implemented something that is used to limit template instances and is called maxVirtualMachinesLimit, am I right? If so, I will update my plugin version and will give it a try.

timja commented 2 years ago

Yes correct

mkrzywanski commented 2 years ago

I have tried to use maxVirtualMachinesLimit to define per-template constraints. I think there is a problem: I set maxVirtualMachinesLimit to 90 at the cloud definition level and to 10 on each template. Now, regardless of the template, only 10 machines can be provisioned at a time for the entire cloud.

timja commented 2 years ago

That's the expected behaviour.

Cloud definition sets a global limit.

Template definition sets a per-template limit.

If you set 10 on each template then you won't get more than 10

You can use this feature to make sure you don't get 90 of one template and can't spawn the other one

mkrzywanski commented 2 years ago

So to make it clear for such setup :

In this situation only 10 machines can be spawned at a time regardless of the template type? If I want to run multiple jobs with different templates at the time only 10 machines overall can be spawned?

timja commented 2 years ago

Yes

mkrzywanski commented 2 years ago

To be honest, I do not know when to use such a feature. I thought that I could set the max agents limit to 30, for example, and then say that I want at most 10 instances of one template, without preventing other templates from spawning up to the overall limit of 30. So in the setup I have just shown I would have 30 machines at a time, 10 instances of each template, but it seems it does not work like that.

timja commented 2 years ago

> If I want to run multiple jobs with different templates at the time only 10 machines overall can be spawned?

Apologies, mis-understood. You should defo end up with more than 10 agents if you're using multiple templates, can you share your config please?

mkrzywanski commented 2 years ago

Jenkins version and plugins :

Jenkins: 2.338
OS: Linux - 5.4.0-1062-azure
---
ace-editor:1.1
antisamy-markup-formatter:2.7
apache-httpcomponents-client-4-api:4.5.13-1.0
azure-credentials:216.ve0b_4a_485ffc2
azure-keyvault:131.v867845ef6ae9
azure-sdk:106.v552de1e64d56
azure-vm-agents:808.v9d1999587120
bootstrap4-api:4.6.0-3
bootstrap5-api:5.1.3-6
bouncycastle-api:2.25
branch-api:2.7.0
build-timestamp:1.0.3
build-user-vars-plugin:1.8
caffeine-api:2.9.2-29.v717aac953ff3
checks-api:1.7.2
cloud-stats:0.27
cloudbees-folder:6.708.ve61636eb_65a_5
command-launcher:1.6
configuration-as-code:1414.v878271fc496f
credentials:1074.v60e6c29b_b_44b_
credentials-binding:1.27.1
display-url-api:2.3.5
durable-task:493.v195aefbb0ff2
echarts-api:5.3.0-2
extended-read-permission:3.2
font-awesome-api:6.0.0-1
git:4.10.3
git-client:3.11.0
git-server:1.10
handlebars:3.0.8
hidden-parameter:0.0.4
jackson2-api:2.13.2-260.v43d711474c77
javax-activation-api:1.2.0-2
javax-mail-api:1.6.2-5
jaxb:2.3.0.1
jdk-tool:1.5
jquery-detached:1.2.1
jquery3-api:3.6.0-2
jsch:0.1.55.2
junit:1.56
ldap:2.8
lockable-resources:2.14
mailer:408.vd726a_1130320
matrix-auth:2.6.8
matrix-project:758.v7a_ea_491852f3
momentjs:1.1.1
pipeline-build-step:2.16
pipeline-graph-analysis:188.v3a01e7973f2c
pipeline-input-step:446.vf27b_0b_83500e
pipeline-milestone-step:100.v60a_03cd446e1
pipeline-model-api:2.2064.v5eef7d0982b_e
pipeline-model-declarative-agent:1.1.1
pipeline-model-definition:2.2064.v5eef7d0982b_e
pipeline-model-extensions:2.2064.v5eef7d0982b_e
pipeline-rest-api:2.23
pipeline-stage-step:291.vf0a8a7aeeb50
pipeline-stage-tags-metadata:2.2064.v5eef7d0982b_e
pipeline-stage-view:2.23
plain-credentials:1.8
plugin-util-api:2.14.0
popper-api:1.16.1-2
popper2-api:2.11.2-1
rebuild:1.33
resource-disposer:0.17
role-strategy:3.2.0
scm-api:595.vd5a_df5eb_0e39
script-security:1140.vf967fb_efa_55a_
snakeyaml-api:1.29.1
ssh-credentials:1.19
sshd:3.1.0
structs:308.v852b473a2b8c
throttle-concurrents:2.6
timestamper:1.17
trilead-api:1.0.13
uno-choice:2.6.0
windows-slaves:1.8
workflow-aggregator:2.7
workflow-api:1143.v2d42f1e9dea_5
workflow-basic-steps:941.vdfe1b_a_132c64
workflow-cps:2660.vb_c0412dc4e6d
workflow-cps-global-lib:564.ve62a_4eb_b_e039
workflow-durable-task-step:1121.va_65b_d2701486
workflow-job:1174.vdcb_d054cf74a_
workflow-multibranch:711.vdfef37cda_816
workflow-scm-step:2.13
workflow-step-api:622.vb_8e7c15b_c95a_
workflow-support:815.vd60466279fc8
ws-cleanup:0.40

Cloud config :

clouds:
  - azureVM:
      azureCredentialsId: "credentials"
      cloudName: "azure"
      cloudTags:
      - name: "deployer"
        value: "test"
      configurationStatus: "pass"
      deploymentTimeout: 1200
      existingResourceGroupName: "rg"
      maxVirtualMachinesLimit: 30
      resourceGroupReferenceType: "existing"
      vmTemplates:
      - agentLaunchMethod: "SSH"
        agentWorkspace: "/var/jenkins"
        builtInImage: "Ubuntu 20.04 LTS"
        credentialsId: "jenkins"
        diskType: "managed"
        doNotUseMachineIfInitFails: true
        executeInitScriptAsRoot: true
        existingStorageAccountName: "storage"
        imageReference:
          id: "xx"
        imageTopLevelType: "advanced"
        javaPath: "java"
        labels: "template1"
        location: "West Europe"
        maxVirtualMachinesLimit: 10
        maximumDeploymentSize: 10
        noOfParallelJobs: 1
        osType: "Linux"
        retentionStrategy:
          azureVMCloudRetentionStrategy:
            idleTerminationMinutes: 10
        storageAccountNameReferenceType: "existing"
        storageAccountType: "Standard_LRS"
        subnetName: "xxx"
        tags:
        - name: "application"
          value: "xxx"
        templateDesc: "template1"
        templateName: "template1"
        usageMode: EXCLUSIVE
        usePrivateIP: true
        virtualMachineSize: "Standard_E2s_v3"
        virtualNetworkName: "xxx"
        virtualNetworkResourceGroupName: "xxx"
      - agentLaunchMethod: "SSH"
        agentWorkspace: "/var/jenkins"
        builtInImage: "Ubuntu 20.04 LTS"
        credentialsId: "jenkins"
        diskType: "managed"
        doNotUseMachineIfInitFails: true
        executeInitScriptAsRoot: true
        existingStorageAccountName: "storage"
        imageReference:
          id: "xxx"
        imageTopLevelType: "advanced"
        javaPath: "java"
        labels: "template2"
        location: "West Europe"
        maxVirtualMachinesLimit: 10
        maximumDeploymentSize: 10
        noOfParallelJobs: 2
        osType: "Linux"
        retentionStrategy:
          azureVMCloudRetentionStrategy:
            idleTerminationMinutes: 10
        storageAccountNameReferenceType: "existing"
        storageAccountType: "Standard_LRS"
        subnetName: "xxx"
        tags:
        - name: "application"
          value: "xxx"
        templateDesc: "template2"
        templateName: "template2"
        usageMode: EXCLUSIVE
        usePrivateIP: true
        virtualMachineSize: "Standard_B8ms"
        virtualNetworkName: "xxx"
        virtualNetworkResourceGroupName: "xxx"
      - agentLaunchMethod: "SSH"
        agentWorkspace: "/var/jenkins"
        builtInImage: "Ubuntu 20.04 LTS"
        credentialsId: "jenkins"
        diskType: "managed"
        doNotUseMachineIfInitFails: true
        executeInitScriptAsRoot: true
        existingStorageAccountName: "storage"
        imageReference:
          id: "xxx"
        imageTopLevelType: "advanced"
        javaPath: "java"
        labels: "template3"
        location: "West Europe"
        maxVirtualMachinesLimit: 10
        maximumDeploymentSize: 10
        noOfParallelJobs: 1
        osType: "Linux"
        retentionStrategy:
          azureVMCloudRetentionStrategy:
            idleTerminationMinutes: 10
        storageAccountNameReferenceType: "existing"
        storageAccountType: "Standard_LRS"
        subnetName: "xxx"
        tags:
        - name: "application"
          value: "xxx"
        templateDesc: "template3"
        templateName: "template3"
        usageMode: EXCLUSIVE
        usePrivateIP: true
        virtualMachineSize: "Standard_DS1_v2"
        virtualNetworkName: "xxx"
        virtualNetworkResourceGroupName: "xxx"

In this situation I get 10 machines max overall :

Maximum 10 machines are provisioned, and not 30. In the logs I can see :

Mar 16, 2022 10:22:10 AM FINE com.microsoft.azure.vmagent.AzureVMCloud
Current estimated VM count: 10, quantity desired 2
Mar 16, 2022 10:22:10 AM INFO com.microsoft.azure.vmagent.AzureVMCloud provision
Not able to create 2 nodes, at or above maximum VM count of 10 and already 10 VM(s)

mkrzywanski commented 2 years ago

@timja will you try to have a look at this as you implemented it recently? This feature is something we really need and right now we have to make workarounds with throttle plugin.

timja commented 2 years ago

It looks like this method: adjustVirtualMachineCount is not taking into account the current template count.

https://github.com/jenkinsci/azure-vm-agents-plugin/blob/master/src/main/java/com/microsoft/azure/vmagent/AzureVMCloud.java#L381

Only the 'max agent count' and the 'template limit'

If other templates have already used up as many VMs as the template limit allows, no more VMs will spawn, even though the total is still below the max limit.

I don't have time right now to write and test a fix, but I should hopefully be able to in the next couple of days.
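
To make the reported behaviour concrete, here is a simplified reconstruction of the check as it is described in this thread. It is not the plugin's adjustVirtualMachineCount source; the class and method names are hypothetical, and the only inputs are the numbers from the config and log above.

```java
// Simplified reconstruction of the behaviour described in this thread.
// This is NOT the plugin's adjustVirtualMachineCount; all names are hypothetical.
final class TemplateLimitBugSketch {

    /**
     * Reported behaviour: the smaller of the cloud limit and the template limit
     * is compared against the cloud-wide VM count, so VMs created from other
     * templates eat into this template's allowance.
     */
    static int allowedAsReported(int desired, int totalVmCount,
                                 int cloudLimit, int templateLimit) {
        int effectiveLimit = Math.min(cloudLimit, templateLimit);
        return Math.max(0, Math.min(desired, effectiveLimit - totalVmCount));
    }

    public static void main(String[] args) {
        // Cloud limit 30, template limit 10, 10 VMs already running, 2 desired.
        System.out.println(allowedAsReported(2, 10, 30, 10)); // prints 0
        // With fewer VMs running overall, the same request would go through.
        System.out.println(allowedAsReported(2, 4, 30, 10));  // prints 2
    }
}
```

The first call matches the "Not able to create 2 nodes" log line: the count of 10 reaches the template limit of 10 no matter which templates those VMs came from.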

pjdarton commented 2 years ago

@timja We've (just this morning) come to exactly the same conclusion - the adjustVirtualMachineCount method is comparing the templateLimit against the "total of all VMs in the Azure resource group". e.g. we have lots (over 10) of templates, each with a small (e.g. 5) templateLimit set, and the moment the total number of Azure VMs in play exceeds those limits, nothing new is provisioned and our developers start complaining that their builds aren't running.

What should happen is that the template's limit should be compared against the number of VMs made from that template rather than the total number in the cloud.

FYI (several years ago) I encountered much the same issue with the docker-plugin and I solved that using labels - I made the plugin label every docker container it made with both a "it came from Jenkins" label and a "it came from template" label so that, when the plugin later went counting containers, it could (a) ignore stuff that it wasn't responsible for and (b) figure out a per-template count. I think that this plugin suffers from much the same problems as the docker-plugin had ... but the same kind of solution may be able to be used here - it looks like AzureVMManagementServiceDelegate's getVirtualMachineCount method can filter by tags so all it needs to do is "just" differentiate between templates using tags too. It looks like the plugin already adds all the tags it'd need too!
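
A minimal sketch of the tag-based counting idea, assuming each VM is represented only by its tag map. The tag keys below are made up for illustration and are not the plugin's real tag names; the actual value behind Constants.AZURE_TEMPLATE_TAG_NAME is not shown in this thread.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: counts VMs per template from their tags.
final class TagCountSketch {
    // Hypothetical tag keys; the plugin's real tag names may differ.
    static final String OWNER_TAG = "created-by";
    static final String TEMPLATE_TAG = "template-name";

    /** Counts VMs per template, ignoring VMs that this owner did not create. */
    static Map<String, Integer> countByTemplate(List<Map<String, String>> vmTags,
                                                String ownerValue) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> tags : vmTags) {
            if (!ownerValue.equals(tags.get(OWNER_TAG))) {
                continue; // (a) ignore VMs we are not responsible for
            }
            String template = tags.get(TEMPLATE_TAG);
            if (template != null) {
                counts.merge(template, 1, Integer::sum); // (b) per-template count
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map<String, String>> vms = List.of(
                Map.of(OWNER_TAG, "jenkins", TEMPLATE_TAG, "template1"),
                Map.of(OWNER_TAG, "jenkins", TEMPLATE_TAG, "template1"),
                Map.of(OWNER_TAG, "jenkins", TEMPLATE_TAG, "template2"),
                Map.of(OWNER_TAG, "someone-else", TEMPLATE_TAG, "template1"));
        System.out.println(countByTemplate(vms, "jenkins")); // prints {template1=2, template2=1}
    }
}
```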

What I'd suggest is:

  1. Replace AzureVMCloud's existing field int approximateVirtualMachineCount with Map<String, Integer> approximateVirtualMachineCountsByTemplate.
  2. Rewrite AzureVMCloud's getApproximateVirtualMachineCount() method to sum all those Integers.
  3. Add a new method alongside called getApproximateVirtualMachineCountForTemplate(String templateName).
  4. Split AzureVMManagementServiceDelegate's getVirtualMachineCount into two, with a new method that returns a Map<String, Integer> where the index string is the Constants.AZURE_TEMPLATE_TAG_NAME tag, and make the old method call the new one and sum all the Integers (unless there's nothing else that calls the old method).
  5. Have the AzureVMCloudVerificationTask set approximateVirtualMachineCountsByTemplate periodically.
  6. Have adjustVirtualMachineCount take into account BOTH the per-template limit (compared against getApproximateVirtualMachineCountForTemplate(templateName)) AND the cloud max limit (compared against getApproximateVirtualMachineCount()).
  7. Improve the logging so that we see every decision and the numbers involved in that decision.
  8. ...and also include the template name in that logging so we can tell which template it's talking about too.
  9. Refactor to make this logic unit-testable and unit-test the logic with some basic scenarios so we can be sure the maths & logic is correct :wink:

...and feel free to steal/be-inspired-by code in the docker-plugin - the license is permissive.
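
Steps 1 to 3 and 6 above can be sketched as follows. The field and method names follow the suggestion, but the class itself, its constructor, and setCount (a stand-in for whatever the verification task in step 5 would do) are hypothetical; this is an illustration of the proposal, not code from the plugin.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of steps 1-3 and 6 of the suggestion above; not plugin code.
final class PerTemplateCountSketch {
    // Step 1: per-template counts instead of a single int.
    private final Map<String, Integer> approximateVirtualMachineCountsByTemplate = new HashMap<>();
    private final int maxVirtualMachinesLimit;

    PerTemplateCountSketch(int maxVirtualMachinesLimit) {
        this.maxVirtualMachinesLimit = maxVirtualMachinesLimit;
    }

    /** Hypothetical stand-in for the periodic update described in step 5. */
    void setCount(String templateName, int count) {
        approximateVirtualMachineCountsByTemplate.put(templateName, count);
    }

    /** Step 2: the cloud-wide count is the sum of the per-template counts. */
    int getApproximateVirtualMachineCount() {
        return approximateVirtualMachineCountsByTemplate.values().stream()
                .mapToInt(Integer::intValue).sum();
    }

    /** Step 3: the count for one template, zero if none are known. */
    int getApproximateVirtualMachineCountForTemplate(String templateName) {
        return approximateVirtualMachineCountsByTemplate.getOrDefault(templateName, 0);
    }

    /** Step 6: respect both the per-template limit and the cloud-wide limit. */
    int adjustVirtualMachineCount(String templateName, int templateLimit, int desired) {
        int cloudHeadroom = maxVirtualMachinesLimit - getApproximateVirtualMachineCount();
        int templateHeadroom = templateLimit
                - getApproximateVirtualMachineCountForTemplate(templateName);
        return Math.max(0, Math.min(desired, Math.min(cloudHeadroom, templateHeadroom)));
    }

    public static void main(String[] args) {
        PerTemplateCountSketch cloud = new PerTemplateCountSketch(30);
        cloud.setCount("template1", 10); // template1 is at its own limit of 10
        System.out.println(cloud.adjustVirtualMachineCount("template1", 10, 2)); // prints 0
        System.out.println(cloud.adjustVirtualMachineCount("template2", 10, 2)); // prints 2
    }
}
```

Keeping the decision in one small method that takes plain numbers also covers step 9, since it can be unit-tested without touching Azure.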

timja commented 2 years ago

Thanks, I forgot about this issue, I have a few other in flight pieces of work but it’s in my backlog.

Contributions are very much welcome though