jenkinsci / openstack-cloud-plugin

Provision nodes from OpenStack on demand
https://plugins.jenkins.io/openstack-cloud
MIT License
47 stars 83 forks

2.64/2.65 keeps launching duplicate VMs until JNLP is connected #374

Open steveames opened 10 months ago

steveames commented 10 months ago

Jenkins and plugins versions report

Environment ```text Jenkins: 2.414.2 OS: Linux - 6.2.0-1013-aws Java: 11.0.20.1 - Ubuntu (OpenJDK 64-Bit Server VM) --- Parameterized-Remote-Trigger:3.2.0 PrioritySorter:5.0.0 active-directory:2.33 analysis-model-api:11.10.0 ant:497.v94e7d9fffa_b_9 antisamy-markup-formatter:162.v0e6ec0fcfcf6 apache-httpcomponents-client-4-api:4.5.14-208.v438351942757 artifactory:3.18.12 authentication-tokens:1.53.v1c90fd9191a_b_ aws-credentials:218.v1b_e9466ec5da_ aws-java-sdk:1.12.529-406.vdeff15e5817d aws-java-sdk-cloudformation:1.12.529-406.vdeff15e5817d aws-java-sdk-codebuild:1.12.529-406.vdeff15e5817d aws-java-sdk-ec2:1.12.529-406.vdeff15e5817d aws-java-sdk-ecr:1.12.529-406.vdeff15e5817d aws-java-sdk-ecs:1.12.529-406.vdeff15e5817d aws-java-sdk-efs:1.12.529-406.vdeff15e5817d aws-java-sdk-elasticbeanstalk:1.12.529-406.vdeff15e5817d aws-java-sdk-iam:1.12.529-406.vdeff15e5817d aws-java-sdk-kinesis:1.12.529-406.vdeff15e5817d aws-java-sdk-logs:1.12.529-406.vdeff15e5817d aws-java-sdk-minimal:1.12.529-406.vdeff15e5817d aws-java-sdk-secretsmanager:1.12.529-406.vdeff15e5817d aws-java-sdk-sns:1.12.529-406.vdeff15e5817d aws-java-sdk-sqs:1.12.529-406.vdeff15e5817d aws-java-sdk-ssm:1.12.529-406.vdeff15e5817d basic-branch-build-strategies:81.v05e333931c7d bitbucket:223.vd12f2bca5430 blueocean:1.27.8 blueocean-autofavorite:1.2.5 blueocean-bitbucket-pipeline:1.27.8 blueocean-commons:1.27.8 blueocean-config:1.27.8 blueocean-core-js:1.27.8 blueocean-dashboard:1.27.8 blueocean-display-url:2.4.2 blueocean-events:1.27.8 blueocean-git-pipeline:1.27.8 blueocean-github-pipeline:1.27.8 blueocean-i18n:1.27.8 blueocean-jira:1.27.8 blueocean-jwt:1.27.8 blueocean-personalization:1.27.8 blueocean-pipeline-api-impl:1.27.8 blueocean-pipeline-editor:1.27.8 blueocean-pipeline-scm-api:1.27.8 blueocean-rest:1.27.8 blueocean-rest-impl:1.27.8 blueocean-web:1.27.8 bootstrap5-api:5.3.2-1 bouncycastle-api:2.29 branch-api:2.1128.v717130d4f816 build-environment:1.7 build-monitor-plugin:1.14-745.ve2023a_305f40 
build-timeout:1.31 build-timestamp:1.0.3 build-user-vars-plugin:1.9 build-with-parameters:76.v9382db_f78962 buildtriggerbadge:251.vdf6ef853f3f5 caffeine-api:3.1.8-133.v17b_1ff2e0599 checks-api:2.0.2 cloud-stats:320.v96b_65297a_4b_b_ cloudbees-bitbucket-branch-source:848.v42c6a_317eda_e cloudbees-folder:6.848.ve3b_fd7839a_81 cobertura:1.17 code-coverage-api:4.9.0 command-launcher:107.v773860566e2e commons-lang3-api:3.13.0-62.v7d18e55f51e2 commons-text-api:1.10.0-78.v3e7b_ea_d5a_fe1 conditional-buildstep:1.4.3 config-file-provider:959.vcff671a_4518b_ console-badge:1.1 credentials:1293.vff276f713473 credentials-binding:636.v55f1275c7b_27 custom-tools-plugin:0.8 data-tables-api:1.13.6-5 display-url-api:2.200.vb_9327d658781 docker-commons:439.va_3cb_0a_6a_fb_29 docker-workflow:572.v950f58993843 durable-task:523.va_a_22cf15d5e0 ec2:1628.v6d7b_fc58b_a_1d echarts-api:5.4.0-6 email-ext:2.102 embeddable-build-status:412.v09da_db_1dee68 envinject-api:1.199.v3ce31253ed13 extended-choice-parameter:376.v2e02857547b_a_ extended-read-permission:53.v6499940139e5 external-monitor-job:215.v2e88e894db_f8 favorite:2.4.3 file-leak-detector:1.12 font-awesome-api:6.4.2-1 forensics-api:2.3.0 git:5.2.0 git-client:4.5.0 git-server:99.va_0826a_b_cdfa_d github:1.37.3 github-api:1.316-451.v15738eef3414 github-branch-source:1741.va_3028eb_9fd21 github-checks:554.vb_ee03a_000f65 gradle:2.8.2 groovy:457.v99900cb_85593 handy-uri-templates-2-api:2.1.8-22.v77d5b_75e6953 htmlpublisher:1.32 instance-identity:173.va_37c494ec4e5 ionicons-api:56.v1b_1c8c49374e ivy:2.5 jackson2-api:2.15.3-366.vfe8d1fa_f8c87 jacoco:3.3.5 jakarta-activation-api:2.0.1-3 jakarta-mail-api:2.0.1-3 javadoc:243.vb_b_503b_b_45537 javax-activation-api:1.2.0-6 javax-mail-api:1.6.2-9 jaxb:2.3.8-1 jdk-tool:73.vddf737284550 jenkins-design-language:1.27.8 jersey2-api:2.40-1 jira:3.11 jjwt-api:0.11.5-77.v646c772fddb_0 jnr-posix-api:3.1.18-1 job-restrictions:0.8 jquery:1.12.4-1 jquery3-api:3.7.1-1 jsch:0.2.8-65.v052c39de79b_2 
junit:1240.vf9529b_881428 junit-realtime-test-reporter:135.vf92a_7fe68b_15 label-linked-jobs:6.0.1 ldap:701.vf8619de9160a_ lockable-resources:1185.v0c528656ce04 mailer:463.vedf8358e006b_ mapdb-api:1.0.9-28.vf251ce40855d matrix-auth:3.2.1 matrix-project:818.v7eb_e657db_924 maven-plugin:3.23 mercurial:1260.vdfb_723cdcc81 metrics:4.2.18-442.v02e107157925 mina-sshd-api-common:2.10.0-69.v28e3e36d18eb_ mina-sshd-api-core:2.10.0-69.v28e3e36d18eb_ node-iterator-api:49.v58a_8b_35f8363 okhttp-api:4.11.0-157.v6852a_a_fa_ec11 openstack-cloud:2.64 p4:1.14.3 pam-auth:1.10 parallel-test-executor:418.v24f9a_141d726 parameterized-scheduler:255.v73827fcdf618 parameterized-trigger:2.46 pipeline-aws:1.43 pipeline-build-step:505.v5f0844d8d126 pipeline-graph-analysis:202.va_d268e64deb_3 pipeline-graph-view:202.v6da_a_9e590325 pipeline-groovy-lib:689.veec561a_dee13 pipeline-input-step:477.v339683a_8d55e pipeline-milestone-step:111.v449306f708b_7 pipeline-model-api:2.2144.v077a_d1928a_40 pipeline-model-definition:2.2144.v077a_d1928a_40 pipeline-model-extensions:2.2144.v077a_d1928a_40 pipeline-rest-api:2.33 pipeline-stage-step:305.ve96d0205c1c6 pipeline-stage-tags-metadata:2.2144.v077a_d1928a_40 pipeline-stage-view:2.33 pipeline-utility-steps:2.16.0 plain-credentials:143.v1b_df8b_d3b_e48 plugin-usage-plugin:4.2 plugin-util-api:3.6.0 prism-api:1.29.0-8 pubsub-light:1.17 random-string-parameter:1.0 resource-disposer:0.23 run-condition:1.7 saferestart:0.7 saml:4.429.v9a_781a_61f1da_ scm-api:676.v886669a_199a_a_ scoring-load-balancer:59.vf791549fa_989 script-security:1275.v23895f409fb_d skip-certificate-check:1.1 slack:684.v833089650554 snakeyaml-api:2.2-111.vc6598e30cc65 sonar:2.15 sse-gateway:1.26 ssh-agent:333.v878b_53c89511 ssh-credentials:308.ve4497b_ccd8f4 ssh-slaves:2.916.vd17b_43357ce4 sshd:3.312.v1c601b_c83b_0e structs:325.vcb_307d2a_2782 subversion:2.17.3 support-core:1356.vd0f980edfa_46 text-finder:1.26 timestamper:1.26 token-macro:384.vf35b_f26814ec trilead-api:2.84.v72119de229b_7 
variant:60.v7290fc0eb_b_cd versioncolumn:210.v94a_dca_868138 vsphere-cloud:2.27 warnings-ng:10.5.0 workflow-aggregator:596.v8c21c963d92d workflow-api:1283.v99c10937efcb_ workflow-basic-steps:1042.ve7b_140c4a_e0c workflow-cps:3802.vd42b_fcf00b_a_c workflow-durable-task-step:1289.v4d3e7b_01546b_ workflow-job:1348.v32a_a_f150910e workflow-multibranch:756.v891d88f2cd46 workflow-scm-step:415.v434365564324 workflow-step-api:639.v6eca_cd8c04a_a_ workflow-support:865.v43e78cc44e0d ws-cleanup:0.45 ```

What Operating System are you using (both controller, and any agents involved in the problem)?

The Jenkins controller runs on Ubuntu 22.04. I'm launching Windows 10 VMs that connect via JNLP. Number of executors is 1, retention time is 0, and the connection type is JNLP.

Reproduction steps

  1. Run a job that requires a node/agent that uses JNLP
  2. Watch as it launches a new node/agent every minute or so until the builder requirement is met
  3. Since retention time is 0, all launched VMs keep running until they are used. Setting retention time to something else (like 1) gets the VMs killed (after 10 minutes or so), but it also introduces the possibility that a node/agent gets reused, which I never want.

Expected Results

A single node/agent/VM gets launched and jenkins waits for it to connect.

Actual Results

A LOT of VMs get launched. On my system, Windows nodes take around 4 minutes to come up far enough to establish a JNLP connection. During this time multiple VMs get launched. The problem multiplies when several Windows nodes/agents are requested at once.
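To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes one extra launch per interval while the first VM is still booting, and a cap at the template's maximum instance count. The class name, method name, and the model itself are illustrative assumptions, not anything taken from the plugin.

```java
final class DuplicateLaunchEstimate {
    /**
     * Rough estimate of VMs launched before the first agents connect.
     * Assumes one launch per interval, per requested agent, for as long
     * as the boot takes, capped at maxInstances. Illustrative model only.
     */
    static int estimate(int requestedAgents, int bootMinutes,
                        int launchIntervalMinutes, int maxInstances) {
        if (launchIntervalMinutes <= 0) {
            throw new IllegalArgumentException("launchIntervalMinutes must be positive");
        }
        // Integer ceiling of bootMinutes / launchIntervalMinutes.
        int perRequest = (bootMinutes + launchIntervalMinutes - 1) / launchIntervalMinutes;
        return Math.min(maxInstances, requestedAgents * perRequest);
    }

    public static void main(String[] args) {
        // One agent requested, ~4 minute boot, a launch roughly every minute.
        System.out.println(estimate(1, 4, 1, 50));
        // Three agents requested at the same time.
        System.out.println(estimate(3, 4, 1, 50));
    }
}
```

Under these assumptions a single request yields about four VMs and three simultaneous requests yield about twelve, which is in line with the "problem multiplies" observation.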

Anything else?

This is new behavior in 2.64. I expect it is related to this commit: https://github.com/jenkinsci/openstack-cloud-plugin/commit/80b6780b178666ca707f98ba5c32e350857256e4

steveames commented 8 months ago

This behavior is present in 2.65 as well. I keep having to revert to 2.63, which is unfortunate, as I think this prevents updating the underlying Jenkins core since it removes prototype.js (I may have that backward).

mdonahoe-cisco commented 7 months ago

We're seeing the same issue. We're launching Windows nodes via the OpenStack cloud plugin plus a UserData script that initiates the JNLP connection. When the plugin first brings up a node in response to a pipeline job, Jenkins shows the node as offline, likely because the UserData script has not run yet and the JNLP connection has not been established. The plugin continues to spin up new nodes until max instances for the template is reached, or until the JNLP connection is established and an executor has been selected for the pipeline job.

In addition, on the node where the JNLP connection succeeds, the openstack plugin fails to remove the node once it is idle. It seems to ignore the retention time config in the template.

steveames commented 7 months ago

I have confirmed that this behavior is due to https://github.com/jenkinsci/openstack-cloud-plugin/commit/80b6780b178666ca707f98ba5c32e350857256e4 . I backed out the change to the isWaitingFor function and the plugin no longer launches a ton of extra VMs. The why is a little less clear. It should only return null if terminated is set to true and yet I'm pretty sure that's what's causing this. If it returns null then provisionSlave returns the slave rather than waiting for it to actually be ready-ish.

The commit message didn't make a lot of sense to me. I don't see any negatives to reverting it but there must have been a reason. More research and probably a better fix is required. However if you're willing to compile it yourself just back out the above change and this behavior goes away.
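For illustration, the difference between "return the slave immediately" and "wait until it is ready-ish" can be sketched as a bounded polling wait. This is a minimal sketch under assumptions, not the plugin's code: `awaitOnline`, its parameters, and the `BooleanSupplier` standing in for an "agent is connected" check are all hypothetical names.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

final class WaitForAgent {
    /**
     * Polls isOnline until it reports true or the timeout elapses.
     * Returns true if the agent came online in time, false otherwise.
     * Hypothetical helper; the real plugin logic may differ.
     */
    static boolean awaitOnline(BooleanSupplier isOnline, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!isOnline.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up: the agent never connected
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for callers
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Simulated agent that reports online on the third check.
        AtomicInteger checks = new AtomicInteger();
        boolean online = awaitOnline(() -> checks.incrementAndGet() >= 3,
                Duration.ofSeconds(1), Duration.ofMillis(10));
        System.out.println("online=" + online + " after " + checks.get() + " checks");
    }
}
```

If provisioning is reported complete only after a wait like this succeeds, the node is not counted as ready while it is still booting. Whether that matches the intent of the original commit is unclear from this thread.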

Side note... while looking for alternate causes I found this bit of code in plugin/src/main/java/jenkins/plugins/openstack/compute/JCloudsCloud.java around line 283. I'm likely missing something, but I can't see that this for loop is actually doing anything other than potentially adding to the queue, unnecessarily, multiple times. I took out the for loop and it didn't seem to cause any problems. YMMV.

            for (int i = 0; i < templateCapacity; i++) {
                int size = queue.size();
                if (size >= globalCapacity || size >= excessWorkload) return queue;

                queue.add(t);
            }
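
Read in isolation, the loop adds the same template to the queue up to `templateCapacity` times, stopping early once the queue size reaches `globalCapacity` or `excessWorkload`. The standalone sketch below reproduces only that loop so its effect can be observed. The class name, the `String` template stand-in, and the sample numbers are assumptions, and the surrounding plugin code is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

final class QueueFillDemo {
    /** Mirrors the quoted loop for a single template; all values are illustrative. */
    static List<String> fill(List<String> queue, String template,
                             int templateCapacity, int globalCapacity, int excessWorkload) {
        for (int i = 0; i < templateCapacity; i++) {
            int size = queue.size();
            if (size >= globalCapacity || size >= excessWorkload) return queue;

            queue.add(template);
        }
        return queue;
    }

    public static void main(String[] args) {
        // Three executors wanted, ample capacity: the template is queued three times.
        System.out.println(fill(new ArrayList<>(), "win10", 5, 10, 3));
        // One executor wanted: the loop adds a single entry, same as having no loop.
        System.out.println(fill(new ArrayList<>(), "win10", 5, 10, 1));
    }
}
```

If each queue entry stands for one node to provision (an assumption about the caller), removing the loop would make no visible difference when only one executor is requested, which may be why taking it out appeared harmless.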

steveames commented 7 months ago

> We're seeing the same issue. We're launching windows nodes via openstack cloud plugin + UserData script that initiates JNLP connection.

Hi @mdonahoe-cisco . Total side topic. Any chance you could share your UserData script (or barebones of it)? I have never managed to get that working and ended up creating a scheduled task on the VM that launches JNLP on startup. That is, obviously, very hard to maintain as it requires image updates. Would much rather use UserData if I could get it working! TIA.

mdonahoe-cisco commented 7 months ago

@steveames I think there are workarounds for a few different things in here, though I can't remember exactly which. The strangest thing is parsing the node name from the SLAVE_JNLP_URL so that we can pass it to hudson.remoting.jnlp.Main. A bit ugly, but I hope it helps.

rem cmd
@echo on
curl ${SLAVE_JAR_URL} -o C:\Users\cloudbase-init\agent.jar

REM GET THE NODE NAME FROM THE URL
REM e.g. https://example.com/jenkins/myNamespace/computer/windows-reg-test-7276/slave-agent.jnlp
setlocal EnableDelayedExpansion
set url=${SLAVE_JNLP_URL}

REM Replace '/' with ' ' and split into array
set i=0
for %%a in (%url:/= %) do (
    set /a i+=1
    set "part[!i!]=%%a"
)

REM Get the second-to-last element (the last one is slave-agent.jnlp)
set /a lastIndex=i-1

set "nodename=!part[%lastIndex%]!"

REM Output the results
REM e.g. windows-reg-test-7276
echo nodename: %nodename%

java -cp C:\Users\cloudbase-init\agent.jar hudson.remoting.jnlp.Main -url ${JENKINS_URL} -webSocket -workDir C:\jenkins_workspace -headless ${SLAVE_JNLP_SECRET} %nodename%
endlocal
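
The node-name parsing in the script above takes the second-to-last path segment of the JNLP URL. For readers who find the batch version hard to follow, here is the same idea as a small Java sketch. The class and method names are hypothetical, and the URL is the example from the script's comment.

```java
import java.net.URI;

final class NodeNameFromJnlpUrl {
    /**
     * Returns the second-to-last path segment of a JNLP URL,
     * e.g. ".../computer/NODE/slave-agent.jnlp" yields "NODE".
     */
    static String nodeName(String jnlpUrl) {
        String path = URI.create(jnlpUrl).getPath();
        if (path == null) {
            throw new IllegalArgumentException("URL has no path: " + jnlpUrl);
        }
        // A leading "/" produces an empty first element, so a usable path
        // needs at least three elements: "", node name, file name.
        String[] parts = path.split("/");
        if (parts.length < 3) {
            throw new IllegalArgumentException("URL path too short: " + jnlpUrl);
        }
        return parts[parts.length - 2];
    }

    public static void main(String[] args) {
        String url = "https://example.com/jenkins/myNamespace/computer/"
                + "windows-reg-test-7276/slave-agent.jnlp";
        System.out.println(nodeName(url));
    }
}
```

This only illustrates the parsing step; the UserData script itself still has to run as a batch or PowerShell script on the Windows VM.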