jenkinsci / hetzner-cloud-plugin

Hetzner cloud integration for Jenkins
https://plugins.jenkins.io/hetzner-cloud/
Apache License 2.0

Hetzner agents are (still) not cleaned up every time #56

Closed sandrinr closed 1 year ago

sandrinr commented 1 year ago

Jenkins and plugins versions report

Environment

```text
Jenkins: 2.361.1
OS: Linux - 5.4.0-105-generic
---
ace-editor:1.1
analysis-model-api:10.17.0
ansicolor:1.0.2
antisamy-markup-formatter:2.7
apache-httpcomponents-client-4-api:4.5.13-138.v4e7d9a_7b_a_e61
authentication-tokens:1.4
bootstrap5-api:5.2.1-2
bouncycastle-api:2.26
branch-api:2.1046.v0ca_37783ecc5
build-blocker-plugin:1.7.8
build-monitor-plugin:1.13+build.202205140447
caffeine-api:2.9.3-65.v6a_47d0f4d1fe
checks-api:1.7.5
cloud-stats:0.27
cloudbees-folder:6.758.vfd75d09eea_a_1
cobertura:1.17
code-coverage-api:3.1.0
command-launcher:90.v669d7ccb_7c31
copyartifact:1.47
credentials:1189.vf61b_a_5e2f62e
credentials-binding:523.vd859a_4b_122e6
dark-theme:245.vb_a_2b_b_010ea_96
data-tables-api:1.12.1-3
display-url-api:2.3.6
docker-commons:1.21
docker-workflow:521.v1a_a_dd2073b_2e
downstream-build-cache:1.7
durable-task:500.v8927d9fd99d8
echarts-api:5.3.3-1
extended-read-permission:3.2
font-awesome-api:6.2.0-3
forensics-api:1.16.0
git:4.12.1
git-client:3.12.0
git-server:99.va_0826a_b_cdfa_d
gitlab-api:5.0.1-78.v47a_45b_9f78b_7
gitlab-branch-source:640.v7101b_1c0def9
gitlab-plugin:1.5.35
handy-uri-templates-2-api:2.1.8-22.v77d5b_75e6953
hetzner-cloud:66.v5cb45cbd155e
instance-identity:116.vf8f487400980
ionicons-api:28.va_f3a_84439e5f
jackson2-api:2.13.3-285.vc03c0256d517
jacoco:3.3.2
jakarta-activation-api:2.0.1-1
jakarta-mail-api:2.0.1-1
javax-activation-api:1.2.0-4
javax-mail-api:1.6.2-7
jaxb:2.3.6-1
jdk-tool:55.v1b_32b_6ca_f9ca
jersey2-api:2.37-1
jquery3-api:3.6.1-1
jsch:0.1.55.61.va_e9ee26616e7
junit:1143.v8d9a_e3355270
lockable-resources:2.16
mailer:438.v02c7f0a_12fa_4
matrix-project:785.v06b_7f47b_c631
mina-sshd-api-common:2.9.1-44.v476733c11f82
mina-sshd-api-core:2.9.1-44.v476733c11f82
parameterized-trigger:2.45
pipeline-build-step:2.18
pipeline-groovy-lib:612.v84da_9c54906d
pipeline-input-step:451.vf1a_a_4f405289
pipeline-milestone-step:101.vd572fef9d926
pipeline-model-api:2.2114.v2654ca_721309
pipeline-model-definition:2.2114.v2654ca_721309
pipeline-model-extensions:2.2114.v2654ca_721309
pipeline-stage-step:296.v5f6908f017a_5
pipeline-stage-tags-metadata:2.2114.v2654ca_721309
pipeline-utility-steps:2.13.0
plain-credentials:139.ved2b_9cf7587b
plot:2.1.11
plugin-util-api:2.17.0
popper2-api:2.11.6-1
prism-api:1.29.0-1
role-strategy:561.v9846c7351a_41
saml:4.352.vb_722786ea_79d
scm-api:621.vda_a_b_055e58f7
script-security:1183.v774b_0b_0a_a_451
scriptler:3.5
snakeyaml-api:1.32-86.ve3f030a_75631
ssh-agent:295.v9ca_a_1c7cc3a_a_
ssh-credentials:305.v8f4381501156
ssh-slaves:2.846.v1b_70190624f5
sshd:3.249.v2dc2ea_416e33
structs:324.va_f5d6774f3a_d
theme-manager:1.5
timestamper:1.20
token-macro:308.v4f2b_ed62b_b_16
trilead-api:2.72.v2a_3236754f73
variant:59.vf075fe829ccb
warnings-ng:9.20.0
workflow-aggregator:590.v6a_d052e5a_a_b_5
workflow-api:1192.v2d0deb_19d212
workflow-basic-steps:994.vd57e3ca_46d24
workflow-cps:2801.vf82a_b_b_e3e8a_5
workflow-durable-task-step:1199.v02b_9244f8064
workflow-job:1239.v71b_b_a_124a_725
workflow-multibranch:716.vc692a_e52371b_
workflow-scm-step:400.v6b_89a_1317c9a_
workflow-step-api:639.v6eca_cd8c04a_a_
workflow-support:838.va_3a_087b_4055b
yet-another-build-visualizer:1.16
```

What Operating System are you using (both controller, and any agents involved in the problem)?

Ubuntu 20.04 as the host; Jenkins is running inside a Docker container using the image jenkins/jenkins:lts.

Reproduction steps

Unclear...

  1. Observe agents being created and deleted according to the configured shutdown policy
  2. Wait some time, maybe restart Jenkins, ...unclear
  3. Observe agents no longer being deleted, even though the configured shutdown policy would mandate it

Expected Results

Agents are always deleted according to the configured shutdown policy.

Actual Results

After a while, agents are no longer deleted and are essentially used as if they were static agents.

Anything else?

After manually deleting those stale agents in Jenkins, Jenkins creates new ones when there is demand, and there is a chance these new agents are actually cleaned up again according to the configured shutdown policy.
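
For reference, that manual cleanup can also be scripted from the Script Console. The following is only a sketch, not something provided by the plugin: it filters on the plugin's `HetznerServerComputer` class seen elsewhere in this issue and assumes the corresponding node is a `hudson.slaves.AbstractCloudSlave`, so `terminate()` removes it; try it on a single idle agent first.

```groovy
import jenkins.model.Jenkins
import hudson.slaves.AbstractCloudSlave
import cloud.dnation.jenkins.plugins.hetzner.HetznerServerComputer

// Collect idle computers provisioned by the Hetzner cloud plugin.
def stale = Jenkins.get().computers.findAll {
    it instanceof HetznerServerComputer && it.isIdle()
}

stale.each { computer ->
    def node = computer.node
    if (node instanceof AbstractCloudSlave) {
        println("Terminating ${computer.name}")
        // terminate() removes the Jenkins node; whether it also deletes the
        // server on the Hetzner side depends on the plugin's own cleanup code.
        node.terminate()
    }
}
```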

Example:

Consider the following screenshot from our Jenkins' cloud statistics:

(screenshot: Greenshot 2022-10-05 11 17 43)

In this screenshot you can see three areas: blue, red and purple.

Possibly important additional information:

[0]

```text
2022-10-04 22:28:49.554+0000 [id=108311] SEVERE hudson.plugins.plot.CSVSeries#loadSeries: Exception trying to retrieve series files
Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to hcloud-ttvgwgquximu2tex
        at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1784)
        at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:356)
        at hudson.remoting.Channel.call(Channel.java:1000)
        at hudson.FilePath.act(FilePath.java:1186)
        at hudson.FilePath.act(FilePath.java:1175)
        at hudson.FilePath.list(FilePath.java:2125)
        at hudson.FilePath.list(FilePath.java:2108)
        at hudson.FilePath.list(FilePath.java:2092)
        at hudson.plugins.plot.CSVSeries.loadSeries(CSVSeries.java:149)
        at hudson.plugins.plot.Plot.addBuild(Plot.java:720)
        at hudson.plugins.plot.PlotBuilder.perform(PlotBuilder.java:235)
        at jenkins.tasks.SimpleBuildStep.perform(SimpleBuildStep.java:123)
        at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:101)
        at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:71)
        at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
java.io.IOException: Expecting Ant GLOB pattern, but saw '/home/jenkins/workspace/vices_infrastructuretest_SA-8466/jenkins-sgapplianceInfrastructureTesttest.csv'. See https://ant.apache.org/manual/Types/fileset.html for syntax
        at hudson.FilePath.glob(FilePath.java:2162)
        at hudson.FilePath$ListGlob.invoke(FilePath.java:2143)
        at hudson.FilePath$ListGlob.invoke(FilePath.java:2128)
        at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3492)
        at hudson.remoting.UserRequest.perform(UserRequest.java:211)
        at hudson.remoting.UserRequest.perform(UserRequest.java:54)
        at hudson.remoting.Request$2.run(Request.java:376)
        at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
```
rkosegi commented 1 year ago

Hi @sandrinr, thank you for the bug report. I'm afraid this looks like something we have seen in the past and have no fix available for yet.

Can you please provide:

sandrinr commented 1 year ago

Hi @rkosegi, thanks for picking it up.

Now it happened again, again after a restart of the Jenkins Docker container, and again the first node(s) spawned did not get shut down when they should have been.

(screenshot)
The output of the script:

```
true
HetznerServerInfo(sshKeyDetail=SshKeyDetail(created=2022-03-23T06:14:25+00:00, fingerprint=31:b3:c1:4f:d4:6b:5f:97:b4:50:f1:1f:3e:4d:49:7c, labels={jenkins.io/managed-by=hetzner-jenkins-plugin, jenkins.io/credentials-id=..}, name=..., publicKey=...), serverDetail=ServerDetail(name=hcloud-jdawffnkmzmsg2l5, status=running, created=2022-10-09T03:48:11+00:00, publicNet=PublicNetDetail(), privateNet=[], serverType=ServerType(name=cpx41, description=CPX 41, cores=8, memory=16, disk=240), datacenter=DatacenterDetail(name=nbg1-dc3, description=Nuremberg 1 DC 3, location=LocationDetail(name=nbg1, description=Nuremberg DC Park 1, country=DE, city=Nuremberg)), image=ImageDetail(type=snapshot, status=available, name=null, description=Generated by jenkins_node_hetzner_cloud_image.jenkins)))
true
-------------
Result: [cloud.dnation.jenkins.plugins.hetzner.HetznerServerComputer@2ede2248]
```

As I see it, that node should be shut down (all conditions are met). But somehow, I think, the code that decides whether to shut the node down is never called.
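
One rough way to test that hypothesis from the Script Console is to invoke each Hetzner agent's retention strategy decision code by hand. The sketch below mirrors the property access used in the script further down in this issue; note that for a strategy other than `Always`, calling `check()` may genuinely shut the agent down, so this is for debugging only.

```groovy
import jenkins.model.Jenkins
import cloud.dnation.jenkins.plugins.hetzner.HetznerServerComputer

// Manually invoke the retention strategy's decision code for each Hetzner agent.
// check() returns the number of minutes until it wants to run again; a strategy
// that decides the agent should go will usually terminate it as a side effect.
Jenkins.get().computers
    .findAll { it instanceof HetznerServerComputer }
    .each { c ->
        def strategy = c.retentionStrategy
        println("${c.name}: ${strategy.getClass().name} -> check() returned ${strategy.check(c)}")
    }
```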

sandrinr commented 1 year ago

Adding another data point: the issue was happening for us all the time at this point. There was no automatic deletion of agents anymore, and restarts did not help either.

What I did now is go to the cloud settings, change the shutdown policy to timeout-based, save, change it back to billing-cycle-based, and save again. Now agents were automatically deleted as expected. My guess is that it will now work for a while until it stops working again.

sandrinr commented 1 year ago

@rkosegi I think I can share some updates.

Recently the issue came up again. I started to dig and made the following discovery.

I wanted to see the actual retention strategy active on the Hetzner cloud nodes. For that I ran the following script:

// List every computer provisioned by the Hetzner cloud plugin together with its active retention strategy
def hetznerComputers = Jenkins.instance.computers.findAll {it instanceof cloud.dnation.jenkins.plugins.hetzner.HetznerServerComputer}
hetznerComputers.each { println("${it.name}: ${it.retentionStrategy}") }

The output of the script during the time when nodes were not shut down correctly was:

hcloud-gdazcgia5at0q8dq: hudson.slaves.RetentionStrategy$Always@10fad782
hcloud-i6hclb7t28kxuabz: hudson.slaves.RetentionStrategy$Always@10fad782
hcloud-i9e1nplpjyl5eh3b: hudson.slaves.RetentionStrategy$Always@10fad782
hcloud-jjvddplmktenjx9h: hudson.slaves.RetentionStrategy$Always@10fad782
hcloud-qwlivqfkvtxulvpl: hudson.slaves.RetentionStrategy$Always@10fad782
hcloud-ul9dz5soavq4qnf5: hudson.slaves.RetentionStrategy$Always@10fad782

Also, when deleting all nodes manually and triggering a build that causes a new node to be created, the retention strategy is the same:

hcloud-rnkbnep9dcpisyyb: hudson.slaves.RetentionStrategy$Always@10fad782

Actually, even restarting Jenkins did not change this. After a restart and triggering a node creation it was the same:

hcloud-whjnixzynhf653qf: hudson.slaves.RetentionStrategy$Always@7f2c4f8b

I was able to fix the issue by changing a setting in the Hetzner cloud settings (in my case, I changed the maximum number of allowed nodes). After that, and without a Jenkins restart, triggering the creation of a new node gave it the retention policy I expected:

hcloud-j5pxincalzklmnox: cloud.dnation.jenkins.plugins.hetzner.shutdown.BeforeHourWrapsPolicy$RetentionStrategyImpl@5d46de5c

And also the node was shut down as expected by the retention strategy.

So the issue seems to be that after some point all newly created nodes get the retention policy Always assigned and as such are never shut down anymore.

I was unable to find out where this is coming from. Even when checking the main Jenkins config.xml, the configured shutdown policy was still the expected one:

<shutdownPolicy class="cloud.dnation.jenkins.plugins.hetzner.shutdown.BeforeHourWrapsPolicy"/>

Since the issue seems to persist even across restarts, this must be persisted somewhere...

Also, I have no idea what is causing the nodes to have the retention policy Always assigned in the first place.
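
A possible next step for narrowing this down would be to compare the shutdown policy held by the live cloud object in memory against the element persisted in config.xml. The following is only a sketch: the `HetznerCloud` class name and the `shutdownPolicy` property are assumptions inferred from the `<shutdownPolicy>` element above, so the property access may need adjusting.

```groovy
import jenkins.model.Jenkins

// Print the shutdown policy each Hetzner cloud object currently holds in memory,
// to compare against the <shutdownPolicy> element in config.xml.
// Class name and 'shutdownPolicy' property are assumptions based on that element.
Jenkins.get().clouds
    .findAll { it.getClass().name == 'cloud.dnation.jenkins.plugins.hetzner.HetznerCloud' }
    .each { cloud ->
        println("${cloud.name}: ${cloud.shutdownPolicy?.getClass()?.name}")
    }
```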

rkosegi commented 1 year ago

Hi @sandrinr, this is a very interesting finding. I will try to find some time soon to look into it.

rkosegi commented 1 year ago

@sandrinr I hope I fixed the root cause, but I want to emphasize that when you restart the Jenkins controller, the plugin assumes that no servers were running.

sandrinr commented 1 year ago

@rkosegi We have had the fix in operation for a while now and have never experienced the issue again. Thank you very much for the fix 🙏 .

rkosegi commented 1 year ago

@sandrinr That is great news, thanks for the confirmation.