jenkinsci / hetzner-cloud-plugin

Hetzner cloud integration for Jenkins
https://plugins.jenkins.io/hetzner-cloud/
Apache License 2.0
23 stars 8 forks source link

Remove idle servers just before billing hour completion does not always work, servers stay around forever #48

Closed sandrinr closed 1 year ago

sandrinr commented 1 year ago

Jenkins and plugins versions report

Environment ```text Jenkins: 2.346.2 OS: Linux - 5.4.0-66-generic --- ace-editor:1.1 analysis-model-api:10.13.0 ansicolor:1.0.2 ant:475.vf34069fef73c antisamy-markup-formatter:2.7 apache-httpcomponents-client-4-api:4.5.13-138.v4e7d9a_7b_a_e61 authentication-tokens:1.4 blueocean:1.25.5 blueocean-autofavorite:1.2.5 blueocean-bitbucket-pipeline:1.25.5 blueocean-commons:1.25.5 blueocean-config:1.25.5 blueocean-core-js:1.25.5 blueocean-dashboard:1.25.5 blueocean-display-url:2.4.1 blueocean-events:1.25.5 blueocean-git-pipeline:1.25.5 blueocean-github-pipeline:1.25.5 blueocean-i18n:1.25.5 blueocean-jira:1.25.5 blueocean-jwt:1.25.5 blueocean-personalization:1.25.5 blueocean-pipeline-api-impl:1.25.5 blueocean-pipeline-editor:1.25.5 blueocean-pipeline-scm-api:1.25.5 blueocean-rest:1.25.5 blueocean-rest-impl:1.25.5 blueocean-web:1.25.5 bootstrap4-api:4.6.0-5 bootstrap5-api:5.1.3-7 bouncycastle-api:2.26 branch-api:2.1046.v0ca_37783ecc5 build-blocker-plugin:1.7.8 build-monitor-plugin:1.13+build.202205140447 build-timeout:1.21 caffeine-api:2.9.3-65.v6a_47d0f4d1fe categorized-view:1.12 checks-api:1.7.4 claim:2.18.2 cloud-stats:0.27 cloudbees-bitbucket-branch-source:784.v7fcdc7c670f6 cloudbees-folder:6.736.v5f554b_b_a_52b_3 cobertura:1.17 code-coverage-api:3.0.4 command-launcher:84.v4a_97f2027398 conditional-buildstep:1.4.2 config-file-provider:3.11.1 copyartifact:1.46.4 credentials:1139.veb_9579fca_33b_ credentials-binding:523.vd859a_4b_122e6 dark-theme:185.v276b_5a_8966a_e dashboard-view:2.447.vda_124dd35f11 data-tables-api:1.12.1-2 display-url-api:2.3.6 docker-commons:1.19 docker-compose-build-step:1.0 docker-custom-build-environment:1.7.3 docker-java-api:3.2.13-37.vf3411c9828b9 docker-plugin:1.2.9 docker-slaves:1.0.7 docker-workflow:1.29 downstream-build-cache:1.7 durable-task:496.va67c6f9eefa7 echarts-api:5.3.3-1 email-ext:2.91 envinject-api:1.199.v3ce31253ed13 extended-read-permission:3.2 external-monitor-job:192.ve979ca_8b_3ccd extra-columns:1.25 favorite:2.4.1 font-awesome-api:6.1.1-1 forensics-api:1.15.1 git:4.11.4 git-client:3.11.1 git-server:1.11 github:1.34.5 github-api:1.303-400.v35c2d8258028 github-branch-source:1677.v731f745ea_0cf gitlab-api:5.0.1-78.v47a_45b_9f78b_7 gitlab-branch-source:633.ved9984f943da_ gitlab-plugin:1.5.35 gradle:1.39.4 greenballs:1.15.1 handlebars:3.0.8 handy-uri-templates-2-api:2.1.8-22.v77d5b_75e6953 heavy-job:1.1 hetzner-cloud:60.vf7c9ee657e67 htmlpublisher:1.30 jackson2-api:2.13.3-285.vc03c0256d517 jacoco:3.3.2 jakarta-activation-api:2.0.0-3 jakarta-mail-api:2.0.0-6 javadoc:217.v905b_86277a_2a_ javax-activation-api:1.2.0-4 javax-mail-api:1.6.2-7 jaxb:2.3.6-1 jdk-tool:1.5 jenkins-design-language:1.25.5 jersey2-api:2.36-2 jira:3.7.1 jjwt-api:0.11.5-77.v646c772fddb_0 jnr-posix-api:3.1.15-1 job-import-plugin:3.5 jobConfigHistory:1163.ve82c7c6e60a_3 jquery:1.12.4-1 jquery-detached:1.2.1 jquery3-api:3.6.0-4 jsch:0.1.55.2 junit:1119.1121.vc43d0fc45561 ldap:2.11 lockable-resources:2.16 mailer:435.v79ef3972b_5c7 mapdb-api:1.0.9.0 matrix-auth:3.1.5 matrix-project:772.v494f19991984 maven-plugin:3.19 metrics:4.2.10-389.v93143621b_050 mina-sshd-api-common:2.8.0-36.v8e25ce90d4b_1 mina-sshd-api-core:2.8.0-36.v8e25ce90d4b_1 momentjs:1.1.1 naginator:1.18.1 nodelabelparameter:1.11.0 okhttp-api:4.9.3-105.vb96869f8ac3a pam-auth:1.8 parameterized-scheduler:1.0 parameterized-trigger:2.45 pipeline-build-step:2.18 pipeline-graph-analysis:195.v5812d95a_a_2f9 pipeline-groovy-lib:598.vcd66b_a_336510 pipeline-input-step:449.v77f0e8b_845c4 pipeline-milestone-step:101.vd572fef9d926 pipeline-model-api:2.2114.v2654ca_721309 pipeline-model-definition:2.2114.v2654ca_721309 pipeline-model-extensions:2.2114.v2654ca_721309 pipeline-stage-step:293.v200037eefcd5 pipeline-stage-tags-metadata:2.2114.v2654ca_721309 pipeline-utility-steps:2.13.0 plain-credentials:1.8 plot:2.1.10 plugin-util-api:2.17.0 popper-api:1.16.1-3 popper2-api:2.11.5-2 postbuildscript:3.1.0-375.v3db_cd92485e1 prism-api:1.28.0-2 promoted-builds:878.v12d3f7937690 publish-over:0.22 pubsub-light:1.16 python:1.3 resource-disposer:0.19 run-condition:1.5 scm-api:616.ve67136f6c77d script-security:1175.v4b_d517d6db_f0 scriptler:3.5 slack:616.v03b_1e98d13dd snakeyaml-api:1.30.2-76.vc104f7ce9870 sse-gateway:1.25 ssh-agent:295.v9ca_a_1c7cc3a_a_ ssh-credentials:295.vced876c18eb_4 ssh-slaves:1.834.v622da_57f702c sshd:3.242.va_db_9da_b_26a_c3 structs:324.va_f5d6774f3a_d subversion:2.15.5 text-finder:1.19 theme-manager:1.4 thinBackup:1.10 throttle-concurrents:2.8 timestamper:1.18 token-macro:308.v4f2b_ed62b_b_16 translation:1.16 trilead-api:1.67.vc3938a_35172f variant:59.vf075fe829ccb view-job-filters:2.3 warnings-ng:9.14.0 windows-slaves:1.8.1 workflow-api:1188.v0016b_4f29881 workflow-basic-steps:986.v6b_9c830a_6b_37 workflow-cps:2759.v87459c4eea_ca_ workflow-durable-task-step:1199.v02b_9244f8064 workflow-job:1207.ve6191ff089f8 workflow-multibranch:716.vc692a_e52371b_ workflow-scm-step:400.v6b_89a_1317c9a_ workflow-step-api:639.v6eca_cd8c04a_a_ workflow-support:833.va_1c71061486b_ ws-cleanup:0.42 yet-another-build-visualizer:1.16 ```

What Operating System are you using (both controller, and any agents involved in the problem)?

Controller: Ubuntu 20.04 Agent: Ubuntu 20.04

Reproduction steps

I don't know how to reproduce the problem. It does not happen in a robust way.

We have two server templates configured. Both have their shutdown policies set to "Removes idle server just before current hour of billing cycle completes".

Expected Results

All the servers spawned by Jenkins should be torn down if they are idle around the time when the billing cycle completes.

Actual Results

Servers stay around forever. It seems to happen either to all current servers or none of them. I don't think is is related to Jenkins restarts (anymore) (#36) as for the instances shown below all servers were created after Jenkins was restarted and it was up the whole time. Also, when restarting, all Servers get removed as it is expected after fixing of #36.

image

Looking at a concrete instance, one can see it was not used for quite some time.

image

Anything else?

The shutdown policy "Removes server after its idle for period of time" seems to work as expected.

rkosegi commented 1 year ago

Hi @sandrinr , I have trouble to replicate this problem. Could you check logs if there is anything suspicious?

sandrinr commented 1 year ago

Hi @rkosegi

I know it is hard to reproduce. I enabled the feature again this morning and until now it is working as expected. I experienced it working before and then at some point it stops working. Normally for all nodes.

I crawled the logs a bit for the node mentioned in the screenshots above. I can see that Jenkins was very overloaded during that day.

I see a lot of errors such as:

2022-07-28 22:55:14.026+0000 [id=63870] SEVERE  hudson.plugins.plot.CSVSeries#loadSeries: Exception trying to retrieve series files
Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to hcloud-1xj5amqzz3f15wag
                at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1784)
                at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:356)
                at hudson.remoting.Channel.call(Channel.java:1000)
                at hudson.FilePath.act(FilePath.java:1194)
                at hudson.FilePath.act(FilePath.java:1183)
                at hudson.FilePath.list(FilePath.java:2133)
                at hudson.FilePath.list(FilePath.java:2116)
                at hudson.FilePath.list(FilePath.java:2100)
                at hudson.plugins.plot.CSVSeries.loadSeries(CSVSeries.java:149)
                at hudson.plugins.plot.Plot.addBuild(Plot.java:720)
                at hudson.plugins.plot.PlotBuilder.perform(PlotBuilder.java:235)
                at jenkins.tasks.SimpleBuildStep.perform(SimpleBuildStep.java:123)
                at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:101)
                at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:71)
                at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
                at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
                at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
                at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
                at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
                at java.base/java.lang.Thread.run(Thread.java:834)
java.io.IOException: Expecting Ant GLOB pattern, but saw '/replaced/path/to/jenkins-something.csv'. See https://ant.apache.org/manual/Types/fileset.html for syntax
        at hudson.FilePath.glob(FilePath.java:2170)
        at hudson.FilePath.access$2700(FilePath.java:212)
        at hudson.FilePath$ListGlob.invoke(FilePath.java:2151)
        at hudson.FilePath$ListGlob.invoke(FilePath.java:2136)
        at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3502)
        at hudson.remoting.UserRequest.perform(UserRequest.java:211)
        at hudson.remoting.UserRequest.perform(UserRequest.java:54)
        at hudson.remoting.Request$2.run(Request.java:376)
        at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

I don't think they are related.

The last thing I see in the logs regarding that node is:

2022-07-29 07:02:11.103+0000 [id=19331] INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel hcloud-1xj5amqzz3f15wag
java.io.EOFException
        at java.base/java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2872)
        at java.base/java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3367)
        at java.base/java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:936)
        at java.base/java.io.ObjectInputStream.<init>(ObjectInputStream.java:379)
        at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
        at hudson.remoting.Command.readFrom(Command.java:142)
        at hudson.remoting.Command.readFrom(Command.java:128)
        at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
        at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:61)
Caused: java.io.IOException: Unexpected termination of the channel
        at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:75)

Do you want me to look for something in particular?

rkosegi commented 1 year ago

Do you want me to look for something in particular?

That's the thing, no idea where to start. I was hoping to see some error from around this code

if (c.isIdle() && agent != null && agent.getServerInstance() != null) {
    if (Helper.canShutdownServer(agent.getServerInstance().getServerDetail().getCreated(),
            LocalDateTime.now().getMinute())) {
        log.info("Disconnecting {}", c.getName());
        try {
            agent.terminate();
        } catch (InterruptedException | IOException e) {
            log.warn("Failed to terminate {}", c.getName(), e);
        }
    }
}

because that's where decission about node termination happens

sandrinr commented 1 year ago

I checked the logs for that code.

I find zero occurrences of Failed to terminate. I also found zero occurrences of Disconnecting hcloud-1xj5amqzz3f15wag (I can see it for successfully terminated hcloud nodes). So this code is never called or these if-statements do not evaluate to true for the nodes that stay forever.

rkosegi commented 1 year ago

Here is an idea to eliminate some of conditions. If you encounter issue again, you can open server details under Manage jenkins => Manage nodes and clouds => hcloud-something....

If it says, No details available, then it means that agent.getServerInstance() == null which would be pretty weird.

Otherwise, it could only be caused by either c.isIdle() == false (which shouldn't happen based on graph you shared) or Helper.canShutdownServer(...) is returning false which I doubt as it is unit-tested with all possible combinations of inputs.

rkosegi commented 1 year ago

One another way would be executing following groovy script in console (Manage jenkins => Script console)

import java.time.LocalDateTime
import cloud.dnation.jenkins.plugins.hetzner.Helper

Jenkins.get().getComputers().findAll{ server -> server.class.name.contains("Hetzner")}.each {computer ->
  println(computer.isIdle())
  println(computer.getNode().getServerInstance())
  println(Helper.canShutdownServer(computer.getNode().getServerInstance().getServerDetail().getCreated(),
                        LocalDateTime.now().getMinute()))
  println("-------------")
}

That will give us exact information needed

toabi commented 1 year ago

I have a similar feeling… that it's kept around much long than necessary. Here the output:

true
HetznerServerInfo(sshKeyDetail=SshKeyDetail(created=2021-10-20T14:14:10+00:00, fingerprint=85:ab:ca:0f:48:47:3a:ed:b3:79:6d:57:3b:01:7a:c3, labels={jenkins.io/managed-by=hetzner-jenkins-plugin, jenkins.io/credentials-id=hcloud-ssh-private-key}, name=hcloud-ssh-private-key, publicKey=ssh-rsa YYY), serverDetail=ServerDetail(name=hcloud-bb5ypntw60jwrkms, status=running, created=2022-08-08T11:03:55+00:00, publicNet=PublicNetDetail(ipv4=Ipv4Detail(blocked=false, dnsPtr=static.XXX.clients.your-server.de, ip=XXX)), privateNet=[], serverType=ServerType(name=cpx41, description=CPX 41, cores=8, memory=16, disk=240), datacenter=DatacenterDetail(name=nbg1-dc3, description=Nuremberg 1 DC 3, location=LocationDetail(name=nbg1, description=Nuremberg DC Park 1, country=DE, city=Nuremberg)), image=ImageDetail(type=snapshot, status=available, name=null, description=jenkins)))
false
-------------
Result: [cloud.dnation.jenkins.plugins.hetzner.HetznerServerComputer@19c55552]

graph

rkosegi commented 1 year ago

Thanks @toabi, script output which you provided confirmed bug in Helper.canShutdownServer(...)

Will fix soon.

sandrinr commented 1 year ago

Somehow this does not yet seem to work on our side.

We have some nodes up for several days now. Some of them did not get any load for a long time.

When playing around with your debug script above I am positive the the Helper.canShutdownServer() returns true between 55 and 59 minutes when diffing the creation time from the current time mod 60 minutes. computer.isIdle() also returns true.

However, can we be sure the method BeforeHourWrapsPolicy.check() is always called. In our case the nodes were inactive so long that Jenkins put them into offline state.

image

I used the following script:

import java.time.LocalDateTime
import cloud.dnation.jenkins.plugins.hetzner.Helper
import java.time.format.DateTimeFormatter
import java.time.ZoneOffset
import java.time.Duration;

Jenkins.get().getComputers().findAll{ server -> server.class.name.contains("Hetzner")}.each {computer ->
  println(computer.isIdle())
  println(computer.getNode().getServerInstance())
  final String createdStr = computer.getNode().getServerInstance().getServerDetail().getCreated()
  println(createdStr)
  final LocalDateTime created = LocalDateTime.from(DateTimeFormatter.ISO_DATE_TIME.parse(createdStr))
                .atOffset(ZoneOffset.UTC).toLocalDateTime();
  println(created)
  final LocalDateTime currentTime = LocalDateTime.now()
  println(Duration.between(created, currentTime.atOffset(ZoneOffset.UTC).toLocalDateTime()).toMinutes() % 60)
  println(Helper.canShutdownServer(createdStr, currentTime))
  println("-------------")
}

With that script I get for example

true
HetznerServerInfo(sshKeyDetail=SshKeyDetail(...)), serverDetail=ServerDetail(name=hcloud-kkdg8f8jk5w1xh74, status=running, created=2022-08-15T08:36:32+00:00, publicNet=PublicNetDetail(...), privateNet=[], serverType=ServerType(...), datacenter=DatacenterDetail(...), image=ImageDetail(...)))
2022-08-15T08:36:32+00:00
2022-08-15T08:36:32
56
true

But the server is not removed.

rkosegi commented 1 year ago

Hi @sandrinr this was my concern very early when this feature was drafted - see point 2.

I'm not aware of any way to change that behavior, it's part of Jenkins that calling our code and we make decision at that point. If we're not called, there is nothing to decide (and thus no way to terminate at right time).

Maybe widening of safety buffer (currently 5mins) would help to mitigate issue, what do you think? Otherwise, I'm pretty out of ideas.

rkosegi commented 1 year ago

However, can we be sure the method BeforeHourWrapsPolicy.check() is always called. In our case the nodes were inactive so long that Jenkins put them into offline state.

maybe this is another issue. I guess that retention strategy is not called because agent is offline.

sandrinr commented 1 year ago

It might have something to do with the agent being offline. However, the agent is going offline because it it is not used for a long time.

Currently, for example, no agents are offline, but nodes are still not destroyed. The debug script outputs:

true
HetznerServerInfo(sshKeyDetail=SshKeyDetail(...), serverDetail=ServerDetail(name=hcloud-pm3hkdkkjcbatbje, status=running, created=2022-08-19T08:27:40+00:00, publicNet=PublicNetDetail(...), privateNet=[], serverType=ServerType(...), datacenter=DatacenterDetail(...), image=ImageDetail(...)))
2022-08-19T08:27:40+00:00
2022-08-19T08:27:40
59
true

Additional info: