Remove idle servers just before billing hour completion does not always work, servers stay around forever #48

Closed sandrinr closed 1 year ago

sandrinr commented 1 year ago

What Operating System are you using (both controller, and any agents involved in the problem)?

Controller: Ubuntu 20.04 Agent: Ubuntu 20.04

Reproduction steps

I don't know how to reproduce the problem. It does not occur reliably.

We have two server templates configured. Both have their shutdown policies set to "Removes idle server just before current hour of billing cycle completes".

Expected Results

All the servers spawned by Jenkins should be torn down if they are idle around the time when the billing cycle completes.

Actual Results

Servers stay around forever. It seems to happen either to all current servers or to none of them. I don't think it is related to Jenkins restarts (anymore) (#36): for the instances shown below, all servers were created after Jenkins was restarted, and Jenkins was up the whole time. Also, when restarting, all servers get removed, as expected after the fix for #36.


Looking at a concrete instance, one can see it was not used for quite some time.


Anything else?

The shutdown policy "Removes server after its idle for period of time" seems to work as expected.

rkosegi commented 1 year ago

Hi @sandrinr, I'm having trouble reproducing this problem. Could you check the logs for anything suspicious?

sandrinr commented 1 year ago

Hi @rkosegi

I know it is hard to reproduce. I enabled the feature again this morning, and so far it is working as expected. I have seen it work before and then, at some point, stop working, normally for all nodes.

I searched the logs for the node mentioned in the screenshots above. I can see that Jenkins was heavily overloaded that day.

I see a lot of errors such as:

2022-07-28 22:55:14.026+0000 [id=63870] SEVERE  hudson.plugins.plot.CSVSeries#loadSeries: Exception trying to retrieve series files
Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to hcloud-1xj5amqzz3f15wag
                at hudson.remoting.Channel.attachCallSiteStackTrace(
                at hudson.remoting.UserRequest$ExceptionResponse.retrieve(
                at hudson.FilePath.act(
                at hudson.FilePath.act(
                at hudson.FilePath.list(
                at hudson.FilePath.list(
                at hudson.FilePath.list(
                at hudson.plugins.plot.CSVSeries.loadSeries(
                at hudson.plugins.plot.Plot.addBuild(
                at hudson.plugins.plot.PlotBuilder.perform(
                at jenkins.tasks.SimpleBuildStep.perform(
                at org.jenkinsci.plugins.workflow.steps.CoreStep$
                at org.jenkinsci.plugins.workflow.steps.CoreStep$
                at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(
                at java.base/java.util.concurrent.Executors$
                at java.base/
                at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(
                at java.base/java.util.concurrent.ThreadPoolExecutor$
                Expecting Ant GLOB pattern, but saw '/replaced/path/to/jenkins-something.csv'. See for syntax
        at hudson.FilePath.glob(
        at hudson.FilePath.access$2700(
        at hudson.FilePath$ListGlob.invoke(
        at hudson.FilePath$ListGlob.invoke(
        at hudson.FilePath$
        at hudson.remoting.UserRequest.perform(
        at hudson.remoting.UserRequest.perform(
        at hudson.remoting.Request$
        at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(
        at java.base/
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(
        at java.base/java.util.concurrent.ThreadPoolExecutor$
        at java.base/

I don't think they are related.

The last thing I see in the logs regarding that node is:

2022-07-29 07:02:11.103+0000 [id=19331] INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel hcloud-1xj5amqzz3f15wag
        at java.base/$PeekInputStream.readFully(
        at java.base/$BlockDataInputStream.readShort(
        at java.base/
        at java.base/<init>(
        at hudson.remoting.ObjectInputStreamEx.<init>(
        at hudson.remoting.Command.readFrom(
        at hudson.remoting.Command.readFrom(
        at hudson.remoting.SynchronousCommandTransport$
Caused: Unexpected termination of the channel
        at hudson.remoting.SynchronousCommandTransport$

Do you want me to look for something in particular?

rkosegi commented 1 year ago

Do you want me to look for something in particular?

That's the thing: I have no idea where to start. I was hoping to see some error from around this code

if (c.isIdle() && agent != null && agent.getServerInstance() != null) {
    if (Helper.canShutdownServer(agent.getServerInstance().getServerDetail().getCreated(),
            LocalDateTime.now())) {
        log.info("Disconnecting {}", c.getName());
        try {
            agent.terminate();
        } catch (InterruptedException | IOException e) {
            log.warn("Failed to terminate {}", c.getName(), e);
        }
    }
}

because that's where the decision about node termination is made.

sandrinr commented 1 year ago

I checked the logs for that code.

I find zero occurrences of Failed to terminate. I also found zero occurrences of Disconnecting hcloud-1xj5amqzz3f15wag (I can see it for successfully terminated hcloud nodes). So this code is never called or these if-statements do not evaluate to true for the nodes that stay forever.

rkosegi commented 1 year ago

Here is an idea to eliminate some of the conditions. If you encounter the issue again, you can open the server details under Manage Jenkins => Manage nodes and clouds => hcloud-something....

If it says "No details available", then agent.getServerInstance() == null, which would be pretty weird.

Otherwise, it can only be caused by either c.isIdle() == false (which shouldn't happen, based on the graph you shared) or Helper.canShutdownServer(...) returning false, which I doubt, as it is unit-tested with all possible combinations of inputs.
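The elimination described above can be written down as a small decision helper. This is only a sketch: the class TerminationGuardSketch, the enum Blocker, and the method firstBlocker are hypothetical names that do not exist in the plugin, and the three booleans stand in for c.isIdle(), agent.getServerInstance() != null, and the result of Helper.canShutdownServer(...).

```java
final class TerminationGuardSketch {
    /** Which guard, if any, stops the termination branch from being reached. */
    enum Blocker { NONE, NOT_IDLE, NO_SERVER_INSTANCE, OUTSIDE_SHUTDOWN_WINDOW }

    /**
     * Mirrors the order of the conditions in the snippet quoted earlier in the thread.
     * The parameters are plain booleans so the sketch runs without Jenkins.
     */
    static Blocker firstBlocker(boolean idle, boolean hasServerInstance, boolean canShutdown) {
        if (!idle) {
            return Blocker.NOT_IDLE;
        }
        if (!hasServerInstance) {
            return Blocker.NO_SERVER_INSTANCE; // UI would show "No details available"
        }
        if (!canShutdown) {
            return Blocker.OUTSIDE_SHUTDOWN_WINDOW;
        }
        return Blocker.NONE; // all guards pass, so the "Disconnecting ..." line would be logged
    }

    public static void main(String[] args) {
        // Idle node with server details, but the time check says "not yet".
        System.out.println(firstBlocker(true, true, false));
    }
}
```

If all three inputs can be confirmed as true for a stuck node, none of the guards explains the missing "Disconnecting" log line.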

rkosegi commented 1 year ago

Another way would be to execute the following Groovy script in the script console (Manage Jenkins => Script Console):

import java.time.LocalDateTime
import cloud.dnation.jenkins.plugins.hetzner.Helper

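// Assumption: the filter predicate and the closure body were lost from this line; they are reconstructed here.
Jenkins.get().getComputers().findAll{ server -> server.getClass().getName().contains("Hetzner") }.each { computer ->
  println(computer.getNode().getServerInstance())
}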

That will give us the exact information needed.

toabi commented 1 year ago

I have a similar feeling… that it's kept around much longer than necessary. Here is the output:

HetznerServerInfo(sshKeyDetail=SshKeyDetail(created=2021-10-20T14:14:10+00:00, fingerprint=85:ab:ca:0f:48:47:3a:ed:b3:79:6d:57:3b:01:7a:c3, labels={,}, name=hcloud-ssh-private-key, publicKey=ssh-rsa YYY), serverDetail=ServerDetail(name=hcloud-bb5ypntw60jwrkms, status=running, created=2022-08-08T11:03:55+00:00, publicNet=PublicNetDetail(ipv4=Ipv4Detail(blocked=false,, ip=XXX)), privateNet=[], serverType=ServerType(name=cpx41, description=CPX 41, cores=8, memory=16, disk=240), datacenter=DatacenterDetail(name=nbg1-dc3, description=Nuremberg 1 DC 3, location=LocationDetail(name=nbg1, description=Nuremberg DC Park 1, country=DE, city=Nuremberg)), image=ImageDetail(type=snapshot, status=available, name=null, description=jenkins)))
Result: [cloud.dnation.jenkins.plugins.hetzner.HetznerServerComputer@19c55552]


rkosegi commented 1 year ago

Thanks @toabi, the script output you provided confirmed a bug in Helper.canShutdownServer(...)

Will fix soon.

sandrinr commented 1 year ago

Somehow this does not yet seem to work on our side.

We have some nodes up for several days now. Some of them did not get any load for a long time.

After playing around with your debug script above, I am positive that Helper.canShutdownServer() returns true when the time elapsed since creation, taken modulo 60 minutes, is between 55 and 59 minutes. computer.isIdle() also returns true.
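The modulo check described above can be sketched in plain Java. This is a minimal illustration, not the plugin's actual Helper.canShutdownServer() implementation: the class BillingWindowSketch, its method names, and the constant BUFFER_MINUTES are made up here, and the 5-minute window is inferred from the 55 to 59 minute range observed above. It also assumes the current time is given in UTC.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

final class BillingWindowSketch {
    // Assumed width of the shutdown window at the end of each billing hour.
    static final int BUFFER_MINUTES = 5;

    /** Whole minutes elapsed in the server's current billing hour (0..59 when nowUtc is after creation). */
    static long minutesIntoBillingHour(String createdIso, LocalDateTime nowUtc) {
        // The creation timestamp carries an offset, e.g. "2022-08-15T08:36:32+00:00".
        LocalDateTime createdUtc = OffsetDateTime.parse(createdIso)
                .withOffsetSameInstant(ZoneOffset.UTC)
                .toLocalDateTime();
        return Duration.between(createdUtc, nowUtc).toMinutes() % 60;
    }

    /** True only during the last BUFFER_MINUTES of a billing hour. */
    static boolean inShutdownWindow(String createdIso, LocalDateTime nowUtc) {
        return minutesIntoBillingHour(createdIso, nowUtc) >= 60 - BUFFER_MINUTES;
    }

    public static void main(String[] args) {
        String created = "2022-08-15T08:36:32+00:00"; // creation time taken from the thread
        LocalDateTime now = LocalDateTime.of(2022, 8, 15, 9, 32, 0);
        System.out.println(minutesIntoBillingHour(created, now)); // 55
        System.out.println(inShutdownWindow(created, now));       // true
    }
}
```

With this arithmetic the window reopens every hour (for example again at 10:32 UTC for the same server), so a node that is idle for days passes through many windows in which a check could terminate it.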

However, can we be sure that the method BeforeHourWrapsPolicy.check() is always called? In our case the nodes were inactive for so long that Jenkins put them into the offline state.


I used the following script:

import java.time.LocalDateTime
import cloud.dnation.jenkins.plugins.hetzner.Helper
import java.time.format.DateTimeFormatter
import java.time.ZoneOffset
import java.time.Duration;

// Assumption: the filter predicate was lost from the line below; it is reconstructed here.
Jenkins.get().getComputers().findAll{ server -> server.getClass().getName().contains("Hetzner") }.each { computer ->
  final String createdStr = computer.getNode().getServerInstance().getServerDetail().getCreated()
  final LocalDateTime created = LocalDateTime.from(DateTimeFormatter.ISO_DATE_TIME.parse(createdStr))
  // Assumption: the right-hand side was truncated; LocalDateTime.now() is a guess.
  final LocalDateTime currentTime = LocalDateTime.now()
  println(Duration.between(created, currentTime.atOffset(ZoneOffset.UTC).toLocalDateTime()).toMinutes() % 60)
  println(Helper.canShutdownServer(createdStr, currentTime))
}

With that script I get, for example:

HetznerServerInfo(sshKeyDetail=SshKeyDetail(...)), serverDetail=ServerDetail(name=hcloud-kkdg8f8jk5w1xh74, status=running, created=2022-08-15T08:36:32+00:00, publicNet=PublicNetDetail(...), privateNet=[], serverType=ServerType(...), datacenter=DatacenterDetail(...), image=ImageDetail(...)))

But the server is not removed.

rkosegi commented 1 year ago

Hi @sandrinr, this was my concern very early on, when this feature was drafted - see point 2.

I'm not aware of any way to change that behavior. It is Jenkins that calls our code, and we make the decision at that point. If we're not called, there is nothing to decide (and thus no way to terminate at the right time).

Maybe widening the safety buffer (currently 5 minutes) would help to mitigate the issue. What do you think? Otherwise, I'm pretty much out of ideas.
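Why a wider buffer could help can be illustrated with a small simulation. It assumes, purely for illustration, that the policy is evaluated at fixed check times measured in minutes since server creation. How often Jenkins really invokes the check, especially under load, is not established in this thread, and BufferWidthSketch with its methods is a hypothetical helper, not plugin code.

```java
import java.util.ArrayList;
import java.util.List;

final class BufferWidthSketch {
    /** Hypothetical schedule: one check every intervalMinutes, starting at firstMinute, up to totalMinutes. */
    static List<Long> schedule(long firstMinute, long intervalMinutes, long totalMinutes) {
        if (intervalMinutes <= 0) {
            throw new IllegalArgumentException("intervalMinutes must be positive");
        }
        List<Long> checks = new ArrayList<>();
        for (long m = firstMinute; m < totalMinutes; m += intervalMinutes) {
            checks.add(m);
        }
        return checks;
    }

    /** True if at least one check lands in the last bufferMinutes of some billing hour. */
    static boolean anyCheckInWindow(List<Long> checkMinutes, int bufferMinutes) {
        for (long m : checkMinutes) {
            if (m % 60 >= 60 - bufferMinutes) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Sparse checks at minutes 3, 13, 23, ... over three hours.
        List<Long> sparse = schedule(3, 10, 180);
        System.out.println(anyCheckInWindow(sparse, 5));  // false: no check falls in minutes 55..59
        System.out.println(anyCheckInWindow(sparse, 10)); // true: minute 53 falls in minutes 50..59
    }
}
```

The trade-off is that a wider buffer terminates servers earlier in the hour they have already been billed for, and it does nothing if the check is not called at all.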

rkosegi commented 1 year ago

However, can we be sure that the method BeforeHourWrapsPolicy.check() is always called? In our case the nodes were inactive for so long that Jenkins put them into the offline state.

Maybe this is another issue. I guess that the retention strategy is not called because the agent is offline.

sandrinr commented 1 year ago

It might have something to do with the agent being offline. However, the agent goes offline because it has not been used for a long time.

Currently, for example, no agents are offline, but nodes are still not destroyed. The debug script outputs:

HetznerServerInfo(sshKeyDetail=SshKeyDetail(...), serverDetail=ServerDetail(name=hcloud-pm3hkdkkjcbatbje, status=running, created=2022-08-19T08:27:40+00:00, publicNet=PublicNetDetail(...), privateNet=[], serverType=ServerType(...), datacenter=DatacenterDetail(...), image=ImageDetail(...)))

Additional info: