Closed NiklasRosenstein closed 8 months ago
I've deleted all 10 powered off servers and immediately it began spinning up new ones. :)
Hi @NiklasRosenstein,
Runners might fail to be set up for many reasons. When this happens the servers are powered off automatically. There are a few options that control what happens to the powered off servers:
The max-powered-off-time:
The powered-off servers are deleted after the **max-powered-off-time** interval, which
can be specified using the **--max-powered-off-time** option, which by default is set to *20* sec.
The other is max-unused-runner-time. Here is a snippet from the README.
--------------
Unused Runners
--------------
The scale-down service also monitors all the runners that have **unused** status and tries to delete any servers associated with such
runners if the runner is **unused** for more than the **max-unused-runner-time** period. This is needed in case a runner never gets a job
assigned to it, and the server will stay in the power-on state. This cycle relies on the fact that the runner's name
is the same as the server's name. The **max-unused-runner-time** can be specified using the **--max-unused-runner-time** option, which by default
is set to *180* sec.
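To make the unused-runner rule concrete, here is a minimal sketch of that check. It is not the code from scale_down.py; the `Runner` class, the `unused_since` field, and the function name are hypothetical and only illustrate the idea that a runner and its server are matched by name.

```python
from dataclasses import dataclass


@dataclass
class Runner:
    name: str            # assumed to be identical to the server's name
    status: str          # hypothetical status label, e.g. "unused" or "busy"
    unused_since: float  # hypothetical field: epoch seconds when it became unused


def servers_to_delete(runners, server_names, now, max_unused_runner_time=180):
    """Return names of servers whose same-named runner has been unused too long."""
    expired = []
    for runner in runners:
        if runner.status != "unused":
            continue  # only unused runners are considered
        if now - runner.unused_since <= max_unused_runner_time:
            continue  # still within the allowed unused period
        if runner.name in server_names:
            expired.append(runner.name)  # matched purely by name
    return expired


runners = [
    Runner("runner-a", "unused", unused_since=0),
    Runner("runner-b", "busy", unused_since=0),
    Runner("runner-c", "unused", unused_since=900),
]
print(servers_to_delete(runners, {"runner-a", "runner-b", "runner-c"}, now=1000))
# ['runner-a']
```

Only runner-a qualifies: runner-b is not unused, and runner-c has been unused for 100 sec, which is under the 180 sec default.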
There is also a case where we consider a server to be a zombie. Zombies are servers that are up but whose runners, for some reason, fail to register.
--------------
Zombie Servers
--------------
The scale-down service will delete any zombie servers. A zombie server is defined as any server that fails to register its runner within
the **max-runner-registration-time**. The **max-runner-registration-time** can be specified using the **--max-runner-registration-time** option
which by default is set to *180* sec.
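The zombie rule can be sketched in the same spirit. This is an illustration only, not the project's implementation: the function name and its inputs are hypothetical, and measuring the registration window from the server's creation time is an assumption.

```python
def find_zombies(server_created_at, registered_runner_names, now,
                 max_runner_registration_time=180):
    """Return sorted names of servers with no registered runner after the limit.

    server_created_at: dict mapping server name to creation time (epoch seconds).
    registered_runner_names: set of runner names that did register.
    """
    return sorted(
        name
        for name, created in server_created_at.items()
        if name not in registered_runner_names
        and now - created > max_runner_registration_time
    )


servers = {"server-1": 0, "server-2": 0, "server-3": 900}
print(find_zombies(servers, {"server-2"}, now=1000))
# ['server-1']
```

Here server-2 registered its runner, and server-3 is only 100 sec old, so it still has time to register.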
In your case, you have the max-powered-off-time set to 50 min. So the server will only be deleted or recycled after 50 min.
https://github.com/testflows/TestFlows-GitHub-Hetzner-Runners/blob/main/testflows/github/hetzner/runners/scale_down.py#L430 shows the logic. The server will not be recycled
until that timeout is hit.
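As a rough illustration of why the 50 min setting matters, the sketch below compares it with the 20 sec default. The function name is hypothetical and the strict comparison is an assumption; the linked scale_down.py is the authoritative logic.

```python
from datetime import timedelta

# Default from the README snippet above.
DEFAULT_MAX_POWERED_OFF_TIME = timedelta(seconds=20)


def is_past_powered_off_limit(powered_off_for,
                              max_powered_off_time=DEFAULT_MAX_POWERED_OFF_TIME):
    """True once a server has been powered off longer than the configured limit."""
    return powered_off_for > max_powered_off_time


configured = timedelta(minutes=50)
print(configured.total_seconds())                                         # 3000.0
print(is_past_powered_off_limit(timedelta(minutes=10), configured))       # False
print(is_past_powered_off_limit(timedelta(minutes=10)))                   # True
```

With the default, a server that has been off for 10 min is already past the limit; with 50 min configured, it is not yet eligible.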
Why do you have the max-powered-off-time set to 50 min? Have you waited 50 min to see whether these servers get recycled?
Hi @NiklasRosenstein, has my comment above addressed your issue?
Closing the issue for now. @NiklasRosenstein, please feel free to add more details and we can re-open it.
Hello! Thanks again for this nice tool.
I have a pretty standard installation of the application (i.e. not many overridden config options, besides the startup script and default image, as well as increasing the time to clean up powered off servers). I kicked off a bunch of CI jobs on my repository earlier through commits, and now it seems to have accumulated 10 powered off servers (which seems to be the default max) and it doesn't progress.
I.e. it doesn't spin up new ones (which makes sense as per the default worker limit) but also doesn't reuse the powered off runners.
These are the config values I override:
Aside from this it's very vanilla, see the Dockerfile:
Screenshot of a currently pending job:
Maybe relevant screenshot from two of the VMs that I think should get reused:
There have been no changes to the hetzner-runners configuration in the last week.