Runners are not being re-used

NiklasRosenstein commented 10 months ago

Hello! Thanks again for this nice tool.

I have a pretty standard installation of the application (i.e. not many overridden config options, besides the startup script and default image, as well as an increased cleanup time for powered-off servers). I kicked off a bunch of CI jobs on my repository earlier through commits, and now it seems to have accumulated 10 powered-off servers (which seems to be the default max) and it doesn't progress.

I.e. it doesn't spin up new ones (which makes sense as per the default worker limit) but also doesn't reuse the powered off runners.

These are the config values I override:

    image: ghcr.io/kraken-build/github-runners:${tag}
    command:
      - --startup-x64-script=/opt/startup-x64.sh
      - --startup-arm64-script=/opt/startup-arm64.sh
      - --max-unused-runner-time=3000  # 50 minutes
      - --max-powered-off-time=3000  # 50 minutes
      # NOTE: We can not set a default image per architecture, so this will be invalid for arm servers.
      #       See https://github.com/testflows/TestFlows-GitHub-Hetzner-Runners/issues/10
      - --default-image=x86:app:docker-ce
    environment:
      - GITHUB_TOKEN=${github_token}
      - GITHUB_REPOSITORY=${repo.name}
      - HETZNER_TOKEN=${repo.hetzner_token}
    volumes:
      - ./startup-x64.sh:/opt/startup-x64.sh:ro
      - ./startup-arm64.sh:/opt/startup-arm64.sh:ro
      - /root/.ssh/id_rsa:/root/.ssh/id_rsa:ro
      - /root/.ssh/id_rsa.pub:/root/.ssh/id_rsa.pub:ro
    restart: always

Aside from this it's very vanilla, see the Dockerfile:

    FROM python:3.10 as builder

    RUN pip install --upgrade pip && \
        # See https://github.com/yaml/pyyaml/issues/736
        echo 'Cython < 3.0' > /tmp/constraint.txt && \
        pip install pex && \
        PIP_CONSTRAINT=/tmp/constraint.txt pex testflows.github.hetzner.runners==1.5.231020.1122452 \
            -c github-hetzner-runners -o /usr/local/bin/github-hetzner-runners

    FROM python:3.10
    COPY --from=builder /usr/local/bin/github-hetzner-runners /usr/local/bin/github-hetzner-runners
    ENTRYPOINT [ "/usr/local/bin/github-hetzner-runners" ]

Screenshot of a currently pending job:

Maybe relevant screenshot from two of the VMs that I think should get reused:

There have been no changes to the hetzner-runners configuration in the last week.

NiklasRosenstein commented 10 months ago

I've deleted all 10 powered off servers and immediately it began spinning up new ones. :)

vzakaznikov commented 10 months ago

Hi @NiklasRosenstein,

Runners might fail to be set up for many reasons. When this happens the servers are powered off automatically. There are a few options that control what happens to the powered off servers:

The max-powered-off-time:

The powered-off servers are deleted after the **max-powered-off-time** interval, which
can be specified using the **--max-powered-off-time** option, which by default is set to *20* sec.
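
To make the rule above concrete, here is a minimal sketch of the time check it describes. This is not the project's actual code: the function name `should_delete_powered_off` and the constant are hypothetical, and the 20 second default is simply taken from the README text quoted above.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical constant; the value mirrors the default quoted above (20 sec).
MAX_POWERED_OFF_TIME = 20


def should_delete_powered_off(powered_off_since: datetime, now: datetime,
                              max_powered_off_time: int = MAX_POWERED_OFF_TIME) -> bool:
    """Return True once a server has been powered off longer than the limit."""
    return (now - powered_off_since) > timedelta(seconds=max_powered_off_time)


now = datetime(2023, 11, 1, 12, 0, 0, tzinfo=timezone.utc)
off_since = now - timedelta(minutes=10)  # server has been off for 600 seconds

print(should_delete_powered_off(off_since, now))        # True with the 20 sec default
print(should_delete_powered_off(off_since, now, 3000))  # False with a 3000 sec limit
```

With the default, a server that has been off for ten minutes is already eligible for deletion; with a 3000 second limit it is not.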

The other is max-unused-runner-time. Here is a snippet from the README.

--------------
Unused Runners
--------------

The scale-down service also monitors all the runners that have **unused** status and tries to delete any servers associated with such
runners if the runner is **unused** for more than the **max-unused-runner-time** period. This is needed in case a runner never gets a job
assigned to it, and the server will stay in the power-on state. This cycle relies on the fact that the runner's name
is the same as the server's name. The **max-unused-runner-time** can be specified using the **--max-unused-runner-time** option, which by default
is set to *180* sec.
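
As a rough illustration of the cycle described above, the sketch below matches runners to servers by name and picks the servers whose runner has been unused for too long. The `Runner` class and `servers_to_delete` function are hypothetical stand-ins, not the tool's real data model.

```python
from dataclasses import dataclass


@dataclass
class Runner:
    """Hypothetical stand-in for a GitHub runner record."""
    name: str
    unused_for: float  # seconds the runner has had the unused status


def servers_to_delete(runners, server_names, max_unused_runner_time=180):
    """Names of servers whose same-named runner has been unused too long."""
    names = set(server_names)
    return [r.name for r in runners
            if r.unused_for > max_unused_runner_time and r.name in names]


runners = [Runner("runner-1", 30), Runner("runner-2", 400), Runner("runner-3", 900)]
servers = ["runner-1", "runner-2"]  # runner-3 has no matching server

print(servers_to_delete(runners, servers))  # ['runner-2']
```

The lookup only works because the runner's name and the server's name are identical, which is the assumption the README text states.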

There is also a case when we consider a server to be a zombie. Zombies are servers that are up but whose runners, for some reason, fail to register.

--------------
Zombie Servers
--------------

The scale-down service will delete any zombie servers. A zombie server is defined as any server that fails to register its runner within
the **max-runner-registration-time**. The **max-runner-registration-time** can be specified using the **--max-runner-registration-time** option
which by default is set to *180* sec.
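
The zombie definition above can be sketched as a filter over servers that have no registered runner after the registration window. The function `find_zombies` and its inputs are hypothetical and only illustrate the rule; the real service works against the Hetzner and GitHub APIs.

```python
def find_zombies(server_ages, registered_runner_names,
                 max_runner_registration_time=180):
    """Return names of servers that never registered a runner in time.

    server_ages maps a server name to the seconds since it was created.
    """
    registered = set(registered_runner_names)
    return sorted(name for name, age in server_ages.items()
                  if name not in registered
                  and age > max_runner_registration_time)


ages = {"srv-a": 60, "srv-b": 600, "srv-c": 700}

# srv-a is still inside the registration window, srv-c has a runner.
print(find_zombies(ages, ["srv-c"]))  # ['srv-b']
```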

In your case, you have the max-powered-off-time set to 50 min. So the server will only be deleted or recycled after 50 min. https://github.com/testflows/TestFlows-GitHub-Hetzner-Runners/blob/main/testflows/github/hetzner/runners/scale_down.py#L430 shows the logic. The server will not be recycled until that timeout is hit.
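
A small sketch of why the pool can stall in this situation. It assumes, as the behaviour reported in this issue suggests, that powered-off servers keep counting against the server limit until their timeout expires; `free_slots` is a hypothetical helper, not a function of the tool.

```python
def free_slots(powered_off_ages, running, max_servers, max_powered_off_time):
    """Slots left for new servers.

    Assumption: a powered-off server holds its slot until it has been
    off for longer than max_powered_off_time (all times in seconds).
    """
    still_held = sum(1 for age in powered_off_ages
                     if age <= max_powered_off_time)
    return max(0, max_servers - running - still_held)


ages = [600] * 10  # ten servers, each powered off for 10 minutes

print(free_slots(ages, running=0, max_servers=10, max_powered_off_time=3000))  # 0
print(free_slots(ages, running=0, max_servers=10, max_powered_off_time=20))    # 10
```

Under this assumption, ten servers that were powered off ten minutes ago block every slot when the limit is 3000 seconds, while the 20 second default would have freed all of them.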

Why do you have the max-powered-off-time set to 50 min? Have you waited 50 min to see whether these servers get recycled?

vzakaznikov commented 9 months ago

Hi @NiklasRosenstein, has my comment above addressed your issue?

vzakaznikov commented 9 months ago

Closing the issue for now. @NiklasRosenstein, please feel free to add more details and we can re-open it.

testflows / TestFlows-GitHub-Hetzner-Runners

Runners are not being re-used #14