philips-labs / terraform-aws-github-runner

Terraform module for scalable GitHub action runners on AWS
https://philips-labs.github.io/terraform-aws-github-runner/
MIT License
2.58k stars 619 forks source link

Additional runners launched #3290

Closed sdarwin closed 1 year ago

sdarwin commented 1 year ago

Description of problem:

Create a github actions workflow that has a mix of self-hosted and standard jobs. Some on github's infrastructure, and others on self-hosted runners. In the example below, there are 3 self-hosted jobs and 7 which are not self-hosted.

How many terraform runners ought to be created in this situation?

You would think the answer is 3. And having exactMatch: true should help, right? What is the effect of exactMatch in this situation? Since regular jobs do not have a "self-hosted" label they are not an exact match.

The result can easily be replicated. Use the multi-runner example https://github.com/philips-labs/terraform-aws-github-runner/tree/main/examples/multi-runner with max count of 10, and ephemeral = true, and the workflow file below.

(only) these modifications to examples/multi-runner: runners_maximum_count: 10 enable_ephemeral_runners: true

.github/workflows/ci.yml

name: CI
on:
  pull_request:
    branches:
      - master
      - develop
      - feature/**
  push:
    branches:
      - '*'
      - '**'
jobs:
  build:
    defaults:
      run:
        shell: bash
    strategy:
      fail-fast: false
      matrix:
        include:
          - name: test1
            os: [ self-hosted, linux, x64, ubuntu-latest ]
          - name: test2
            os: [ self-hosted, linux, x64, ubuntu-latest ]
          - name: test3
            os: [ self-hosted, linux, x64, ubuntu-latest ]
          - name: test4
            os: ubuntu-latest
          - name: test5
            os: ubuntu-latest
          - name: test6
            os: ubuntu-latest
          - name: test7
            os: ubuntu-latest
          - name: test8
            os: ubuntu-latest
          - name: test9
            os: ubuntu-latest
          - name: test10
            os: ubuntu-latest
    name: ${{ matrix.name }}
    timeout-minutes: 120
    runs-on: ${{ matrix.os }}
    steps:
      - name: Test
        shell: bash
        run: |
          echo "Running the test on ${{ matrix.os }}"
          sleep 600
          true

It's not very serious if these extra 7 runners start, because they ultimately don't get assigned any jobs, and then the brute force scale-down terminates the instances fairly quickly. But still, it would be preferable if they were never launched at all, right?

nocturne1 commented 1 year ago

Have you tried changing the runner_extra_labels and the labelMatchers on your self-hosted runner deployment to something different than 'ubuntu-latest'?

sdarwin commented 1 year ago

Hi @nocturne1 ,

Have you tried changing the label on your self-hosted runners to something different than 'ubuntu-latest' ?

I believe you are right, if there's no overlap with labels whatsoever, the extra runners won't be launched. I tested that with macOS.

So if none of the self-hosted runners were configured with the label 'ubuntu-latest', the problem wouldn't happen.

But the idea is to make it easy for our users to add a set of "self-hosted" labels to their existing jobs, and switch to "self-hosted". "ubuntu-latest" or "ubuntu-2004" would still be part of both environments.

What is the purpose and meaning of exactMatch if not to fix this issue?

nocturne1 commented 1 year ago

I think in some cases, such as when pools are used, this type of thing can happen because the label matching shifts over to GitHub, rather than the lambdas. But clearly this isn't the case here.

That being said, I see that there was some work in yesterday's release 3.4.1 that specifically has some changes in the label matching for pools. Might as well try that (I haven't had a chance to upgrade).

Also not a bad idea to take a look at the scale-up lambda logs in cloudwatch, as I think that'll show what it's matching on. Or try 'enable_job_queued_check=true', which seems to try to more closely correlate the requested jobs with the runners.

sdarwin commented 1 year ago

try 'enable_job_queued_check=true'

Based on the documentation, that isn't the desired functionality. "enable_job_queued_check to true you can enforce a rule of only creating a runner if the event has a correlated queued job." One can imagine a situation when GitHub is very busy. That's a problem we are hoping to address. During those times, jobs will be sitting and waiting. They will be "queued". It is likely there will be a "correlated queued job" if many jobs are queued.

What seems to be the right description is:

""" enable_runner_workflow_job_labels_check_all If set to true all labels in the workflow job must match the GitHub labels (os, architecture and self-hosted). When false if any label matches it will trigger the webhook. """

Examining the codebase, what enable_runner_workflow_job_labels_check_all does is configure exactMatch ? And that is enabled in the multi-runner example.

GuptaNavdeep1983 commented 1 year ago

https://github.com/philips-labs/terraform-aws-github-runner/blob/main/modules/multi-runner/variables.tf#L177 As pointed out here, the purpose of exactMatch is to match all labels in workflow with the labels supported by the runners but not vice versa. All labels supported by runner configuration need not be present in workflow job labels.

Given that in your case ubuntu-latest workflow job label is present in runner configuration, runners will be scaled. Even if runners are not scaled, ubuntu-latest job is going to be picked by the already present runners (Scaled by other job labels) if the runner supports ubuntu-latest label which is GitHub functionality.

https://github.com/philips-labs/terraform-aws-github-runner/blob/main/main.tf#L139 yes, enable_runner_workflow_job_labels_check_all is being used in top level runner as can be seen above.

sdarwin commented 1 year ago

but not vice versa

Thanks for the answer. Very interesting! It's easy to get those backwards. It might be helpful to have a paragraph in the documentation that explains the topic of matching by example, both when the workflow has more labels or fewer labels than the runner has.

The functionality we are planning to implement is to dynamically modify the workflow file to have either

runs-on: [ self-hosted, linux, x64, ubuntu-latest ]

or

runs-on: ubuntu-latest

So, in order for that to work, I guess exactMatch isn't enough.

How about another variable called fullMatch?

is going to be picked by the already present runners

If we implement the workflows as planned, it will be all-or-nothing. The workflow will be all self-hosted, triggering terraform runners to launch and process all jobs, or it will be all ubuntu-latest, meaning no terraform runners should launch. But a mix of those two could be problematic. If there are 10 jobs, and only 3 self-hosted ephemeral runners, it would be possible the self-hosted runners choose to process the plain ubuntu-latest jobs by accident, and then exit, causing the rest of the self-hosted jobs to remain in limbo.

In that case, there is an easier way. But not as elegant. The idea nocturne1 mentioned to use different labels entirely. Toggle between these options:

runs-on: [ self-hosted, linux, x64, ubuntu-latest-self-hosted ]

or

runs-on: ubuntu-latest

sdarwin commented 1 year ago

I guess that's the answer - ensure there is no overlap in the labels whatsoever. I will close the issue, but if you have any ideas about how a fullMatch strategy might work, let me know. Thanks.