Automating initialisation of baremetal for self-hosted-runners

hershd23 commented 2 years ago

Current Behavior

Currently we are

Doing baremetal initialisation manually by going to the Equinix UI.
Doing the setup by manually SSHing in, creating users, installing dependencies, etc.

The instructions are mentioned here

Desired Behavior

Ideally, both of these tasks should be automated using the Equinix APIs and tools like Terraform.

Implementation

We can use tools like Terraform, together with the existing Terraform support for the Equinix APIs, to achieve this. Relevant projects:

https://github.com/equinix/terraform-provider-equinix
https://github.com/equinix/cloud-provider-equinix-metal
https://github.com/machulav/ec2-github-runner#example
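
To make the provisioning step concrete, here is a minimal sketch of the request that creates a machine through the Equinix Metal REST API directly, which is roughly the operation a Terraform provider would wrap. The function name, the default metro `da`, and the OS slug `ubuntu_20_04` are illustrative assumptions, not values taken from this issue. The base URL and field names are as I understand the public API and should be checked against the current Equinix documentation. The request is only built here, not sent.

```python
import json
import urllib.request

# Assumed base URL of the Equinix Metal REST API; verify against current docs.
METAL_API = "https://api.equinix.com/metal/v1"


def build_create_device_request(project_id, api_token, hostname, userdata,
                                plan="c3.small.x86", metro="da",
                                operating_system="ubuntu_20_04"):
    """Build (but do not send) a request that would create one device.

    plan, metro and operating_system defaults are placeholders chosen for
    illustration; userdata is the boot script the machine runs on first start.
    """
    body = {
        "hostname": hostname,
        "plan": plan,
        "metro": metro,
        "operating_system": operating_system,
        "userdata": userdata,
    }
    return urllib.request.Request(
        f"{METAL_API}/projects/{project_id}/devices",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "X-Auth-Token": api_token,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit it. Keeping construction separate from sending makes the payload easy to inspect before any machine is actually billed.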

Acceptance Tests

Successful action runs with complete automation would resolve this issue.

Additional Comments

@leecalcote @gyohuangxin would creating a runner on demand (i.e. after starting a workflow) mean that the self-hosted-runner in itself would not be needed? Given we have to register a self-hosted-runner to a repository first. https://docs.github.com/en/actions/hosting-your-own-runners/adding-self-hosted-runners

gyohuangxin commented 2 years ago

Thank you for opening this issue, @hershd23.

@leecalcote @gyohuangxin would creating a runner on demand (i.e. after starting a workflow) mean that the self-hosted-runner in itself would not be needed? Given we have to register a self-hosted-runner to a repository first.

I don't think that's exactly right. If we follow the approach in https://github.com/machulav/ec2-github-runner#example, a GitHub-hosted runner is used to create a self-hosted runner on demand, but we still need to register that self-hosted runner with the GitHub repository so that it can receive jobs from our GitHub Action. Registration can be implemented with the self-hosted runners REST API: https://docs.github.com/en/rest/reference/actions#self-hosted-runners
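
As a rough illustration of what "registered" means in practice, the sketch below inspects the JSON returned by the list-runners endpoint of that REST API (`GET /repos/{owner}/{repo}/actions/runners`) and checks whether a runner carrying a given label is online. The helper names are hypothetical, and the payload shape (`runners`, `labels`, `status`) reflects my reading of the API documentation rather than anything stated in this thread.

```python
def find_runner_by_label(runners_payload, label):
    """Return the first runner dict that carries `label`, or None.

    `runners_payload` is assumed to be the decoded JSON body of the
    list-runners response: {"total_count": N, "runners": [...]}.
    """
    for runner in runners_payload.get("runners", []):
        names = {entry.get("name") for entry in runner.get("labels", [])}
        if label in names:
            return runner
    return None


def is_ready(runner):
    """A runner can pick up jobs once it exists and reports status 'online'."""
    return runner is not None and runner.get("status") == "online"
```

A start job could poll the endpoint and call these helpers until `is_ready` returns true, and only then expose the label as a job output.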

Therefore, in my opinion, our benchmark workflow on a self-hosted runner would look like this:

name: Configurable Benchmark Test
on: 
  workflow_dispatch:
    inputs:
      profile_name:
        description: "performance profile to use"
        required: false
      profile_filename:
        description: "test configuration file"
        required: false
      service_mesh:
        type: choice
        required: false
        description: "service mesh being tested"
        options:
          - istio
          - linkerd
      load_generator:
        type: choice
        required: false
        description: "load generator to run tests with"
        options:
          - fortio
          - wrk2
          - nighthawk

jobs:
  start-runner:
    name: Start self-hosted CNCF CIL runner
    runs-on: ubuntu-latest
    outputs:
      label: ${{ steps.start-cil-runner.outputs.label }}
      cil-instance-id: ${{ steps.start-cil-runner.outputs.cil-instance-id }}
    steps:
      - name: Configure CNCF CIL credentials
        ......
      - name: Start CNCF CIL runner
        id: start-cil-runner
        uses: layer5io/meshery-smp-action
        with:
          mode: start
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          cil-image-id: ubuntu-20.04
          cil-instance-type: c3.small.x86
          user-data-scripts: ....
          ......
  register-runner:
    name: Register the self-hosted runner to the GitHub repo
    needs: start-runner # assumed ordering: register only after the machine has been started
    runs-on: ubuntu-latest
    steps:
      ......
  run-benchmarks:
    name: Run the configurable benchmarks on the runner
    needs:
      - start-runner # required to start the main job when the runner is ready
      - register-runner # required so the runner is registered before this job is scheduled
    runs-on: ${{ needs.start-runner.outputs.label }} # run the job on the newly created runner
    steps: 
      - name: Setup Kubernetes
        ......
      - name: Checkout Code
        ......
      - name: Install Service Mesh and Deploy Application
        ......
      - name: Run Benchmark Tests
        ......

  stop-runner:
    name: Stop self-hosted runner
    needs:
      - start-runner # required to get output from the start-runner job
      - run-benchmarks # required to wait when the main job is done
    runs-on: ubuntu-latest
    if: ${{ always() }} # required to stop the runner even if the error happened in the previous jobs
    steps:
      - name: Configure CNCF CIL credentials
        ......
      - name: Stop CNCF CIL runner
        ......

@leecalcote @navendu-pottekkat Do you have any comments on the above workflow design?

pottekkat commented 2 years ago

@gyohuangxin This sounds good to me. The workflow looks straightforward.

hershd23 commented 2 years ago

Hmm right @gyohuangxin.

I wasn't so sure that it was possible to register a runner at runtime. Your flow makes a lot of sense.

gyohuangxin commented 2 years ago

I wasn't so sure that it was possible to register a runner at runtime. Your flow makes a lot of sense.

I looked at the workflow that runs the AWS GitHub runner, and the answer is yes. The steps for registering a runner via the API would be:

Get a registration token from the GitHub API: https://github.com/machulav/ec2-github-runner/blob/c34ba2df3363ebde9d19fdbc341e03d02267284d/src/index.js#L13
Start the machine with a userdata script that contains the self-hosted runner registration steps. This script is executed as soon as the machine starts, and it uses the registration token from the previous step to register the self-hosted runner: https://github.com/machulav/ec2-github-runner/blob/c34ba2df3363ebde9d19fdbc341e03d02267284d/src/aws.js#L6
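
The two steps above can be sketched roughly as follows. This is not the code from ec2-github-runner or from the PR; the function names are hypothetical, and `runner_version` is deliberately left as a parameter because no version is given in this thread. The token endpoint and the `config.sh` flags (`--url`, `--token`, `--labels`, `--unattended`) are those documented for GitHub's self-hosted runners. The request is built but not sent.

```python
import urllib.request


def build_registration_token_request(owner, repo, github_token):
    """Step 1: request that asks GitHub for a short-lived registration token.

    The JSON response is expected to contain a "token" field.
    """
    return urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}"
        "/actions/runners/registration-token",
        headers={
            "Authorization": f"Bearer {github_token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )


def render_userdata(owner, repo, registration_token, label, runner_version):
    """Step 2: boot script that installs the runner and registers it.

    runner_version is an assumption supplied by the caller, e.g. the latest
    release of actions/runner at the time the machine is created.
    """
    archive = f"actions-runner-linux-x64-{runner_version}.tar.gz"
    return "\n".join([
        "#!/bin/bash",
        "mkdir actions-runner && cd actions-runner",
        "curl -O -L https://github.com/actions/runner/releases/download/"
        f"v{runner_version}/{archive}",
        f"tar xzf ./{archive}",
        # Userdata scripts normally run as root, which the runner refuses
        # unless this variable is set.
        "export RUNNER_ALLOW_RUNASROOT=1",
        f"./config.sh --url https://github.com/{owner}/{repo} "
        f"--token {registration_token} --labels {label} --unattended",
        "./run.sh",
        "",
    ])
```

Giving each machine a unique label lets the benchmark job target exactly that runner through `runs-on`, which is how the workflow design earlier in this thread passes the `label` output between jobs.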

I'm testing it on a CNCF CIL runner and will raise a PR later.

gyohuangxin commented 2 years ago

I raised a PR to automate the initialization of an on-demand self-hosted CNCF CIL runner. With this workflow, the CNCF CIL machine can be created and registered as a self-hosted runner to run benchmarks on. After the benchmarks are done, the machine is stopped and removed. You can see the workflow and details at https://github.com/gyohuangxin/meshery-smp-action/actions/runs/1897832766. I still have some issues with the installation of dependencies, so I made it a draft PR, but any comments are welcome.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

layer5io / meshery-smp-action