GitHub actions pipeline to build, test and deploy collector

amosomokpo commented 5 days ago

> > Also, the test framework might need to be executed in a CI/CD pipeline.

Yes, that would be ideal! The question is whether we can spin up bare-metal VMs per run. I know the GCP console allows spinning up nodes with a timeout. It might be worth searching for a GitHub Action that supports spinning up VMs for testing.

It seems we need a short list of cloud providers to run the tests on, plus infrastructure to spin those up ad hoc (and the framework would need to support whatever provider we choose?).

Depending on the target cloud, we should look into a couple of GitHub Actions: either an action for Terraform, which can both provision and reclaim the test nodes, or a more cloud-specific action (https://github.com/google-github-actions/setup-gcloud). We also need an event/webhook trigger that starts the VM/bare-metal reclamation workflow. See https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/events-that-trigger-workflows
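
To make the provision-and-reclaim idea concrete, here is a minimal sketch of a single workflow that provisions a node with Terraform, runs the tests, and always tears the node down afterwards. It uses a simpler technique than a separate event-triggered reclamation workflow: a teardown step guarded by `if: always()`. The `infra/` directory, the `GCP_SA_KEY` secret, and `scripts/run-tests.sh` are hypothetical names, and GCP is assumed only because it is the provider mentioned above.

```yaml
# Sketch only: infra/, GCP_SA_KEY, and scripts/run-tests.sh are hypothetical.
name: provision-test-teardown
on:
  workflow_dispatch:

jobs:
  test-on-ephemeral-node:
    runs-on: ubuntu-latest
    env:
      # The Google Terraform provider reads credentials from this variable.
      GOOGLE_CREDENTIALS: ${{ secrets.GCP_SA_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3

      - name: Provision test node
        working-directory: infra
        run: |
          terraform init -input=false
          terraform apply -auto-approve -input=false

      - name: Run tests
        run: ./scripts/run-tests.sh

      - name: Reclaim test node
        # Runs even when an earlier step failed or the run was cancelled.
        if: always()
        working-directory: infra
        run: terraform destroy -auto-approve -input=false
```

Because the Terraform state stays on the job's runner here, the destroy step has to run in the same job. Moving reclamation into a separate workflow (for example one triggered by the `workflow_run` event) would require a remote state backend so that both workflows see the same state.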

Either way, the CI/CD pipeline deserves its own issue, too. I will create one.

Originally posted by @amosomokpo in https://github.com/perfpod/memory-collector/issues/2#issuecomment-2460284150

amosomokpo commented 3 days ago

Q: Is Terraform our best option, or is there a solution with GitHub Actions? (And if Terraform is best, can we use OpenTofu instead?)

A: If we are targeting multiple clouds, yes, Terraform is the best option. Not every cloud provider has a GitHub Action, but nearly all of them are likely to have a Terraform provider. OpenTofu sounds great; it would be the best option since it's completely open source. I'll look into its support for embedding in GitHub Actions.
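
For reference, a minimal sketch of what running OpenTofu inside a workflow could look like, using the `opentofu/setup-opentofu` action to install the `tofu` CLI. The `infra/` directory is a hypothetical location for the configuration, and cloud credentials are left out.

```yaml
# Sketch only: infra/ is a hypothetical directory; credentials are omitted.
name: opentofu-plan
on:
  pull_request:

jobs:
  plan:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: infra
    steps:
      - uses: actions/checkout@v4
      - uses: opentofu/setup-opentofu@v1
      - run: tofu init -input=false
      - run: tofu validate
      - run: tofu plan -input=false
```

The `tofu` commands mirror their `terraform` counterparts, so switching between the two should mostly be a matter of swapping the setup action and the binary name.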

amosomokpo commented 3 days ago

@yonch Are multiple clouds a requirement? It seems like only support for Intel and AMD is a requirement at this point?

yonch commented 2 days ago

Yes. 👍

I think we can have a lighter-weight check to verify support in different cloud providers (as in issue #3), and create a more extensive test suite in a single provider.

yonch commented 2 days ago

The title of this issue currently says "GitHub actions pipeline to build, test and deploy collector".

This seems like the most important issue at this point? And I think we might want to limit the scope here; otherwise this is a bit of a big bite to take on.

To understand what we want to build, we were thinking of first trying out the Telegraf Intel RDT plugin. I think we could use whatever Docker image is publicly available, or semi-manually build one and push it to, e.g., Docker Hub. So we can take that off our plate here.

Issue #2 deals with benchmark workloads.

So what remains is that we need a way to trigger tests on a bare-metal machine, or a large VM, that supports resctrl.

yonch commented 2 days ago

Here is an idea, following GitHub's "Autoscaling with self-hosted runners":

1. Run a Kubernetes cluster on one of the cloud providers.
2. Use GitHub Actions to trigger tests.
3. Tests run self-hosted on the Kubernetes cluster. The actions-runner-controller seems to be the official controller for this.
4. Each test that requires a specific node type will trigger a runner that only runs on that node type.
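
As a rough sketch of the last two steps, a workflow can target a specific runner scale set by using its name in `runs-on`. The scale set names `arc-intel` and `arc-amd` are hypothetical, and the CPU-flag check is only an illustrative stand-in for a real test.

```yaml
# Sketch only: arc-intel and arc-amd are hypothetical runner scale set names.
name: resctrl-tests
on:
  pull_request:

jobs:
  resctrl:
    strategy:
      fail-fast: false
      matrix:
        runner: [arc-intel, arc-amd]
    # Each matrix entry is picked up by a runner from the matching scale set.
    runs-on: ${{ matrix.runner }}
    steps:
      - uses: actions/checkout@v4
      - name: Check that the CPU advertises resctrl-related features
        # Fails the step if none of these flags appear in /proc/cpuinfo.
        run: grep -q -w -E 'cat_l3|cqm_llc|mba' /proc/cpuinfo
```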

I believe we can add a nodeSelector to the AutoscalingRunnerSet through the values.yaml (under template.spec) when deploying the controller. So this might require a controller deployment per node type.
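
A minimal sketch of such a values file, assuming the `gha-runner-scale-set` Helm chart from actions-runner-controller. The secret name, the node label value, and the runner limits are hypothetical placeholders; the repository URL is this project's.

```yaml
# Sketch only: values for one runner scale set pinned to one node type.
githubConfigUrl: "https://github.com/perfpod/memory-collector"
githubConfigSecret: arc-github-secret   # hypothetical, pre-created secret
minRunners: 0
maxRunners: 2

template:
  spec:
    # Schedule runner pods only on nodes carrying this label.
    nodeSelector:
      node.kubernetes.io/instance-type: example-metal-type   # hypothetical value
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
```

If I read the charts correctly, this template belongs to the scale set chart and not to the controller chart, so one controller with one scale set release per node type might be enough; that is worth verifying. Under that assumption, installing a set would look like `helm install arc-intel --namespace arc-runners -f values-intel.yaml oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set`, where the release name `arc-intel` is what workflows would put in `runs-on`.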

yonch commented 2 days ago

This does not describe how to run the test workload on the test node, only how to provision and autoscale nodes.

Brainstorming ideas:

- Use dind (Docker-in-Docker) to run a Kubernetes control plane on the node, and run workload containers through that control plane.
- Add more containers to the runner template (but then we cannot use the Helm charts for workloads).
- Modify the workload Helm charts to always run automatically on nodes of a certain type with DaemonSets (but that reduces the flexibility to modify workloads).
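
To illustrate the DaemonSet idea, here is a minimal sketch of a workload that runs automatically on every node of a given type. The label `example.com/node-type`, the workload name, and the image are hypothetical.

```yaml
# Sketch only: names, label, and image are hypothetical.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: benchmark-workload
spec:
  selector:
    matchLabels:
      app: benchmark-workload
  template:
    metadata:
      labels:
        app: benchmark-workload
    spec:
      # One pod per node that carries this label, and none elsewhere.
      nodeSelector:
        example.com/node-type: intel-resctrl
      containers:
        - name: workload
          image: example.com/benchmark-workload:latest
```

The trade-off mentioned above shows up here: the workload starts as soon as a matching node joins the cluster, so a test cannot easily choose which workload, or which version of it, runs alongside the collector.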

GitHub actions pipeline to build, test and deploy collector #4