perfpod / memory-collector

A Kubernetes-native collector for monitoring memory subsystem interference between pods
Apache License 2.0

GitHub actions pipeline to build, test and deploy collector #4

Open amosomokpo opened 5 days ago

amosomokpo commented 5 days ago
> > Also, the test framework might need to be executed in a CI/CD pipeline.

Yes, that would be ideal! The question is whether we can spin up bare-metal VMs per run. I know the GCP console allows spinning up nodes with a timeout. I think it might be worth searching for a GitHub action that supports spinning up VMs for testing.

Seems like we need to get a short list of cloud providers to run the tests on, and an infra to spin those up ad-hoc (and the framework would need to support whatever provider we choose?)

Depending on the target cloud, we should look into a couple of GitHub Actions: for example, a Terraform action that can both provision and reclaim the test nodes, or more cloud-specific actions such as https://github.com/google-github-actions/setup-gcloud. We also need an eventing/webhook trigger to listen for and start the VM/bare-metal reclamation workflow. See https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/events-that-trigger-workflows

Either way, the CI/CD pipeline deserves its own issue, too. I will create one.

Originally posted by @amosomokpo in https://github.com/perfpod/memory-collector/issues/2#issuecomment-2460284150

amosomokpo commented 3 days ago

Q: Is Terraform our best option, or is there a solution with GitHub Actions? (And if Terraform is best, can we use OpenTofu instead?)

A: If we are targeting multiple clouds, yes, Terraform is the best option. Not all cloud providers will have a GitHub action, but all of them are more likely to have a Terraform provider. OpenTofu sounds great; it's the best option since it's completely open source. I'll look into its support for embedding in GitHub Actions.

amosomokpo commented 3 days ago

@yonch Is supporting multiple clouds a requirement? It seems like only support for Intel and AMD is a requirement at this point?

yonch commented 2 days ago

Yes. 👍

I think we can have a lighter-weight check to verify support in different cloud providers (as in issue #3), and create a more extensive test suite in a single provider.

yonch commented 2 days ago

The title of this issue currently says "GitHub actions pipeline to build, test and deploy collector".

This seems like the most important issue at this point? And I think we might want to limit the scope here, otherwise this seems like a bit of a big bite to take on.

To understand what we want to build, we were thinking of first trying out the Telegraf Intel RDT plugin. I think we could use whatever Docker image is publicly available, or semi-manually make one and push it to, e.g., Docker Hub. So we can take that off our plate here.

Issue #2 deals with benchmark workloads.

So what remains is, we need a way to trigger tests on a bare metal machine, or large VM that supports resctrl.
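As a rough illustration of what "supports resctrl" could mean for a candidate test machine, here is a minimal Python sketch. It assumes a Linux host and relies on two things the kernel exposes: RDT-related CPU flags in `/proc/cpuinfo` and a `resctrl` entry in `/proc/filesystems`. The function names are hypothetical, and a flag being present does not guarantee a cloud provider actually exposes the feature inside a VM.

```python
from pathlib import Path

# CPU feature flags the Linux kernel reports for RDT-style
# cache/memory-bandwidth allocation and monitoring.
RDT_FLAGS = {
    "rdt_a", "cat_l3", "cat_l2", "cdp_l3", "cdp_l2", "mba",
    "cqm", "cqm_llc", "cqm_occup_llc", "cqm_mbm_total", "cqm_mbm_local",
}


def rdt_flags(cpuinfo_text: str) -> set[str]:
    """Return the RDT-related flags found in /proc/cpuinfo-style text."""
    found: set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            found |= RDT_FLAGS & set(value.split())
    return found


def kernel_has_resctrl(filesystems_text: str) -> bool:
    """True if 'resctrl' is listed in /proc/filesystems-style text."""
    return any(
        line.split()[-1:] == ["resctrl"]
        for line in filesystems_text.splitlines()
    )


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    filesystems = Path("/proc/filesystems")
    if cpuinfo.exists() and filesystems.exists():
        print("RDT CPU flags:", sorted(rdt_flags(cpuinfo.read_text())))
        print("kernel resctrl:", kernel_has_resctrl(filesystems.read_text()))
    else:
        print("not a Linux host; nothing to check")
```

A check like this could run as the first step of a test job, so a run fails fast when it lands on a node without the needed hardware support.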

yonch commented 2 days ago

Here is an idea, following GitHub's "Autoscaling with self-hosted runners":

I believe we can add a nodeSelector in the AutoscalingRunnerSet from the values.yaml when deploying the controller (under template.spec). So this might require a controller deployment per node type.
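To make the nodeSelector idea concrete, here is a small Python sketch that generates a values override for a runner scale set. The key names (`githubConfigUrl`, `githubConfigSecret`, `minRunners`, `maxRunners`, `template.spec`) reflect my understanding of the `gha-runner-scale-set` chart and should be checked against the chart version in use. The node label `perfpod.io/node-type` and the secret name are made up for illustration.

```python
import json


def runner_set_values(config_url, secret_name, node_selector, max_runners=1):
    """Build a values override that pins runner pods to labelled nodes."""
    return {
        "githubConfigUrl": config_url,
        "githubConfigSecret": secret_name,
        "minRunners": 0,  # scale to zero when no jobs are queued
        "maxRunners": max_runners,
        "template": {
            "spec": {
                # Standard pod-spec field: schedule only on matching nodes.
                "nodeSelector": dict(node_selector),
                "containers": [
                    {
                        "name": "runner",
                        "image": "ghcr.io/actions/actions-runner:latest",
                        "command": ["/home/runner/run.sh"],
                    }
                ],
            }
        },
    }


if __name__ == "__main__":
    values = runner_set_values(
        config_url="https://github.com/perfpod/memory-collector",
        secret_name="runner-github-secret",  # hypothetical secret name
        node_selector={"perfpod.io/node-type": "intel-baremetal"},  # hypothetical label
    )
    # JSON is also valid YAML, so this output should be usable as a values file.
    print(json.dumps(values, indent=2))
```

Generating one such file per node type (for example, one for Intel and one for AMD) would give one scale set per node type, each selectable from a workflow by its own `runs-on` name.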

yonch commented 2 days ago

This does not describe how to run the test workload on the test node, only how to provision and autoscale nodes.

Brainstorming ideas: