Consistent CI environment

kkaempf commented 2 years ago

Our current CI / workers / runners setup is somewhat 'spread' across internal and AWS machines. We should try to have it all in one place and properly documented.

Paul, Itxaka, Julien, and Loic - phrase this issue correctly and add acceptance criteria

Itxaka commented 2 years ago

elemental-toolkit is using arm64 workers (1 in AWS, 2 internal SUSE) for building arm64 packages; the rest of the jobs use GitHub workers.
elemental-cli is using ONLY GitHub workers
elemental-operator is using ONLY GitHub workers
elemental is using GitHub workers and internal workers for the end2end test (https://github.com/rancher/elemental/blob/main/.github/workflows/e2e.yaml) and the release job (https://github.com/rancher/elemental/blob/main/.github/workflows/release.yaml), but I think the release job should NOT use the build-host. It's just that the release job is not doing anything currently and it may need to either be dropped or reworked to release something... not sure what.

So I'm guessing this is about consolidating everything to use either GitHub workers or cloud workers when needed. This can be done easily for toolkit; not sure about the elemental end2end tests, as those use VMs to test everything...

ldevulder commented 2 years ago

but I think the release job should NOT use the build-host.

It was to speed up the build process, but it can use a GH runner instead of a self-hosted one, yes.

Itxaka commented 2 years ago

It was to speed up the build process, but it can use a GH runner instead of a self-hosted one, yes.

I think we first need to check what we are gonna release as part of the elemental releases. If it's just the OCI artifacts, then we can just use a GitHub worker, as that would take about 5 minutes.

Itxaka commented 2 years ago

arm64 workers are available in GCE. I created an instance template called elemental-ci-runner-arm64-v2 which contains the bare minimum to support creating VMs that can run the runner. The template has a script attached to install dependencies, and the keys for me, @davidcassany and @fgiudici are also injected into any machine created from the template.

The only thing needed after creating an instance from that template is to ssh in and download+run the worker service. I tested 1 instance with those steps (available on github -> runners -> add runner) and it results in a worker that runs the build jobs properly.
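A rough sketch of those steps in Python, only to make the sequence concrete. The instance name, zone and repository URL are placeholders and not values from this thread; the runner commands assume the standard `config.sh` / `svc.sh` scripts shipped in the GitHub Actions runner tarball and would be run over ssh inside the unpacked runner directory.

```python
import shlex

# Template name taken from the comment above.
TEMPLATE = "elemental-ci-runner-arm64-v2"


def create_instance_cmd(name: str, zone: str, template: str = TEMPLATE) -> list[str]:
    """Build the gcloud command that creates a VM from the instance template."""
    return [
        "gcloud", "compute", "instances", "create", name,
        f"--source-instance-template={template}",
        f"--zone={zone}",
    ]


def runner_setup_cmds(repo_url: str, token: str) -> list[list[str]]:
    """Commands to run on the VM, inside the unpacked actions-runner directory."""
    return [
        # Register the runner against the repository without interactive prompts.
        ["./config.sh", "--url", repo_url, "--token", token, "--unattended"],
        # Install and start the runner as a service.
        ["sudo", "./svc.sh", "install"],
        ["sudo", "./svc.sh", "start"],
    ]


if __name__ == "__main__":
    # Placeholder name, zone, repository and token: replace with real values.
    print(shlex.join(create_instance_cmd("arm64-runner-1", "us-central1-a")))
    for cmd in runner_setup_cmds("https://github.com/OWNER/REPO", "<TOKEN>"):
        print(shlex.join(cmd))
```

The script only prints the commands, so it can be reviewed before anything is actually created.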

From my point of view GCE supports our use case for the arm64 workers should we decide to move in there, which I know @ldevulder was interested in.

Price of the machine would be $108 per month.

Itxaka commented 2 years ago

Looks like GKE clusters are also available, which could be a good way of deploying workers and saving money, as the price is per pod per hour, which seems to be much cheaper than a full VM.

The problem is, as usual, that we need to set a TOKEN_ID for the GitHub runner, and we either add it manually or create automation in a custom image to auto-get the TOKEN_ID. That requires a GitHub PAT in the cluster config, but it has the potential to let us autoscale at times of a lot of traffic to the workers and scale down when there is none...
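For illustration, the "auto-get" part could look roughly like the sketch below. It assumes TOKEN_ID refers to the runner registration token and uses the GitHub REST endpoint for creating a repository runner registration token; the owner, repo and PAT values are placeholders, and how the PAT reaches the pod (secret, env var) is left open.

```python
import json
import urllib.request

API = "https://api.github.com"


def registration_token_request(owner: str, repo: str, pat: str) -> urllib.request.Request:
    """Build the POST request that asks GitHub for a runner registration token."""
    url = f"{API}/repos/{owner}/{repo}/actions/runners/registration-token"
    return urllib.request.Request(
        url,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {pat}",
        },
    )


def fetch_registration_token(owner: str, repo: str, pat: str) -> str:
    """Send the request and return the short-lived token from the JSON response."""
    req = registration_token_request(owner, repo, pat)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["token"]


if __name__ == "__main__":
    # Placeholder values: no network call is made here, the request is only built.
    req = registration_token_request("OWNER", "REPO", "<PAT>")
    print(req.get_method(), req.full_url)
```

A container entrypoint could call `fetch_registration_token` at startup and hand the result to the runner's configuration step, so no token has to be added by hand.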

Azure containers seem to work the same way with AKS.

This option seems to be more expensive (it seems like they are more suited to being brought up and down on demand, i.e. not sustained use), and it requires development on our side to set it up right for bringing those pods up on demand.
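The on-demand vs. sustained trade-off comes down to a break-even number of busy hours per month. A small sketch of that arithmetic, using the $108 per month VM figure from this thread; the per-pod hourly rate below is a made-up placeholder, not a quoted GKE or Azure price.

```python
# Monthly VM price quoted earlier in this thread.
VM_MONTHLY_USD = 108.0


def pod_monthly_cost(rate_per_hour: float, busy_hours: float) -> float:
    """Cost of per-pod, per-hour billing for the hours the runner is actually up."""
    return rate_per_hour * busy_hours


def break_even_hours(rate_per_hour: float, vm_monthly: float = VM_MONTHLY_USD) -> float:
    """Busy hours per month above which the fixed-price VM becomes cheaper."""
    if rate_per_hour <= 0:
        raise ValueError("rate_per_hour must be positive")
    return vm_monthly / rate_per_hour


if __name__ == "__main__":
    rate = 0.30  # hypothetical per-pod hourly price, for illustration only
    hours = break_even_hours(rate)
    print(f"pods are cheaper below roughly {hours:.0f} busy hours per month")
```

Below the break-even point the per-pod billing wins; above it, a sustained VM is the cheaper option.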

Itxaka commented 2 years ago

Azure Arm instances are also available, so it's mostly up to us to decide where to move everything. I have no preference one way or another.

Itxaka commented 2 years ago

@ldevulder could you comment on your preferred cloud operator in case the end2end tests should need to move on down the line? Same with @juadk for the UI tests.

I need to create a new arm64 runner and would like to know in which operator it needs to go :)

ldevulder commented 2 years ago

@ldevulder could you comment on your preferred cloud operator in case the end2end tests should need to move on down the line? Same with @juadk for the UI tests.

I prefer GCP over Azure personally. I saw a lot of sporadic issues on Azure compared to GCP.

juadk commented 2 years ago

Same for me, I'm not in love with Azure... I would go with GCP as well.

Itxaka commented 2 years ago

Nice, that settles it: GCE it is. Thanks folks!

Itxaka commented 2 years ago

The AWS runner has been torn down and the GCE runner has been set up. Several jobs have been triggered and all of them passed correctly.

Itxaka commented 2 years ago

@juadk @ldevulder I'm wondering if you folks are gonna deploy the needed VMs for the e2e/UI jobs or am I supposed to do so?

In case you want me to do it, I would need some specs here like OS, vCPU, MEM, Disk space and speed. Cheers!

ldevulder commented 2 years ago

@juadk @ldevulder I'm wondering if you folks are gonna deploy the needed VMs for the e2e/UI jobs or am I supposed to do so?

No, we will take care of this. But as I said to @davidcassany yesterday it's not high priority for me, we still have some E2E tests to (re)add and we have a deadline ;-). I will try to do this maybe in 2 weeks.

Itxaka commented 2 years ago

ok cool!

ldevulder commented 2 years ago

FYI I will work on this for E2E tests week 38.

ldevulder commented 2 years ago

This will be followed up in issue https://github.com/rancher/elemental/issues/336.

rancher / elemental

Consistent CI environment #301