kubernetes / test-infra

Test infrastructure for the Kubernetes project.
Apache License 2.0

Questions on migrating kubespray CI to test-infra #31351

Open VannTen opened 7 months ago

VannTen commented 7 months ago

Hi,

We're currently evaluating migrating the CI of the kubespray project to test-infra from gitlab-ci, and I have some questions about what we can and cannot do with prow and test-infra, so we can decide whether it can work:

Currently, we're handling jobs in gitlab ci stages, because some take a lot of time and we try to fail early.

I understand prow does not have a job dependency concept, so I have two possible strategies in mind:

Regarding tekton pipelines:

Some of our jobs currently provision kubevirt VM (https://github.com/kubernetes-sigs/kubespray/blob/master/tests/cloud_playbooks/roles/packet-ci/templates/vm.yml.j2) to test kubespray runs on them. Is there something in prow/test-infra which can do that for us ? (Didn't find anything but well it does not hurt to ask).

Regarding compute resources:

That's a lot of different questions in different directions, but I'm trying to figure things out, so sorry if this is a bit unclear.

Related issue on kubespray : kubernetes-sigs/kubespray#10682

Cc @floryut @ant31 from kubespray

Thanks

VannTen commented 7 months ago

/sig testing /sig cluster-lifecycle

aojea commented 7 months ago
  • What's the policy on amount of resources which a project can use ? We have rather big CI runs so this might be a concern. I understand that we tell Prow to execute jobs in a specific cluster, can we bring our own ? Should we ?

@BenTheElder @ameukam @upodroid for resource usage

aojea commented 7 months ago

/sig k8s-infra

BenTheElder commented 7 months ago

So far really large testing is basically only done for scale testing the core Kubernetes project.

These are all relative terms though.

SIG K8s Infra owns the actual resource policy, which is not well defined yet, but I can speak to it a little as a lead in both SIGs, can you be more specific about what you're intending to run?

We just went through measures this year to reduce spend, and we're resuming the process of finishing moving lingering CI/resources out of google.com projects into kubernetes.io on GCP in particular.

Cancel presubmits on any failure of one of the presubmits. Is that possible / does it work well with things like /retest ?

No, this is not supported, please don't put lots of expensive testing in presubmit. You should only test commonly broken workflows in presubmit and the rest in postsubmit / periodic.

Regarding tekton pipelines:

Not supported on prow.k8s.io, sorry.

Some of our jobs currently provision kubevirt VM (https://github.com/kubernetes-sigs/kubespray/blob/master/tests/cloud_playbooks/roles/packet-ci/templates/vm.yml.j2) to test kubespray runs on them. Is there something in prow/test-infra which can do that for us ? (Didn't find anything but well it does not hurt to ask).

No, please do not try to use kubevirt on our clusters (this is why we created KIND), you'll need to spin up remote machines.

neolit123 commented 7 months ago

kubespray has been a bit of a black sheep, where the project over the years has drifted away from the commons - it has its own CI and Zoom account, not using the community ways...that's not necessarily bad, but rather peculiar.

We're currently evaluating migrating the CI of the kubespray project to test-infra from gitlab-ci,

could you explain what might be the reasons for such a migration? (i don't think i saw them listed in the OP)

VannTen commented 7 months ago

can you be more specific about what you're intending to run?

Here is a typical PR run: https://gitlab.com/kargo-ci/kubernetes-sigs-kubespray/-/pipelines/1091836466 What takes the most time is the job deploy-part2, which basically spins up VMs and plays kubespray on it with various configuration (network_plugin, base OS, container runtime).

No, this is not supported, please don't put lots of expensive testing in presubmit. You should only test commonly broken workflows in presubmit and the rest in postsubmit / periodic.

Postsubmits and periodics do not guard merging in the main branch, is that correct ? Understood though.

Regarding tekton pipelines:

Not supported on prow.k8s.io, sorry.

ACK. Just one question, does that depend on the service cluster (where prow itself runs) or the build cluster ?

No, please do not try to use kubevirt on our clusters (this is why we created KIND), you'll need to spin up remote machines.

Ok. But is bringing our own cluster as a Prow "build cluster" a possibility, or not at all ?


We're currently evaluating migrating the CI of the kubespray project to test-infra from gitlab-ci,

could you explain what might be the reasons for such a migration? (i don't think i saw them listed in the OP)

Sorry for that, I listed them in the issue on kubespray side. Basically:

BenTheElder commented 7 months ago

Postsubmits and periodics do not guard merging in the main branch, is that correct ? Understood though.

Right, if you find an issue you can revert. We have testgrid.k8s.io to aid in that. If you're frequently reverting because of changes not caught in presubmit, consider presubmit.

But we consider for example 5,000 node scale tests as an extreme example. We find bugs surfaced in those tests and yet they do not gate all PR merges because this is unreasonably expensive.

ACK. Just one question, does that depend on the service cluster (where prow itself runs) or the build cluster ?

The project doesn't have anyone maintaining support for this. We support prow decorated jobs.

Ok. But is bringing our own cluster as a Prow "build cluster" a possibility, or not at all ?

K8S Infra is only using community managed resources going forward because we've been bitten repeatedly with issues depending on third party controlled accounts etc.

We do not support "bring your own", if anyone wants to help fund the project with assets they can talk to the CNCF about setting up something like https://www.cncf.io/google-cloud-recommits-3m-to-kubernetes/ which SIG K8s Infra administers and SIG Testing uses to run CI.

What takes the most time is the job deploy-part2, which basically spins up VMs and plays kubespray on it with various configuration (network_plugin, base OS, container runtime).

This looks like an expansive test matrix as large as we'd typically do in periodic testing only, not on every PR.

It's difficult to understand what sort of expense we're talking about here though, just seeing the gitlab pipeline names. Admittedly I haven't really had time yet to try to uncover what exactly they all run and how resource heavy they are.

Generally when sig subprojects have started using our CI in the past they've had relatively minimal needs, some cheap unit tests and so on. We have not had a new distro / deployment tool onboard in a long time since maybe cluster API so there's not a lot of precedent here.

dims commented 7 months ago

@VannTen Can you please share a bit of the background on the gitlab infra itself? Who paid for it? who set it up? For some of us this is fresh unforeseen news sadly!

VannTen commented 7 months ago

Postsubmits and periodics do not guard merging in the main branch, is that correct ? Understood though.

Right, if you find an issue you can revert. We have testgrid.k8s.io to aid in that. If you're frequently reverting because of changes not caught in presubmit, consider presubmit.

But we consider for example 5,000 node scale tests as an extreme example. We find bugs surfaced in those tests and yet they do not gate all PR merges because this is unreasonably expensive. ... This looks like an expansive test matrix as large as we'd typically do in periodic testing only, not on every PR.

Yeah, I think it is.

Ok. So in our cases, for example, that would translate to the test matrix moving to periodics, and keeping one/some default configuration tests in presubmits ? Maybe we could also use run_if_changed to target some configuration ? (network plugin when corresponding role was touched, etc)
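To make the run_if_changed idea concrete: as I understand it, Prow treats run_if_changed as a regular expression and triggers the job when at least one changed file path in the PR matches it. The sketch below mimics that selection in Python. The job names and path patterns are hypothetical, not taken from the kubespray repository or any real Prow config.

```python
import re

# Hypothetical job names and path patterns, for illustration only.
JOB_TRIGGERS = {
    "pull-kubespray-calico": r"^roles/network_plugin/calico/",
    "pull-kubespray-cilium": r"^roles/network_plugin/cilium/",
    "pull-kubespray-containerd": r"^roles/container-engine/containerd/",
}


def jobs_to_run(changed_files, triggers=JOB_TRIGGERS):
    """Return the sorted job names whose pattern matches at least one changed file."""
    selected = []
    for job, pattern in triggers.items():
        regex = re.compile(pattern)
        # A job is selected as soon as any changed path matches its pattern.
        if any(regex.search(path) for path in changed_files):
            selected.append(job)
    return sorted(selected)
```

With these assumed patterns, a PR touching only the calico role would select the calico job alone, and a docs-only PR would select none of the targeted jobs.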

It's difficult to understand what sort of expense we're talking about here though, just seeing the gitlab pipeline names. Admittedly I haven't really had time yet to try to uncover what exactly they all run and how resource heavy they are.

The jobs in deploy-part2 typically run for around 40 minutes and use 1 to 3 VMs (using kubevirt) (see here + the job itself). I can't find the VM size; it's defined as small for kubevirt.

Given there are around 20-25 such configurations, that adds up.
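As a rough back-of-the-envelope check of "that adds up", the figures above can be multiplied out. This is only a sketch: it assumes every job runs for the full 40 minutes and ignores runner overhead, retries, and VM size.

```python
def vm_minutes(configs, vms_per_job, minutes_per_job=40):
    """VM-minutes consumed by one pipeline run of the deploy-part2 matrix."""
    return configs * vms_per_job * minutes_per_job


# Lower bound: 20 configurations with 1 VM each.
low = vm_minutes(configs=20, vms_per_job=1)
# Upper bound: 25 configurations with 3 VMs each.
high = vm_minutes(configs=25, vms_per_job=3)

print(f"{low} to {high} VM-minutes per run ({low / 60:.1f} to {high / 60:.1f} VM-hours)")
# prints "800 to 3000 VM-minutes per run (13.3 to 50.0 VM-hours)"
```

So a single PR run would cost somewhere between roughly 13 and 50 VM-hours under these assumptions, before counting re-runs on each push.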

Generally when sig subprojects have started using our CI in the past they've had relatively minimal needs, some cheap unit tests and so on. We have not had a new distro / deployment tool onboard in a long time since maybe cluster API so there's not a lot of precedent here.


Can you please share a bit of the background on the gitlab infra itself? Who paid for it? who set it up? For some of us this is fresh unforeseen news sadly!

I'll share what I can, I don't have all the information or history.

The integration github <-> gitlab-ci was done by @ant31 if I'm correct, and uses https://github.com/failfast-ci/failfast-api

The infrastructure was provided by Packet (which is now Equinix Metal), and I think it's on CNCF cloud credits. The PRs kubernetes-sigs/kubespray#4538 kubernetes-sigs/kubespray#4537 were made by @woopstar, but I'm not exactly sure who handled the cluster setup and the interaction with Packet/Equinix. Currently, at least @yankay and @floryut have access to it and fix the occasional breakage.

(If some of the people mentioned have more info, feel free to correct or clarify :+1: )

ant31 commented 7 months ago

Hi all,

The background is that kubespray started with Kubernetes 1.0, so there was little around to help the community. The community CI used to be Travis-CI. Unfortunately, kubespray was using too many resources, and we had to move out.

CNCF allocated us a few bare-metal nodes (and still does) to run our pipelines.

We are deploying and maintaining those nodes ourselves. They are running the Gitlab-runners and we're deploying most of the VMs (via kubevirt) on them, too. We also have, or used to have, a few jobs deploying k8s on GCE VMs (to test the cloud settings).

Why Gitlab-ci?

In 2016-2017, gitlab-ci was a good alternative: it combined low maintenance (we only need to deploy the runner) and checked most of the requirements (complex pipeline, with manual jobs and stages). We filled in the missing github integration and features with https://github.com/failfast-ci/failfast-api

We create empty VMs to mimic end-user environments:

Moving to prow would remove the need to maintain bare-metal nodes, failfast-ci project, and a few other benefits, but we must be able to configure an equivalent pipeline.

neolit123 commented 7 months ago

This looks like an expansive test matrix as large as we'd typically do in periodic testing only, not on every PR.

same for kOps? https://testgrid.k8s.io/sig-cluster-lifecycle-kops

We create empty VMs to mimic end-user environments:

the test matrix with CNI, distro is redundantly complex (see the kOps case above). kubeadm and CAPI don't do that, it's too much...and a bit crazy. if CNI foo decided to regress the k8s infra should not pay $$$ for it. for distros, there are a 100 Linux flavors.

i wouldn't want us to say -1 on kubespray if they want to move to prow, but if i could, i'd happily take 50% of the test bandwidth of kOps and give it to kubespray.

upodroid commented 7 months ago

The bulk of the CI that we run that requires testing on a real virtual machine involves creating VMs on AWS/GCP. We have tooling that handles that for us and you would need to adopt it.

kops is a good example of what you'll need to do to adopt the Kubernetes CI.

Here are a couple of examples:

VannTen commented 7 months ago

On Mon, Dec 04, 2023 at 02:13:02PM -0800, Benjamin Elder wrote:

Postsubmits and periodics do not guard merging in the main branch, is that correct ? Understood though.

Right, if you find an issue you can revert. We have testgrid.k8s.io to aid in that. If you're frequently reverting because of changes not caught in presubmit, consider presubmit.

Another question about that: what's the typical frequency for periodics ? Daily, weekly ?

Do other projects have some strategy in place to avoid breakage in their main branch ? Having a separate "dev" branch for instance, only merged in the main branch at the same frequency as periodics run ?

upodroid commented 7 months ago

Another question about that: what's the typical frequency for periodics ? Daily, weekly ?

daily for the latest release, weekly for older supported releases or rare scenarios.

For kubespray in particular, I would test 2 or 3 scenarios for a proper e2e test in presubmits(runs on every push to a PR) and then run the e2e test matrix once a day or twice at most.
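One way to picture that split is to enumerate the full matrix and mark a small fixed subset as presubmit scenarios, leaving the whole matrix to the daily periodic run. The sketch below is illustrative only: the axis values and the choice of presubmit scenarios are hypothetical, not the real kubespray test matrix.

```python
from itertools import product

# Hypothetical axis values; the real kubespray matrix lives in its CI config.
NETWORK_PLUGINS = ["calico", "cilium", "flannel"]
BASE_OSES = ["ubuntu", "debian", "rocky"]
RUNTIMES = ["containerd", "crio"]

# Hypothetical pick of "default" scenarios that would gate every PR.
PRESUBMIT_SCENARIOS = {
    ("calico", "ubuntu", "containerd"),
    ("cilium", "debian", "containerd"),
}


def split_matrix():
    """Return (presubmit, periodic) scenario lists."""
    full = list(product(NETWORK_PLUGINS, BASE_OSES, RUNTIMES))
    presubmit = [s for s in full if s in PRESUBMIT_SCENARIOS]
    # The periodic run covers everything, including the presubmit scenarios.
    periodic = full
    return presubmit, periodic
```

With these assumed axes, the periodic run covers 18 scenarios while each PR push only pays for 2.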

BenTheElder commented 7 months ago

same for kOps? https://testgrid.k8s.io/sig-cluster-lifecycle-kops

This is overlooking the "not on presubmit" aspect of my comment. I'm well aware of the kops test matrix, that's exactly what I was thinking of.

That matrix is actually designed to minimally identify which aspect is broken and the tooling for it is in this repo.

i wouldn't want us to say -1 on kubespray if they want to move to prow, but if i could, i'd happily take 50% of the test bandwidth of kOps and give it to kubespray.

I don't think that's a reasonable dichotomy. kops has been using these resources in good faith as a long time participant in upstream test tooling, infra, etc.

Also, we're (SIG Testing + SIG K8s Infra) planning to use kops to replace kube-up because we desperately need to eliminate kube-up.sh and we need to be flexible in AWS+GCP spend, so we certainly don't want to reduce test coverage. (There is a KEP in flight)

BenTheElder commented 7 months ago

Another question about that: what's the typical frequency for periodics ? Daily, weekly ?

Do other projects have some strategy in place to avoid breakage in their main branch ? Having a separate "dev" branch for instance, only merged in the main branch at the same frequency as periodics run ?

Reasonably frequent on the main branch (multiple times per day), much less frequent on stable release branches with frequency decreasing for older releases (and none for out of support releases).

neolit123 commented 7 months ago

same for kOps? https://testgrid.k8s.io/sig-cluster-lifecycle-kops

This is overlooking the "not on presubmit" aspect of my comment. I'm well aware of the kops test matrix, that's exactly what I was thinking of.

i agree with the comments from earlier that presubmit should be minimal and fail-fast. no matrix testing.

That matrix is actually designed to minimally identify which aspect is broken and the tooling for it is in this repo.

i wouldn't want us to say -1 on kubespray if they want to move to prow, but if i could, i'd happily take 50% of the test bandwidth of kOps and give it to kubespray.

I don't think that's a reasonable dichotomy. kops has been using these resources in good faith as a long time participant in upstream test tooling, infra, etc.

it's not, but it's anecdotally hinting of fairness and non-bias. kubespray should not be denied bandwidth just because they are late for the party.

Also, we're (SIG Testing + SIG K8s Infra) planning to use kops to replace kube-up because we desperately need to eliminate kube-up.sh and we need to be flexible in AWS+GCP spend, so we certainly don't want to reduce test coverage. (There is a KEP in flight)

jobs such as https://testgrid.k8s.io/sig-cluster-lifecycle-kops#kops-grid-cilium-deb10-k27 would not be contributing much to the kube-up replacement picture. such jobs are effectively testing a user deployment scenario. they just guarantee to maintainers and users that a certain deployment scenario works, not that kOps itself works. i don't want to speak behind the intent of these jobs, though.

neolit123 commented 7 months ago

i don't think we have a way to measure how much $$ is generated per SIG, but i wild guess that SIG CL is a major contributor to our budget reduction due to how much subprojects and e2e test jobs we have... i would not be surprised if at some point we have to do some sort of evaluation and ask maintainers to limit how much they test.

BenTheElder commented 7 months ago

it's not, but it's anecdotally hinting of fairness and non-bias. kubespray should not be denied bandwidth just because they are late for the party.

The problem is moreso that we need to determine if we have bandwidth to spin things up (we probably don't at the moment -- AWS Spend is hitting the budget cap, but we're going to optimize costs) and we've already had to cut down on spend like scale testing this year unfortunately due to lack of options.

We shouldn't do more cutting of existing usage until we have a policy in place. (Though we can run equivalently with less cost e.g. committed use discounts). We need to have a framework in place before we start kicking things off, we haven't done that yet (because we've been too busy reacting to the ongoing issues).

As-is kubespray has running CI without us cutting any other CI off, so we don't have to choose between projects yet.

i don't think we have a way to measure how much $$ is generated per SIG, but i wild guess that SIG CL is a major contributor to our budget reduction due to how much subprojects and e2e test jobs we have... i would not be surprised if at some point we have to do some sort of evaluation and ask maintainers to limit how much they test.

This is a tricky topic, we have a lot of jobs that aren't really "benefitting" a single SIG.

would not be contributing much to the kube-up replacement picture. such jobs are effectively testing a user deployment scenario.

To that point, the cloud provider testing is specific to a particular vendor ... it's not going to be that simple to dismiss categories of testing. We have similar compat testing with cri-o and containerd. Ideally the project should select testing that benefits broadly but we do have to run with actual implementations eventually.

BenTheElder commented 7 months ago

So, I think we can run kubespray CI on prow, but it remains an open question how best to enable the test environments you need and how much we can afford.

I don't think that's kubevirt, we use managed k8s clusters because we have limited bandwidth to maintain these things and nested virt isn't enabled.

We can start with something small like unit tests so they can get familiar with prow and we don't need to worry too much about the resources needed for that.

For e2e testing:

When other projects spin up external assets they do so by renting resources through https://github.com/kubernetes-sigs/boskos typically through integration in https://github.com/kubernetes-sigs/kubetest2 to ensure that they will be automatically cleaned up if the CI job is abruptly terminated or otherwise fails to clean-up after itself.

This aspect is pretty important, I'd ask that we make sure boskos is used if / when e2e tests are setup. CAPI, kops, Kubernetes etc use this.
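To illustrate why leasing matters, here is a toy in-memory model of the idea: a job leases a resource, and a reaper returns expired leases to the pool if the job died without releasing it. This is not the Boskos API or its client library. The class, its methods, and the lease duration are all hypothetical, and it skips details such as cleaning a resource before reuse.

```python
import time
from contextlib import contextmanager


class LeasePool:
    """Toy in-memory resource pool; NOT the Boskos API."""

    def __init__(self, names, lease_seconds=600):
        self.free = set(names)
        self.leased = {}  # resource name -> lease expiry timestamp
        self.lease_seconds = lease_seconds

    def acquire(self, now=None):
        now = time.time() if now is None else now
        self.reap(now)  # reclaim anything abandoned before handing out more
        if not self.free:
            raise RuntimeError("no free resources")
        name = sorted(self.free)[0]
        self.free.remove(name)
        self.leased[name] = now + self.lease_seconds
        return name

    def release(self, name):
        # Releasing an unknown or already released name is a no-op.
        if self.leased.pop(name, None) is not None:
            self.free.add(name)

    def reap(self, now=None):
        """Return expired leases to the pool, as if the owning job had died."""
        now = time.time() if now is None else now
        for name, expiry in list(self.leased.items()):
            if expiry <= now:
                self.release(name)

    @contextmanager
    def lease(self):
        name = self.acquire()
        try:
            yield name
        finally:
            self.release(name)
```

A well-behaved job would use `with pool.lease() as name:` and release on exit. The reaper covers the case the thread is worried about, where the job is terminated before it can clean up.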

BenTheElder commented 7 months ago

aside re: freeing up resources for CI etc ... we dug into our expenditures in the bi-weekly k8s infra call yesterday and the main outcome is going on here https://github.com/kubernetes/k8s.io/issues/6165

I think we can easily run things like build/unit test/lint on prow already but it will take more work to set up a suitable environment for the e2e tests. We haven't used packet/equinix from prow before but that might be an option for running essentially the same e2e environment.

What if we ran a build cluster on equinix w/ kubevirt? Would the kubespray team be up for maintaining this? I think prow as-is can handle scheduling to a cluster like this fine.

We probably need to discuss options more between k8s infra and sig testing calls.

ant31 commented 7 months ago

What if we ran a build cluster on equinix w/ kubevirt? Would the kubespray team be up for maintaining this? I think prow as-is can handle scheduling to a cluster like this fine.

Yes, it would work I think. I don't know prow well enough to know what would need to change, if anything. So I'll describe the current CI: new PR --> Gitlab CI triggered --> Gitlab schedules jobs on gitlab-runners deployed on equinix
A job is started and the runner executes the following:

  1. Create kubevirt VM via kubectl apply
  2. Wait for VM to be up
  3. Deploy kubernetes on the VM
  4. Test the cluster
  5. Destroy VM
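The five steps above can be sketched as a small driver that always runs the destroy step once a VM has been created, even if deployment or testing fails. This is a minimal sketch, not the actual kubespray runner script: the function and its parameters are hypothetical, and the callables are injected so that no real kubectl or kubevirt call is assumed.

```python
def run_job(create_vm, wait_for_vm, deploy, test, destroy_vm):
    """Run the five steps in order; always destroy the VM once it was created."""
    vm = create_vm()        # step 1: e.g. kubectl apply of the VM manifest
    try:
        wait_for_vm(vm)     # step 2: wait until the VM is up
        deploy(vm)          # step 3: deploy kubernetes on the VM
        test(vm)            # step 4: test the resulting cluster
    finally:
        destroy_vm(vm)      # step 5: runs even when a previous step raised
```

The try/finally only helps while the process itself keeps running. If the job is killed abruptly, step 5 never executes, so some external cleanup mechanism is still needed.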

if there's an equivalent of gitlab-runner for prow(prow-runner?) deployed on that cluster, then it would use the resources that kubespray has already without adding loads/expenses on k8s-infra

As nice to have, maybe step 1,2 and 5 could be handled by prow so it's easily reproduced by all projects (to create kubevirt VM). In all cases it's not a blocker.

VannTen commented 7 months ago

What if we ran a build cluster on equinix w/ kubevirt? Would the kubespray team be up for maintaining this?

I don't know about the other kubespray contributors, but I could participate. I have some dedicated time for upstream work + the down-time, and my main occupation is maintaining clusters anyway.

VannTen commented 6 months ago

We haven't used packet/equinix from prow before but that might be an option for running essentially the same e2e environment.

What if we ran a build cluster on equinix w/ kubevirt? Would the kubespray team be up for maintaining this? I think prow as-is can handle scheduling to a cluster like this fine.

In that case, would the same constraints (mainly, moving stuff to periodics) apply ?

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

VannTen commented 3 months ago

/remove-lifecycle stale

BenTheElder commented 1 month ago

In that case, would the same constraints (mainly, moving stuff to periodics) apply ?

I don't think we have transparent budget info or credentials for equinix in SIG K8s infra currently so it's hard to say, AFAIK that's similarly ~CNCF, like your current gitlab instance, rather than Kubernetes owned/managed.

cc @dims who has the only Kubernetes related equinix infra I've previously seen (cs.k8s.io, a single machine AFAIK).

BenTheElder commented 1 month ago

(We are still planning the migration of prow control plane to k8s infra this year, amongst other things, I'm personally a bit over-extended WRT k8s infra but I'm not the only lead, I know Arnaud is out for a while currently)