kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

Expand test framework to include upstream k8s testing #2826

Closed: CecileRobertMichon closed this issue 2 years ago

CecileRobertMichon commented 4 years ago

⚠️ Cluster API maintainers can ask to turn an issue-proposal into a CAEP when necessary, this is to be expected for large changes that impact multiple components, breaking changes, or new large features.

i.e. use CAPI to test Kubernetes

Goals

  1. Use CAPX as deployer to test upstream k/k changes
  2. Use CAPX as deployer to test k8s-sigs projects such as out-of-tree cloud providers
  3. Run upstream k8s conformance tests against CAPI
  4. Encourage reuse across different infra providers instead of maintaining bash scripts in each provider repo (right now there are scripts in CAPG, CAPA, and CAPZ, with significant overlap). We intend to extend the current test/framework to allow this proposal to be implemented there.

Non-Goals/Future Work

  1. Add a kubetest deployer for CAPI.
  2. Run the tests as PR gates on k/k.

User Story

As a developer, I would like to run k/k E2E tests on a CAPI cluster to test changes made to a k8s component.

Detailed Description

NOTE: this is a very rough draft based on the working group meeting (recording here). We will evolve this as we continue the discussion with the wider community and come up with implementation details. Just hoping to get the discussion started with this issue.

  1. Build k/k binaries (e2e.test, kubectl, ginkgo)

  2. (optionally) build k8s component images from a private SHA (if the images aren't already available on a registry)

  3. Create a cluster with custom k8s binaries & container images. In order to use a custom k8s build (for example, k/k master), there are a few different options:

    • Build a custom image with image-builder as part of CI and use that image in the cluster
      • pros: can reuse the image for multiple nodes
      • cons: time consuming; building a VM image with packer takes ~20 minutes
    • Use an existing image (possibly with a different k8s version) and pass a PreKubeadmCommand script in the KubeadmConfig to replace the k8s version with the one we want (a rough sketch of this approach is included after this list).
      • pros: doesn't require building an image, faster
      • cons: we have to do this for every VM; hacky (a bash script might be error prone); different from the user experience with CAPI (with images)
    • Modify CAPI infra providers to take custom k8s component images
      • pros: can be reused more easily by users not familiar with the project and CI; doesn't require the preKubeadm "hack" script or reusing a VM image.
      • cons: more work and changes involved.
  4. Run test suites: k/k E2E, cloud provider E2E, other k8s-sigs E2E, etc.
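
To make the PreKubeadmCommand option in step 3 more concrete, here is a minimal sketch of what the KubeadmConfig portion could look like. The version lookup, artifact URLs, and /usr/bin paths are assumptions (the usual dl.k8s.io/ci layout and a node image that keeps the binaries under /usr/bin), not a tested template:

kubeadmConfigSpec:
    preKubeadmCommands:
      # Single script block so CI_VERSION stays in scope for the whole script.
      - |
        # Resolve the CI build to install (or substitute a version built from the PR under test).
        CI_VERSION="$(curl -sSL https://dl.k8s.io/ci/latest.txt)"
        # Overwrite the binaries baked into the VM image with the CI build;
        # assumes internet access and the dl.k8s.io/ci artifact layout.
        for bin in kubeadm kubelet kubectl; do
          curl -sSLo "/usr/bin/${bin}" "https://dl.k8s.io/ci/${CI_VERSION}/bin/linux/amd64/${bin}"
          chmod +x "/usr/bin/${bin}"
        done
        # Restart the kubelet so the new binary is picked up before kubeadm init/join runs.
        systemctl restart kubelet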

Related to #2141, which might overlap in implementation details but has different objectives: #2141 aims to test whether CAPI/CAPZ/CAPA/CAPV/CAPG/etc. passes k8s conformance, whereas this proposal is about using CAPI as a dev tool to test k8s and k8s-sigs changes.

/kind proposal

cc @dims @vincepri @alexeldeib @fabriziopandini @ritazh @chewong @randomvariable @rbitia

vincepri commented 4 years ago

Thanks for the write-up Cecile!

I'd expand the 4th goal, mentioning that we intend to extend the current test/framework to allow this proposal to be implemented there.

Modify capi infra providers to take custom k8s component images

Would this entail changes to image-builder to take some custom scripts that can set up images in a custom fashion? We should probably tackle this separately; it's a really interesting idea and would make the images generic, although I assume folks will probably need internet access, so it might not work in every environment.

timothysc commented 4 years ago

Modify capi infra providers to take custom k8s component images pros: can be reused more easily by users not familiar with the project and CI, doesn't require the preKubeadm "hack" script, or reusing a VM image.

+1 ^ this is generally more useful for testing, but I still see a problem with the kubelet. You can override almost everything else, but the kubelet running on the base OS built by the image builder is not easily replaced unless you combine an rpm/deb update/install via cloud-init.

CecileRobertMichon commented 4 years ago

@vincepri @timothysc what I meant by

Modify capi infra providers to take custom k8s component images

is that instead of using preKubeadmCommand to pass in the script that overrides the k8s version, we add a new property, maybe under a feature gate, to pass in a "custom" k8s version (what we call CI_VERSION in the script above) or custom k8s component images, and run a script to install that version on the VMs before running the bootstrap script or as part of the bootstrap script.

A better place for this might actually be the bootstrap provider, not the infra providers, now that I think about it. @vincepri I don't think this entails changes to image-builder, as I'm not talking about building any new images but rather using cloud-init to install k8s components during provisioning. This does require internet access, but so does our current preKubeadmCommand solution. The advantage here is that it would be more reusable, and we could use it in combination with a user preKubeadmCommand.

@timothysc for kubelet we'd need to do a systemctl restart kubelet after installing the desired kubelet binary, just like we do in the preKubeadmCommand right now.

The other possibility is to change kubeadm to allow passing in custom component images (if it's not already supported; I don't think it is, from what I've seen). So your kubeadm config would look something like:

kubeadmConfigSpec:
    initConfiguration:
      nodeRegistration:
        name: '{{ ds.meta_data["local_hostname"] }}'
        customKubeletVersion: v1.19.0-alpha.1.175+7b1a531976be0d
        kubeletExtraArgs:
          cloud-provider: azure
          cloud-config: /etc/kubernetes/azure.json
    joinConfiguration:
      nodeRegistration:
        name: '{{ ds.meta_data["local_hostname"] }}'
        customKubeletVersion: v1.19.0-alpha.1.175+7b1a531976be0d
        kubeletExtraArgs:
          cloud-provider: azure
          cloud-config: /etc/kubernetes/azure.json
    clusterConfiguration:
      apiServer:
        timeoutForControlPlane: 20m
        customImage: myDockerHubUser/custom-api-server-build:v1.19.0-dirty
        extraArgs:
          cloud-provider: azure
          cloud-config: /etc/kubernetes/azure.json
        extraVolumes:
           - [...]
      controllerManager:
        customImage: myDockerHubUser/custom-controller-manager-build:v1.19.0-dirty
        extraArgs:
          cloud-provider: azure
          cloud-config: /etc/kubernetes/azure.json
          allocate-node-cidrs: "false"
        extraVolumes:
          - [...]

And have kubeadm pull the right images / components before init/join in cloud init. Basically I'm just trying to think of ways we can build a k8s cluster with custom builds of various k8s components installed without having to build VM images in every test.
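
For comparison, here is a minimal sketch of what kubeadm's existing ClusterConfiguration already allows today: imageRepository redirects all core control-plane images to a different registry, although it can't override individual component images the way the customKubeletVersion/customImage fields sketched above would. The registry path below is only a placeholder:

kubeadmConfigSpec:
    clusterConfiguration:
      # kube-apiserver, kube-controller-manager, kube-scheduler, etc. are pulled
      # from this registry instead of the default, tagged with the cluster's
      # Kubernetes version, so every core image has to be published there.
      imageRepository: docker.io/myDockerHubUser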

alexeldeib commented 4 years ago

FYI, in 1.16 kubeadm started supporting Kustomize patches (-k flag) on the static manifests. Might be useful:
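
A minimal sketch of the kind of patch this could enable, assuming the -k support applies strategic-merge patches over the generated static pod manifests (the image name is just the placeholder used earlier in this thread):

apiVersion: v1
kind: Pod
metadata:
  # Must match the generated static pod for the patch to apply to it.
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: kube-apiserver
    # Swap in a custom apiserver build while leaving the rest of the manifest intact.
    image: myDockerHubUser/custom-api-server-build:v1.19.0-dirty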

detiber commented 4 years ago

I kind of like the idea of adding support to the bootstrap provider (and hiding it behind a feature gate). It would allow us to recreate the existing functionality in a more centralized and re-usable way than exists today.

If nothing else it would provide a good stopgap until we can better define an automated pipeline where we could consume images that are automatically built using image-builder from the latest k8s artifacts.

detiber commented 4 years ago

FYI, in 1.16 kubeadm started supporting Kustomize patches (-k flag) on the static manifests. Might be useful:

We would either need to add validation of the kustomize patches against the requested k8s version, or wait until we are ready to declare that we are only willing to support workload clusters >= v1.16, if we go down that path.

fabriziopandini commented 4 years ago

I'd expand the 4th goal

+1 to this, I would also like to consider the idea of having Cluster API conformance tests (as a next step for the work started with https://github.com/kubernetes-sigs/cluster-api/issues/2753)

The other possibility is to change kubeadm to allow passing in custom component images

This should be already possible, I can give examples if required.

CecileRobertMichon commented 4 years ago

@fabriziopandini would love examples if you have them

elmiko commented 4 years ago

just to add an extra layer to this conversation, i am looking at contributing some e2e tests for the kubernetes autoscaler that use cluster-api. although we will start by using the docker provider to help keep the resources low, i think it would not be difficult to have these tests also use cloud providers at some point.

vincepri commented 4 years ago

/milestone v0.3.x

CecileRobertMichon commented 4 years ago

@vincepri with the new v1alpha3+ roadmap should this be 0.3.x or 0.4.x?

vincepri commented 4 years ago

@CecileRobertMichon This could be added to v0.3.x in a backward compatible way. I'm unclear though if we have folks interested in working on it.

CecileRobertMichon commented 4 years ago

is it okay to mark this with help wanted? I can probably help with some of it but I don't think I have bandwidth to work on it full time right away.

vincepri commented 4 years ago

/help

k8s-ci-robot commented 4 years ago

@vincepri: This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

vincepri commented 4 years ago

/kind cleanup

randomvariable commented 4 years ago

/assign

CAPA conformance was giving me grief so i kind of started doing it.

/lifecycle active

elmiko commented 4 years ago

hey @randomvariable, just an update to the comment i made previously in this thread. i have started to hack on an experiment where i have broken out the autoscaler tests from upstream and started to make them into a library.

the general idea is that currently the upstream autoscaler tests are heavily tied to the gce/gke provider. i am working towards rewriting the provider interface so that it can be used generically (i.e. more widely applicable abstractions). the end result would be a series of tests that can be consumed as a library, with the user passing in a provider at compile time, in essence providing a generic suite of tests that can be consumed from the outside (no included providers).

i certainly don't think you should wait for me, but i wanted to let you know what i've been hacking on.

vincepri commented 4 years ago

/milestone v0.4.0

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fabriziopandini commented 4 years ago

/remove-lifecycle stale

fabriziopandini commented 4 years ago

@CecileRobertMichon might be worth checking if there is still work to be done here now that #3652 is merged

fabriziopandini commented 3 years ago

/area testing

CecileRobertMichon commented 3 years ago

@fabriziopandini @randomvariable I believe the last thing remaining here is support for building custom k8s versions, so that CAPI can be used to test k8s PRs (something like https://github.com/kubernetes/test-infra/commit/9778b6a1462f27b869832241354249a8207e7004#diff-27a83c428d1eeb41626127495412ac1986b79e109606aacde96ac4c5b3f896a2R840). Not sure if the CAPI framework is the best place to do this or if this should be a test-infra helper.

fabriziopandini commented 3 years ago

@CecileRobertMichon IMO this problem has some variants, because it is not only required to build custom k8s versions but also to get them into the machines' images. For CAPD, I'm leveraging kind build to get a custom node image before starting all the tests (so I can keep the test phase consistent/working with limited resources). For the other providers, @randomvariable implemented a solution that downloads CI artifacts onto each machine using a pre-kubeadm script; it might be helpful to link here an example of how this can be used from CAPA.

CecileRobertMichon commented 3 years ago

@fabriziopandini I would expect the solution of loading the artifacts onto each machine to be the same. The only difference would be that, instead of getting an existing CI version at https://dl.k8s.io/ci/latest.txt and using the already-built image stored on GCR, we would need to build the image from source and load that onto the machines.

fabriziopandini commented 3 years ago

@CecileRobertMichon FYI, in kubeadm we are using images from CI only, because we determined that having a small delay from the tip of Kubernetes is not a problem, especially given the code freeze near release. So I personally see CAPD - with its own build from source - as an exception, not the rule we have to follow.

jsturtevant commented 3 years ago

Trying to understand my options and roadmap for running upstream k8s e2e tests using CAPI (starting to look at this for Windows). Looks like we have a few options.

This is an attempt to summarize where we are right now:

in CAPZ we also have

Is the end goal to be able to support all this functionality fully via kubetest2 deployer?

fabriziopandini commented 3 years ago

@jsturtevant IMO #3652 and #4041 are dealing with two different goals:

  1. Ensure Cluster API stability (Test Cluster API itself)
  2. Test Kubernetes using Cluster API (Cluster API as a release blocker in Kubernetes)

I think that for this specific issue, 2/#4041 is the most appropriate answer, but I'm not sure if/how the two things could converge. WRT this, it might be that using kubetest as a package name in #3652 wasn't a good choice...

vincepri commented 3 years ago

/milestone v0.4.x

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten

fabriziopandini commented 3 years ago

/remove-lifecycle rotten

@CecileRobertMichon what about breaking this issue down into a set of small actionable items?

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

randomvariable commented 3 years ago

/unassign

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 2 years ago

@k8s-triage-robot: Closing this issue.
