k8ssandra / k8ssandra-operator

The Kubernetes operator for K8ssandra
https://k8ssandra.io/
Apache License 2.0

K8SSAND-1511 ⁃ Remove templating when deploying CRDs in e2e tests #544

Open adutra opened 2 years ago

adutra commented 2 years ago

We currently rely on some Go templates in our e2e tests. Typically, the e2e test framework will first create a kustomization file using Go templates, then invoke kustomize build | kubectl apply on it.

This is currently done mostly to cope with dynamically generated data: namespaces, image names, context names, zone names, and other environment-specific settings, as detailed in the sections below.

We would like to remove this templating because it makes it hard for the user to understand what resources are actually being deployed, and makes the fixtures less useful as self-contained examples. Also, the templates themselves are hard to maintain.

All the above could be replaced with static kustomization files and a few changes. I will outline my suggestions below:

Namespaces

Instead of using generated namespaces I suggest that we use fixed namespaces. The namespace for the fixtures should simply be the fixture name itself, so if a fixture is called "multi-dc-reaper" then its namespace will be multi-dc-reaper.

The namespace will be created the same way it is today, that is, programmatically at the beginning of the test. Note: the fixture namespace will also be used for deploying the operator itself (except for cluster-scoped deployments).

Fixtures and Test profiles

I suggest that we consider the fixture folder and its CRDs as the kustomization base to apply.

I also suggest that we introduce the notion of "test profiles", which would correspond to kustomization overlays for different environments.

Finally, I suggest that each fixture should contain subfolders for the control plane and each data plane. This allows different resources to be deployed in the right context.

See an example below of the typical layout of a fixture folder.

Regarding test profiles, the examples below assume two of them, kind-local and gke, matching the overlays shown in the layout further down.

Image names

Image names usually don't need to be changed for local and GHA testing. However, they do need to be changed for GKE if we want the test to fetch a custom image that was just built and pushed to GCP Artifact Registry.

Image names for GKE will then be kustomized through the gke test profile / overlay.
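As a sketch, the gke image component could rely on kustomize's built-in image transformer. This assumes there is a Deployment among the resources being built whose image kustomize can rewrite (which, as discussed later in this thread, may not hold if the operator is deployed separately); the registry path and tag below are made up:

components/gke/image/kustomization.yaml:

apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component
images:
  - name: k8ssandra/k8ssandra-operator
    newName: us-docker.pkg.dev/my-project/my-repo/k8ssandra-operator  # hypothetical registry path
    newTag: my-branch                                                 # hypothetical tag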

Another usage of custom image names is ad-hoc testing. For such cases, I suggest that the tester modify the test profile manually and specify the desired image.

Context names

Context names will be kustomized through test profiles / overlays.

I suggest using fixed context names for GKE, such as gke-k8ssandra-0, gke-k8ssandra-1, etc., which would then map one-to-one to the default Kind ones: kind-k8ssandra-0, kind-k8ssandra-1, etc. This means the contexts need to be created or renamed in the kubeconfig file prior to running the test.
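As a sketch, the gke contexts component could patch the k8sContext fields of the K8ssandraCluster; this assumes the base fixture declares its datacenters in a known order, and the index below is illustrative:

components/gke/contexts/kustomization.yaml:

apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component
patches:
  - target:
      kind: K8ssandraCluster
    patch: |-
      - op: replace
        path: /spec/cassandra/datacenters/0/k8sContext
        value: gke-k8ssandra-0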

Zone names

Zone names will be kustomized through test profiles / overlays.

I suggest that for GKE, the affinity label be changed from the default topology.kubernetes.io/zone to another one, e.g. k8ssandra.io/rack. This is because the default label is tightly coupled to AZ names, which is not portable from one cluster to another. Using another, proprietary label enhances the portability of fixtures. Of course, GKE nodes will need to be adequately labeled prior to deploying the fixture.
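A sketch of the corresponding gke topology component, assuming the base fixture defines at least one datacenter with one rack and that racks expose nodeAffinityLabels as in cass-operator:

components/gke/topology/kustomization.yaml:

apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component
patches:
  - target:
      kind: K8ssandraCluster
    patch: |-
      - op: add
        path: /spec/cassandra/datacenters/0/racks/0/nodeAffinityLabels
        value:
          k8ssandra.io/rack: rack1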

Host networking

Host networking will be kustomized through test profiles / overlays. I suggest switching off host networking in the base fixtures, and turning it on only for the kind-local test profile / overlay.
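A minimal sketch of the kind-local host-networking component, reusing the networking block discussed later in this thread:

components/kind-local/host-networking/kustomization.yaml:

apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component
patches:
  - target:
      kind: K8ssandraCluster
    patch: |-
      - op: add
        path: /spec/cassandra/networking
        value:
          hostNetwork: true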

Storage

Storage will be kustomized through test profiles / overlays.
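A sketch of a gke storage component, assuming the base fixture already declares storageConfig (whose structure mirrors cass-operator's) and with a made-up storage class name:

components/gke/storage/kustomization.yaml:

apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component
patches:
  - target:
      kind: K8ssandraCluster
    patch: |-
      - op: replace
        path: /spec/cassandra/storageConfig/cassandraDataVolumeClaimSpec/storageClassName
        value: premium-rwo  # hypothetical GKE storage class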

Ingress routes

Ingress routes will be made part of the fixture set, instead of being deployed programmatically. This can be done now that #523 made Traefik a prerequisite.

The example routes will be kustomized for each test profile / overlay, and for each data plane. The kustomization will change the route matching rule and the target service name.

One problem that we have is that the current routes use a Host rule that matches the entire host name. That includes the IP of the endpoint, which is not portable. I suggest using HostRegexp instead, which allows us to have one rule match any external IP, e.g.:

HostRegexp(`reaper.my-fixture.{ip:[0-9\\.]+}.nip.io`)

I also suggest that we change our nip.io URLs to include the fixture name / namespace, e.g. `reaper.my-fixture.W.X.Y.Z.nip.io`. This will reduce the risk of interference between tests.
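For illustration, a Traefik IngressRoute using such a rule could look like the sketch below; the resource name, target service name and port are assumptions:

apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: reaper-ingress  # hypothetical name
spec:
  entryPoints:
    - web
  routes:
    - match: 'HostRegexp(`reaper.my-fixture.{ip:[0-9\\.]+}.nip.io`)'
      kind: Rule
      services:
        - name: my-fixture-dc1-reaper-service  # hypothetical Reaper service name
          port: 8080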

Stargate config maps

These are currently deployed programmatically. I suggest that they be incorporated into the fixture set and deployed in each data plane.
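For illustration only, such a resource could be a plain ConfigMap like the sketch below; the name, key and content are placeholders rather than the actual test data:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cassandra-config  # hypothetical name, assumed to be referenced from the Stargate spec
data:
  cassandra.yaml: |
    authenticator: PasswordAuthenticator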

Encryption secrets

These are currently deployed programmatically. I suggest that they be incorporated into the fixture set and deployed in each data plane.
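Likewise, purely as a sketch, an encryption-stores Secret could be declared directly in the fixture; the name, keys and values below are placeholders, and the binary keystore/truststore entries would go under data as base64:

apiVersion: v1
kind: Secret
metadata:
  name: client-encryption-stores  # hypothetical name
type: Opaque
stringData:
  keystore-password: changeit    # placeholder
  truststore-password: changeit  # placeholder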

Example of a fixture folder layout

The general layout of a fixture folder is as follows, assuming that my-fixture is a multi-cluster test with 2 data planes:

my-fixture
├── control-plane
│   ├── base
│   ├── components
│   └── overlays
├── data-plane-0
│   ├── base
│   ├── components
│   └── overlays
└── data-plane-1
    ├── base
    ├── components
    └── overlays

At the beginning of each test, the E2E framework would do the equivalent of the following commands:

cd my-fixture
kustomize build control-plane/overlays/$TEST_PROFILE | kubectl apply --context $CONTROL_PLANE -f -
kustomize build data-plane-0/overlays/$TEST_PROFILE | kubectl apply --context $DATA_PLANE0 -f -
kustomize build data-plane-1/overlays/$TEST_PROFILE | kubectl apply --context $DATA_PLANE1 -f -
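For illustration, the gke overlay for the control plane could look like the sketch below; the component list is an assumption based on the example tree further down:

control-plane/overlays/gke/kustomization.yaml:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
components:
  - ../../components/gke/contexts
  - ../../components/gke/image
  - ../../components/gke/storage
  - ../../components/gke/topology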

This is the full layout of that fixture folder, assuming that it deploys a K8ssandraCluster in the control plane, and also deploys ingress routes for Reaper and Stargate, as well as Stargate config maps, in all data planes:

my-fixture
├── control-plane
│   ├── base
│   │   ├── k8ssandra.yaml
│   │   └── kustomization.yaml 
│   ├── components
│   │   ├── gke
│   │   │   ├── contexts
│   │   │   │   └── kustomization.yaml
│   │   │   ├── image
│   │   │   │   └── kustomization.yaml
│   │   │   ├── storage
│   │   │   │   └── kustomization.yaml
│   │   │   └── topology
│   │   │       └── kustomization.yaml
│   │   └── kind-local
│   │       └── host-networking
│   │           └── kustomization.yaml
│   └── overlays
│       ├── gke
│       │   └── kustomization.yaml
│       └── kind-local
│           └── kustomization.yaml
├── data-plane-0
│   ├── base
│   │   └── kustomization.yaml 
│   ├── components
│   │   └── ingress
│   │       ├── reaper
│   │       │   └── kustomization.yaml
│   │       └── stargate
│   │           └── kustomization.yaml
│   └── overlays
│       ├── gke
│       │   └── kustomization.yaml
│       └── kind-local
│           └── kustomization.yaml
└── data-plane-1
    ├── base
    │   └── kustomization.yaml
    ├── components
    │   └── ingress
    │       ├── reaper
    │       │   └── kustomization.yaml
    │       └── stargate
    │           └── kustomization.yaml
    └── overlays
        ├── gke
        │   └── kustomization.yaml
        └── kind-local
            └── kustomization.yaml

Admittedly, this is a lot of kustomization files, but I think this is the price to pay to get rid of templating entirely.


jeffbanks commented 2 years ago

@adutra take a look at this model that was created for cloud-readiness and is currently being used.

I wonder if this could be an alternative way to handle your test inputs, dynamic values, etc. for e2e tests? It uses a similar approach, but with Golang structures instead of kustomize test fixtures. The models were created to manage not only TF provisioning but also k8ssandra installation. Same model, two different activities that are decoupled from one another.

adutra commented 2 years ago

@jeffbanks thanks I will have a look 👍 There is definitely some overlap here, especially around the deployment of the operators. However, the idea here is to also change how we deploy the test fixtures themselves, which is something that cloud-readiness is not supposed to support (unless I'm missing something).

adutra commented 2 years ago

I have a prototype that works on my machine, see #546. The only test that works for now is RemoveTemplating, but I'm able to deploy a multi-cluster fixture with all the ingresses correctly configured.

adutra commented 2 years ago

the fixture namespace will also be used for deploying the operator itself (except for cluster-scoped deployments).

In fact, I think this won't be possible. The operator is deployed by building the config/deployments dir, and the namespace is k8ssandra-operator. I think we'll need to always deploy the operator in cluster-scope mode (install in k8ssandra-operator and watch all namespaces).

jsanda commented 2 years ago

In fact, I think this won't be possible.

@adutra what won't be possible, deploying the operator in a different namespace when it is configured to only watch a single namespace? If that is the case, then that is a bug that needs to be fixed, separate from this issue of course.

jsanda commented 2 years ago

There shouldn't be an issue with deploying the operator in the fixture namespace. I did a quick sanity check. It should just be a matter of using a namespace transformer in the kustomizations at this level:

my-fixture
├── control-plane
├── data-plane-0
└── data-plane-1
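For reference, a sketch of such a kustomization with a namespace transformer; the relative path to the operator deployment is an assumption:

my-fixture/control-plane/kustomization.yaml:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: my-fixture  # rewrites the namespace of every resource below
resources:
  - ../../../config/deployments/control-plane  # hypothetical path to the operator deployment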

I was thinking about including k8ssandra.yaml in the control-plane kustomization. That may not work, at least not without some changes. A lot of the tests set the k8sContext property. That requires the ClientConfigs to be deployed. We cannot generate the ClientConfigs until the operator is deployed. If we try deploying k8ssandra.yaml along with the operator, the validating webhook should reject it since we won't yet have a ClientConfig. I have mentioned in other discussions that I don't think the k8sContext check should be performed in the webhook. I think it should be handled with a status condition, maybe something like ValidConfiguration. This could work, but it creates a lot of noise in the operator logs, which could make debugging harder.

I think it would be better to keep the fixture deployment separate. We should

These are more or less the steps we have in place today, but we should push as much as we can out of Go code and instead do it declaratively.

For the context names, why not just keep them the same whether we are running against kind, GKE, or something else? Wouldn't that keep the test code simpler? I want to err on the side of simplicity as much as possible :)

I want to suggest another approach for host networking and storage. We have had #63 open for some time now. The idea with the operator configuration is to supply various settings and defaults via a ConfigMap. cass-operator already does this for images. For dev/test environments running with kind in particular we could specify that host networking should be enabled by default. This would save having to declare it in each K8ssandraCluster manifest. Similarly, we could specify a default storage class to be used. This would simplify configuring the fixtures and would also be generally useful outside of tests. wdyt?
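Since #63 is not implemented yet, the shape of such a ConfigMap is pure speculation; a sketch with invented keys might look like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: k8ssandra-operator-config  # invented name
  namespace: k8ssandra-operator
data:
  defaults.yaml: |
    hostNetwork: true
    storageClassName: standard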

adutra commented 2 years ago

There shouldn't be an issue with deploying the operator in the fixture namespace. I did a quick sanity check. It should just be a matter of using a namespace transformer in the kustomizations at this level:

Yes that would work, but I'm reluctant to add one more kustomization file just to declare a namespace transformer. Moreover, as you pointed out, that kustomization file needs to be applied when deploying the operator, while the others are meant to be applied when deploying the fixture. We would need to create a specific subfolder for that, maybe:

my-fixture
├── operator-deployment
└── fixture-deployment
    ├── control-plane
    ├── data-plane-0
    └── data-plane-1

That said, I wonder if we couldn't just deploy all the fixtures in the k8ssandra-operator namespace? (Except the cluster-scoped ones of course.) This would greatly simplify the deployment phase and spare us the creation of one extra namespace.

For the context names why not just keep the same whether we are running against kind, GKE, or something else? Wouldn't that keep the test code simpler? I want to err on the side of simplicity as much as possible :)

That would work too, but with one caveat: the contexts created with Kind won't get automatically deleted when you run kind delete clusters --all. We can add an extra step in some makefile targets to work around that. If that's OK with you, then I'm in.

I want to suggest another approach for host networking and storage. We have had https://github.com/k8ssandra/k8ssandra-operator/issues/63 open for some time now. The idea with the operator configuration is to supply various settings and defaults via a ConfigMap.

I agree with the idea to use a ConfigMap to configure the operator itself. I'm a bit reluctant about doing so to modify user objects. What makes me uncomfortable is that we would be modifying resources without telling the user what and why. All in all, I think I still prefer the verbosity of a kustomize-based solution.

jsanda commented 2 years ago

Deploying all fixtures to the k8ssandra-operator namespace would certainly simplify things; however, it would be good to have at least one test that deploys to a different namespace, to verify that this works correctly. Whether the target namespace is k8ssandra-operator or something else, the test framework should ideally perform the same setup and steps. In other words, the framework is just going to execute kustomize build ... | kubectl apply ... either way.

For the k8s context names, which approach do you think is ultimately simpler? The mapping requires a bit of extra work on the test side. Keeping the context names the same potentially impacts developer workflows since we can't use kind delete clusters --all. For what it's worth I always use kind delete clusters --name <cluster>. Will that still work?

I'm a bit reluctant about doing so to modify user objects. What makes me uncomfortable is that we would be modifying resources without telling the user what and why.

We would be applying default values. We already apply default values via kubebuilder annotations. This gives us a way to change what those defaults would be at runtime. This would only apply for setting defaults. It would not modify existing clusters.

jsanda commented 2 years ago

@adutra to be clear, I am proposing that the changes I described for setting defaults for host networking and storage be done for #63, independently of this ticket. Assuming those changes are made for #63, I thought they could help further simplify things here.

adutra commented 2 years ago

Deploying all fixtures to the k8ssandra-operator namespace would certainly simplify things; however, it would be good to have at least one test that does deploy to a different namespace to verify deploying into a different namespace does work correctly.

Fair enough, let's deploy to a different namespace then.

For what it's worth I always use kind delete clusters --name <cluster>. Will that still work?

The command will work, the resources will be deleted, and the cluster and user entries will be removed from your kubeconfig file; but the context entry will remain (and will become invalid, since it would point to a non-existent cluster and a non-existent user). I don't think it's a huge problem tbh, but something to be aware of.

We would be applying default values. We already apply default values via kubebuilder annotations. This gives us a way to change what those defaults would be at runtime. This would only apply for setting defaults. It would not modify existing clusters.

OK, then I agree to use it for, say, image defaults. But for hostNetwork we would still have a problem imo. The field is a bool; it is thus impossible to differentiate the two definitions below:

spec:
  cassandra:
    networking:
      hostNetwork: false # explicitly set to false
---
spec:
  cassandra:
    networking: {} # implicitly set to false (default value)

In the first snippet, should the operator override hostNetwork, even if it was explicitly set to false? It doesn't sound right to me. Moreover, not all deployments need hostNetwork true with Kind clusters, only multi-dc ones.

host networking and storage I am proposing that they be done for https://github.com/k8ssandra/k8ssandra-operator/issues/63 independent of this ticket

OK so I suggest that for now we add kustomizations for these settings, and in #63 we can revisit that later.

jsanda commented 2 years ago

The hostNetwork property is a bool, but the Networking property is a pointer.

not all deployments need hostNetwork true with Kind clusters, only multi-dc ones.

Very true. The defaults defined in the ConfigMap can be set according to one's needs.

I suggest that for now we add kustomizations for these settings, and in https://github.com/k8ssandra/k8ssandra-operator/issues/63 we can revisit that later.

👍

adutra commented 2 years ago

The hostNetwork property is a bool, but the Networking property is a pointer.

That doesn't change the situation much, I'm afraid. What if the user wants to specify nodePort? That is, how would you distinguish the following snippets?

spec:
  cassandra:
    networking:
      nodePort:
        native: 9142
      hostNetwork: false # explicitly set to false
---
spec:
  cassandra:
    networking:
      nodePort:
        native: 9142
      # hostNetwork implicitly set to false (default value)

But let's table that for now and come back to it when implementing #63.

adutra commented 2 years ago

As discussed in a team meeting 2 days ago, we are going to table this for now, until we get a clear vision of what we want to achieve with respect to running tests in the cloud.