rancher / cluster-api-provider-rke2

RKE2 bootstrap and control-plane Cluster API providers.
Apache License 2.0
82 stars 28 forks source link

Empty spec.template.spec.version makes ClusterClass fail #354

Closed anmazzotti closed 2 months ago

anmazzotti commented 3 months ago

What happened: [A clear and concise description of what the bug is.]

In my ClusterClass, the RKE2ControlPlaneTemplate is defined as follows (note: no spec.template.spec.version):

apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: RKE2ControlPlaneTemplate
metadata:
  name: rke2-control-plane
spec:
  template:
    spec:
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: ElementalMachineTemplate
        name: rke2-control-plane
      serverConfig:
        disableComponents:
          kubernetesComponents:
            - cloudController
      nodeDrainTimeout: 2m
      registrationMethod: "control-plane-endpoint"
      rolloutStrategy:
        type: "RollingUpdate"
        rollingUpdate:
          maxSurge: 1

Cluster definition carries the version in spec.topology.version as expected:

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  labels:
    cni: ${CLUSTER_NAME}-crs-0
    crs: "true"
  name: ${CLUSTER_NAME}
  namespace: ${NAMESPACE}
spec:
  clusterNetwork:
    services:
      cidrBlocks: ${SERVICE_CIDR:=["10.96.0.0/12"]}
    pods:
      cidrBlocks: ${POD_CIDR:=["10.244.0.0/16"]}
    serviceDomain: ${SERVICE_DOMAIN:="cluster.local"}
  topology:
    class: rke2
    version: "v${KUBERNETES_VERSION:=1.30.1}+rke2r1"
    controlPlane:
      metadata: {}
      replicas: ${CONTROL_PLANE_MACHINE_COUNT:=1}
    workers:
      machineDeployments:
      - class: rke2-default-worker
        name: md-0
        replicas: ${WORKER_MACHINE_COUNT:=1}
    variables:
    - name: controlPlaneEndpointHost
      value: ${CONTROL_PLANE_ENDPOINT_HOST:=""}
    - name: controlPlaneEndpointPort
      value: ${CONTROL_PLANE_ENDPOINT_PORT:=6443}
    - name: vipInterface
      value: ${VIP_INTERFACE:=eth0}

However application fails with:

The RKE2ControlPlaneTemplate "rke2-control-plane" is invalid: spec.template.spec.version: Invalid value: "": spec.template.spec.version in body should match 'v(\d\.\d{2}\.\d)\+rke2r\d'

What did you expect to happen:

No failure. The version should be read from the Cluster.spec.topology.version.

How to reproduce it:

See above or check out the reference

Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

Environment:

furkatgofurov7 commented 2 months ago

Resurfacing the discussion details and my findings we had offline on slack:

I am a bit unsure why not just set spec.version on RKE2ControlPlane and use that in your case? My argument is, that field is required in CC API contract (ref), this specifically:

Impact on the controlPlane providers:

the provider implementers are required to implement the ControlPlaneTemplate type (e.g. KubeadmControlPlaneTemplate etc.).
it is also important to notice that:
ClusterClass and managed topologies can work only with control plane providers implementing support for the spec.version field; Additionally, it is required to provide support for the status.version field reporting the minimum API server version in the cluster as required by the control plane contract.

So, I mean version should be read from spec.template.spec.version by design as in the logs you are seeing.

On the other note, I also previously mentioned it, CAPRKE2 API diverged from KCP in terms of what fields they/we have and not quiet following it, since CC was designed with KCP in mind. To align with KCP, we need to re-organize/shuffle our API fields to make our API (fields) match with KCP and that would need probably an API bump. I think, generally this issue is not really a bug but a neat, since we are used to KCPs API structure, however, CAPRKE2 API is different and it has the reasons why it is like that I believe.

anmazzotti commented 2 months ago

@furkatgofurov7 I think this is still a bug.

The main problem is that the validation is too strict and doesn't allow to write a CC that uses this version as a variable. For example this will fail with the same error I posted above:

    - name: rKE2ControlPlaneTemplate
      definitions:
        - selector:
            apiVersion: controlplane.cluster.x-k8s.io/v1beta1
            kind: RKE2ControlPlaneTemplate
            matchResources:
              controlPlane: true
          jsonPatches:
            - op: add
              path: "/spec/template/spec/version"
              valueFrom:
                variable: k8sVersion

Am I doing anything wrong? op: create also does not work. Do you have any example of a RKE2 Cluster Class using this version field that works?

It is the same problem as #343, just on a different field. The strict validation prevents the Cluster Class from working, unless you hardcode all the values of course.

Again, maybe I'm doing it wrong, not sure, if so please update the CC example in this repo.

Thank you.

anmazzotti commented 2 months ago

As mentioned by @alexander-demicev , the version validation was tweaked in one of the latest releases. I tested with 0.5.0 and I can confirm it works now. I had to bring the k8s version variable inside the Cluster Class definition, but it does work once that is done. Reference

I don't want to take the liberty to close the issue, but it can be closed from my point of view. Thank you.