rancher-sandbox / cluster-api-provider-harvester

A Cluster API Infrastructure Provider for Harvester

New cluster using Talos is not progressing beyond Machines in Provisioning stage. #37

Open dhaugli opened 4 months ago

dhaugli commented 4 months ago

What happened:

The cluster is not coming up: the Harvester load balancer is not created and the machines never leave the Provisioning state. The machines are provisioned in Harvester and get IPs from my network, and I can attach a console to them, though since it's Talos there is not much to see there.

Screenshot of the console of one of the Talos control plane VMs:

Screenshot 2024-06-06 232557

caph-provider logs:

1)

 ERROR   failed to patch HarvesterMachine        {"controller": "harvestermachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterMachine", "HarvesterMachine": {"name":"capi-mgmt-p-01-zzmph","namespace":"cluster-capi-mgmt-p-01"}, "namespace": "cluster-capi-mgmt-p-01", "name": "capi-mgmt-p-01-zzmph", "reconcileID": "7ec120a6-8a1e-40b1-98dd-3597ce44ca1c", "machine": "cluster-capi-mgmt-p-01/capi-mgmt-p-01-7shhp", "cluster": "cluster-capi-mgmt-p-01/capi-mgmt-p-01", "error": "HarvesterMachine.infrastructure.cluster.x-k8s.io \"capi-mgmt-p-01-zzmph\" is invalid: ready: Required value", "errorCauses": [{"error": "HarvesterMachine.infrastructure.cluster.x-k8s.io \"capi-mgmt-p-01-zzmph\" is invalid: ready: Required value"}]}
github.com/rancher-sandbox/cluster-api-provider-harvester/controllers.(*HarvesterMachineReconciler).Reconcile.func1
        /workspace/controllers/harvestermachine_controller.go:121
github.com/rancher-sandbox/cluster-api-provider-harvester/controllers.(*HarvesterMachineReconciler).Reconcile
        /workspace/controllers/harvestermachine_controller.go:198
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:118
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:314
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:226
2024-06-06T19:58:10Z    ERROR   Reconciler error        {"controller": "harvestermachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterMachine", "HarvesterMachine": {"name":"capi-mgmt-p-01-zzmph","namespace":"cluster-capi-mgmt-p-01"}, "namespace": "cluster-capi-mgmt-p-01", "name": "capi-mgmt-p-01-zzmph", "reconcileID": "7ec120a6-8a1e-40b1-98dd-3597ce44ca1c", "error": "HarvesterMachine.infrastructure.cluster.x-k8s.io \"capi-mgmt-p-01-zzmph\" is invalid: ready: Required value", "errorCauses": [{"error": "HarvesterMachine.infrastructure.cluster.x-k8s.io \"capi-mgmt-p-01-zzmph\" is invalid: ready: Required value"}]}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:324
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:226

2) These two log entries keep repeating.

 2024-06-06T19:58:10Z    INFO    Reconciling HarvesterMachine ...        {"controller": "harvestermachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterMachine", "HarvesterMachine": {"name":"capi-mgmt-p-01-zzmph","namespace":"cluster-capi-mgmt-p-01"}, "namespace": "cluster-capi-mgmt-p-01", "name": "capi-mgmt-p-01-zzmph", "reconcileID": "dc815768-5306-42cc-91c0-be802d85bc82"}
2024-06-06T19:58:10Z    INFO    Waiting for ProviderID to be set on Node resource in Workload Cluster ...       {"controller": "harvestermachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterMachine", "HarvesterMachine": {"name":"capi-mgmt-p-01-zzmph","namespace":"cluster-capi-mgmt-p-01"}, "namespace": "cluster-capi-mgmt-p-01", "name": "capi-mgmt-p-01-zzmph", "reconcileID": "dc815768-5306-42cc-91c0-be802d85bc82", "machine": "cluster-capi-mgmt-p-01/capi-mgmt-p-01-7shhp", "cluster": "cluster-capi-mgmt-p-01/capi-mgmt-p-01"}

capt-controller-manager logs:

I0606 19:58:08.737945       1 taloscontrolplane_controller.go:176] "controllers/TalosControlPlane: successfully updated control plane status" namespace="cluster-capi-mgmt-p-01" talosControlPlane="capi-mgmt-p-01" cluster="capi-mgmt-p-01"
I0606 19:58:08.739615       1 controller.go:327] "Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes reqeueuing with exponential backoff. For more details, see: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler" controller="taloscontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="TalosControlPlane" TalosControlPlane="cluster-capi-mgmt-p-01/capi-mgmt-p-01" namespace="cluster-capi-mgmt-p-01" name="capi-mgmt-p-01" reconcileID="b0b79408-8a41-43df-91ef-07fe7d36fa7c"
E0606 19:58:08.739746       1 controller.go:329] "Reconciler error" err="at least one machine should be provided" controller="taloscontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="TalosControlPlane" TalosControlPlane="cluster-capi-mgmt-p-01/capi-mgmt-p-01" namespace="cluster-capi-mgmt-p-01" name="capi-mgmt-p-01" reconcileID="b0b79408-8a41-43df-91ef-07fe7d36fa7c"
I0606 19:58:08.749008       1 taloscontrolplane_controller.go:189] "reconcile TalosControlPlane" controller="taloscontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="TalosControlPlane" TalosControlPlane="cluster-capi-mgmt-p-01/capi-mgmt-p-01" namespace="cluster-capi-mgmt-p-01" name="capi-mgmt-p-01" reconcileID="c37dc309-f8fb-42c7-a375-5faceb9019b9" cluster="capi-mgmt-p-01"
I0606 19:58:09.190175       1 scale.go:33] "controllers/TalosControlPlane: scaling up control plane" Desired=3 Existing=1
I0606 19:58:09.213294       1 taloscontrolplane_controller.go:152] "controllers/TalosControlPlane: attempting to set control plane status"
I0606 19:58:09.220900       1 taloscontrolplane_controller.go:564] "controllers/TalosControlPlane: failed to get kubeconfig for the cluster" error="failed to create cluster accessor: error creating client for remote cluster \"cluster-capi-mgmt-p-01/capi-mgmt-p-01\": error getting rest mapping: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://10.0.0.113:6443/api/v1?timeout=10s\": tls: failed to verify certificate: x509: certificate is valid for 10.0.0.3, 127.0.0.1, ::1, 10.0.0.5, 10.53.0.1, not 10.0.0.113"

cabpt-talos-bootstrap logs (I don't know if this is relevant):

I0606 19:58:09.206570       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-npzm4: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.224117       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-npzm4: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.243118       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-npzm4: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.280372       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-npzm4: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.341804       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-df9f2: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.352557       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-df9f2: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.439369       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-df9f2: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.480714       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-df9f2: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.539945       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-df9f2: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.548156       1 secrets.go:174] "controllers/TalosConfig: handling bootstrap data for " owner="capi-mgmt-p-01-n48cx"
I0606 19:58:09.717884       1 secrets.go:174] "controllers/TalosConfig: handling bootstrap data for " owner="capi-mgmt-p-01-n48cx"
I0606 19:58:09.720944       1 secrets.go:174] "controllers/TalosConfig: handling bootstrap data for " owner="capi-mgmt-p-01-7shhp"
I0606 19:58:09.756344       1 talosconfig_controller.go:223] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-npzm4/owner-name=capi-mgmt-p-01-n48cx: ignoring an already ready config"
I0606 19:58:09.765995       1 secrets.go:243] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-npzm4/owner-name=capi-mgmt-p-01-n48cx: updating talosconfig" endpoints=null secret="capi-mgmt-p-01-talosconfig"

What did you expect to happen: I expected the caph provider to create the LB and to proceed with creating the cluster.

How to reproduce it:

I added the providers for Talos (bootstrap and control plane) and of course the Harvester provider.

Added 4 files + the harvester secret with the following configuration:

cluster.yaml:

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: capi-mgmt-p-01
  namespace: cluster-capi-mgmt-p-01
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
        - 172.16.0.0/20
    services:
      cidrBlocks:
        - 172.16.16.0/20
    serviceDomain: cluster.local
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
    kind: TalosControlPlane
    name: capi-mgmt-p-01
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
    kind: HarvesterCluster
    name: capi-mgmt-p-01

harvester-cluster.yaml:

apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: HarvesterCluster
metadata:
  name: capi-mgmt-p-01
  namespace: cluster-capi-mgmt-p-01
spec:
  targetNamespace: cluster-capi-mgmt-p-01
  loadBalancerConfig:
    ipamType: pool
    ipPoolRef: k8s-api
  server: https://10.0.0.3
  identitySecret: 
    name: trollit-harvester-secret
    namespace: cluster-capi-mgmt-p-01
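
For reference, the identity secret referenced above is expected to carry the Harvester kubeconfig, roughly along these lines (sketch; the kubeconfig data key name is an assumption, check the provider docs for the exact format it expects):

apiVersion: v1
kind: Secret
metadata:
  name: trollit-harvester-secret
  namespace: cluster-capi-mgmt-p-01
type: Opaque
stringData:
  # Assumed key name: the kubeconfig of the target Harvester cluster
  kubeconfig: |
    <contents of the Harvester kubeconfig>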

harvester-machinetemplate.yaml:

apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: HarvesterMachineTemplate
metadata:
  name: capi-mgmt-p-01
  namespace: cluster-capi-mgmt-p-01
spec:
  template: 
    spec:
      cpu: 2
      memory: 8Gi
      sshUser: ubuntu
      sshKeyPair: default/david
      networks:
      -  cluster-capi-mgmt-p-01/capi-mgmt-network
      volumes:
      - volumeType: image 
        imageName: harvester-public/talos-1.7.4-metalqemu
        volumeSize: 50Gi
        bootOrder: 0

controlplane.yaml:

apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: TalosControlPlane
metadata:
  name: capi-mgmt-p-01
  namespace: cluster-capi-mgmt-p-01
spec:
  version: "v1.30.0"
  replicas: 3
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
    kind: HarvesterMachineTemplate
    name: capi-mgmt-p-01
  controlPlaneConfig:
    controlplane:
      generateType: controlplane
      talosVersion: v1.7.4
      configPatches:
        - op: add
          path: /cluster/network
          value:
            cni:
              name: none

        - op: add
          path: /cluster/proxy
          value:
            disabled: true

        - op: add
          path: /cluster/network/podSubnets
          value:
            - 172.16.0.0/20

        - op: add
          path: /cluster/network/serviceSubnets
          value:
            - 172.16.16.0/20

        - op: add
          path: /machine/kubelet/extraArgs
          value:
            cloud-provider: external

        - op: add
          path: /machine/kubelet/nodeIP
          value:
            validSubnets:
              - 10.0.0.0/24

        - op: add
          path: /cluster/discovery
          value:
            enabled: false

        - op: add
          path: /machine/features/kubePrism
          value:
            enabled: true

        - op: add
          path: /cluster/apiServer/certSANs
          value:
            - 127.0.0.1

        - op: add
          path: /cluster/apiServer/extraArgs
          value:
            anonymous-auth: true

Anything else you would like to add:

I have tried switching the load balancer config from dhcp to ipPoolRef with a pre-configured IP pool; this also did not work. I think it's related to the fact that the LB is never provisioned in the first place.
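
For reference, the two variants only differ in the loadBalancerConfig block of the HarvesterCluster, roughly like this (sketch based on the spec above; dhcp being an accepted ipamType value is assumed from the provider's options):

  # pool-based IPAM, referencing the pre-created IP pool
  loadBalancerConfig:
    ipamType: pool
    ipPoolRef: k8s-api

  # DHCP-based IPAM, no pool reference
  loadBalancerConfig:
    ipamType: dhcp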


Environment:

ekarlso commented 4 months ago

So after looking around and thinking a bit, I see that our CAPHV is waiting for the providerID to be set:

2024-06-07T06:13:18Z    INFO    Waiting for ProviderID to be set on Node resource in Workload Cluster ...   {"controller": "harvestermachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterMachine", "HarvesterMachine": {"name":"capi-mgmt-p-01-7d7pr","namespace":"cluster-capi-mgmt-p-01"}, "namespace": "cluster-capi-mgmt-p-01", "name": "capi-mgmt-p-01-7d7pr", "reconcileID": "d258bfe4-ba85-4d61-92e0-d6ee8aced78d", "machine": "cluster-capi-mgmt-p-01/capi-mgmt-p-01-n48cx", "cluster": "cluster-capi-mgmt-p-01/capi-mgmt-p-01"}

I see that in your examples you include the CPI as a DaemonSet. That means it will not be setting the providerID on Talos, since the node needs to be bootstrapped before the DaemonSet can start and before the CPI can set the providerID, right? https://github.com/rancher-sandbox/cluster-api-provider-harvester/blob/main/templates/cluster-template-kubeadm.yaml#L190
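
For clarity, this is roughly what the controller is waiting to see on the workload-cluster Node objects (sketch; the providerID value and URI scheme below are only illustrative, the exact format the Harvester CPI uses is an assumption here):

apiVersion: v1
kind: Node
metadata:
  name: capi-mgmt-p-01-zzmph
spec:
  # Normally set by the cloud provider (CPI) once it runs on the node
  providerID: harvester://cluster-capi-mgmt-p-01/capi-mgmt-p-01-zzmph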

IMHO the controller should be able to get the HarvesterMachine into a state where the Machine object phases into Provisioned, so that other controllers, whether Talos' or anyone else's, can work with it. Isn't that the normal flow?

PatrickLaabs commented 4 months ago

Hi @dhaugli, which version of the rke2 controlplane and bootstrap provider are you using?

dhaugli commented 4 months ago

Hi @dhaugli, which version of the rke2 controlplane and bootstrap provider are you using?

We are using Talos bootstrap and Talos controlplane provider in this case.

dhaugli commented 3 months ago

I have now followed the example from the templates, but it still doesn't work, and I think I know why: the caph controller doesn't propagate the IP addresses of the machines into the Machine object, like:

status:
  addresses:
  - address: <IP>
    type: ExternalIP
  - address: <IP>
    type: ExternalIP
  - address: <DNS NAME OF MACHINE>
    type: InternalDNS

For reference, the vSphere CAPI controller does this. Without it, the Talos bootstrap controller can't see the IP and can't continue the bootstrap process. But my machines do get IPs on my network, and the QEMU guest agent does report them through Harvester.

dhaugli commented 3 months ago

I found the issue with the CAPH controller, based on the Cluster API principles for how bootstrapping should work:

[image: Cluster API bootstrap flow diagram]

The CAPH controller does not set the machine as ready in the infrastructure provider (even though it is running just fine as a VM in Harvester) because it is waiting for the providerID; because of this the LB is never created, and with Talos the nodes just end up waiting forever in the bootstrap process and never progress.
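
In other words, per the generic Cluster API machine infrastructure contract, the HarvesterMachine would need to end up looking roughly like this before the Machine can move to Provisioned (sketch; field names follow the generic contract and the values are illustrative, the exact CAPHV schema may differ):

apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: HarvesterMachine
metadata:
  name: capi-mgmt-p-01-zzmph
  namespace: cluster-capi-mgmt-p-01
spec:
  # Illustrative value; would normally be derived from the VM in Harvester
  providerID: harvester://cluster-capi-mgmt-p-01/capi-mgmt-p-01-zzmph
status:
  ready: true
  addresses:
    - type: InternalIP
      address: <IP of the VM>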

My friend Endre just made a fix in our own image; it still doesn't work, but we are working on it as well.