siderolabs / cluster-api-control-plane-provider-talos

A control plane provider for CAPI + Talos

Unable to bootstrap Talos Worker Cluster on VMWare ESXi 7.0.3 - `no addresses were found for node` #181

Closed: julien-sugg closed this issue 9 months ago

julien-sugg commented 9 months ago

Greetings,

We are facing issues bootstrapping a worker cluster on VMware ESXi 7.0.3, following the vSphere CAPV and Talos documentation.

Versions

Description

We manually created a first Talos cluster on VMware using the OVA and talosctl, then installed the appropriate operators using clusterctl. The cluster has the following minimal patches and uses the defaults otherwise:

```yaml
# management.patch.yaml
- op: add
  path: /machine/network
  value:
    interfaces:
      - interface: eth0
        dhcp: true
- op: add
  path: /machine/install
  value:
    extraKernelArgs:
      - net.ifnames=0
- op: replace
  path: /cluster/allowSchedulingOnControlPlanes
  value: true
- op: replace
  path: /machine/time
  value:
    disabled: false
    servers:
      - 172.30.110.1
```
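
For completeness, such a patch file is applied when generating the initial machine configuration; a minimal sketch, with a placeholder cluster name and endpoint that are not taken from this report:

```bash
# Apply the JSON patch file while generating Talos machine configs.
# "management-cluster" and the endpoint below are illustrative placeholders.
talosctl gen config management-cluster https://172.30.110.10:6443 \
  --config-patch @management.patch.yaml
```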

The operators were successfully installed with the following command:

```bash
clusterctl init --infrastructure vsphere:v1.8.1 --bootstrap talos:v0.6.2 --control-plane talos:v0.5.3 --target-namespace cluster-api-system
```
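
As a quick sanity check (not part of the original report), the provider deployments can be verified in the target namespace:

```bash
# The CAPI, CAPV, CABPT and CACPPT controller pods should all be Running.
kubectl get pods -n cluster-api-system
```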

However, when we then tried to create a basic cluster via Kustomize and minimal Cluster API manifests, bootstrap failed with what initially looked like hostname resolution issues involving the DNS search list.

Click to expand manifests

➜ k kustomize observability-cluster-poc

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: observability-cluster-poc
  namespace: cluster-api-system
stringData:
  password: REDACTED
  username: clusterapi
---
apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
kind: TalosConfigTemplate
metadata:
  name: observability-cluster-poc-md-0
  namespace: cluster-api-system
spec:
  template:
    spec:
      configPatches:
        - op: add
          path: /machine/network
          value:
            interfaces:
              - dhcp: true
                interface: eth0
            nameservers:
              - 172.30.110.1
        - op: add
          path: /machine/install
          value:
            extraKernelArgs:
              - net.ifnames=0
        - op: add
          path: /cluster/network/cni
          value:
            name: none
        - op: add
          path: /cluster/proxy
          value:
            disabled: true
        - op: add
          path: /machine/features/kubePrism
          value:
            enabled: true
            port: 7445
        - op: replace
          path: /cluster/controlPlane
          value:
            endpoint: https://172.30.11.10:6443
        - op: add
          path: /machine/certSANs
          value:
            - 172.30.11.10
        - op: add
          path: /machine/time
          value:
            disabled: false
            servers:
              - 172.30.110.1
      generateType: worker
      talosVersion: v1.5.2
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  labels:
    cluster.x-k8s.io/cluster-name: observability-cluster-poc
  name: observability-cluster-poc
  namespace: cluster-api-system
spec:
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
    kind: TalosControlPlane
    name: observability-cluster-poc
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: VSphereCluster
    name: observability-cluster-poc
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  labels:
    cluster.x-k8s.io/cluster-name: observability-cluster-poc
  name: observability-cluster-poc-md-0
  namespace: cluster-api-system
spec:
  clusterName: observability-cluster-poc
  replicas: 3
  selector:
    matchLabels: {}
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      labels:
        cluster.x-k8s.io/cluster-name: observability-cluster-poc
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
          kind: TalosConfigTemplate
          name: observability-cluster-poc-md-0
      clusterName: observability-cluster-poc
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: VSphereMachineTemplate
        name: observability-cluster-poc-worker
      version: v1.27.5
---
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: TalosControlPlane
metadata:
  name: observability-cluster-poc
  namespace: cluster-api-system
spec:
  controlPlaneConfig:
    controlplane:
      configPatches:
        - op: add
          path: /machine/network
          value:
            interfaces:
              - dhcp: true
                interface: eth0
                vip:
                  ip: 172.30.11.10
            nameservers:
              - 172.30.110.1
        - op: add
          path: /machine/install
          value:
            extraKernelArgs:
              - net.ifnames=0
        - op: add
          path: /cluster/network/cni
          value:
            name: none
        - op: add
          path: /cluster/proxy
          value:
            disabled: true
        - op: add
          path: /machine/features/kubePrism
          value:
            enabled: true
            port: 7445
        - op: replace
          path: /cluster/controlPlane
          value:
            endpoint: https://172.30.11.10:6443
        - op: add
          path: /machine/certSANs
          value:
            - 172.30.11.10
        - op: add
          path: /cluster/coreDNS
          value:
            disabled: true
        - op: add
          path: /machine/time
          value:
            disabled: false
            servers:
              - 172.30.110.1
      generateType: controlplane
      talosVersion: v1.5.2
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: VSphereMachineTemplate
    name: observability-cluster-poc
  replicas: 3
  rolloutStrategy:
    rollingUpdate:
      maxSurge: 1
    type: RollingUpdate
  version: v1.27.5
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereCluster
metadata:
  name: observability-cluster-poc
  namespace: cluster-api-system
spec:
  controlPlaneEndpoint:
    host: 172.30.11.10
    port: 6443
  identityRef:
    kind: Secret
    name: observability-cluster-poc
  server: REDACTED
  thumbprint: REDACTED
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
  name: observability-cluster-poc
  namespace: cluster-api-system
spec:
  template:
    spec:
      cloneMode: linkedClone
      datacenter: REDACTED
      datastore: REDACTED
      diskGiB: 25
      folder: cluster-api-vms
      memoryMiB: 8192
      network:
        devices:
          - dhcp4: true
            networkName: PLATFORM-PRODUCTION-OBSERVABILITY
          - dhcp4: true
            networkName: PRODUCTION
      numCPUs: 2
      os: Linux
      powerOffMode: hard
      resourcePool: Cluster-API-POC
      server: REDACTED
      storagePolicyName: ""
      template: talos-linux-1.5.2
      thumbprint: REDACTED
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
  name: observability-cluster-poc-worker
  namespace: cluster-api-system
spec:
  template:
    spec:
      cloneMode: linkedClone
      customVMXKeys:
        disk.EnableUUID: "true"
      datacenter: REDACTED
      datastore: REDACTED
      diskGiB: 25
      folder: cluster-api-vms
      memoryMiB: 8192
      network:
        devices:
          - dhcp4: true
            networkName: PLATFORM-PRODUCTION-OBSERVABILITY
          - dhcp4: true
            networkName: PRODUCTION
      numCPUs: 2
      os: Linux
      powerOffMode: hard
      resourcePool: Cluster-API-POC
      server: REDACTED
      storagePolicyName: ""
      template: talos-linux-1.5.2
      thumbprint: REDACTED
```
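
For readers reproducing this setup, a kustomization.yaml along the following lines would tie the manifests together; the file names are hypothetical, only the directory name appears in the report:

```yaml
# observability-cluster-poc/kustomization.yaml (hypothetical file layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - secret.yaml
  - cluster.yaml
  - taloscontrolplane.yaml
  - talosconfigtemplate.yaml
  - machinedeployment.yaml
  - vspherecluster.yaml
  - vspheremachinetemplates.yaml
```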

Logs and outputs

On the CACPPT controller, the following logs repeat on each reconciliation attempt:

➜ k logs cacppt-controller-manager-5b4b84c466-xrq88 --tail 5
# Output

2023-10-06T06:52:14Z    INFO    controllers.TalosControlPlane   failed to get kubeconfig for the cluster    {"error": "failed to create cluster accessor: error creating client for remote cluster \"cluster-api-system/observability-cluster-poc\": error getting rest mapping: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://172.30.11.10:6443/api/v1?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)", "errorVerbose": "failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://172.30.11.10:6443/api/v1?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\nerror creating client for remote cluster \"cluster-api-system/observability-cluster-poc\": error getting rest mapping\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).createClient\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.5.0/controllers/remote/cluster_cache_tracker.go:396\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).newClusterAccessor\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.5.0/controllers/remote/cluster_cache_tracker.go:299\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).getClusterAccessor\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.5.0/controllers/remote/cluster_cache_tracker.go:273\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).GetClient\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.5.0/controllers/remote/cluster_cache_tracker.go:180\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).updateStatus\n\t/src/controllers/taloscontrolplane_controller.go:562\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).Reconcile.func1\n\t/src/controllers/taloscontrolplane_controller.go:155\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).Reconcile\n\t/src/controllers/taloscontrolplane_controller.go:184\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/toolchain/go/src/runtime/asm_amd64.s:1598\nfailed to create cluster accessor\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).getClusterAccessor\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.5.0/controllers/remote/cluster_cache_tracker.go:275\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).GetClient\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.5.0/controllers/remote/cluster_cache_tracker.go:180\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).updateStatus\n\t/src/controllers/taloscontrolplane_controller.go:562\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).Reconcile.func1\n\t/src/controllers/taloscontrolplane_controller.go:155\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).Reconcile\n\t/src/controllers/taloscontrolplane_controller.go:184\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/toolchain/go/src/runtime/asm_amd64.s:1598"}
2023-10-06T06:52:14Z    INFO    controllers.TalosControlPlane   successfully updated control plane status   {"namespace": "cluster-api-system", "talosControlPlane": "observability-cluster-poc", "cluster": "observability-cluster-poc"}
2023-10-06T06:52:14Z    INFO    reconcile TalosControlPlane {"controller": "taloscontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "TalosControlPlane", "TalosControlPlane": {"name":"observability-cluster-poc","namespace":"cluster-api-system"}, "namespace": "cluster-api-system", "name": "observability-cluster-poc", "reconcileID": "da9352b1-18b1-46a1-a2f3-7afbc3f7ecec", "cluster": "observability-cluster-poc"}
2023-10-06T06:52:14Z    INFO    controllers.TalosControlPlane   bootstrap failed, retrying in 20 seconds    {"namespace": "cluster-api-system", "talosControlPlane": "observability-cluster-poc", "error": "no addresses were found for node \"observability-cluster-poc-hvv4m\""}
2023-10-06T06:52:14Z    INFO    controllers.TalosControlPlane   attempting to set control plane status

The line `bootstrap failed, retrying in 20 seconds {"error": "no addresses were found for node \"observability-cluster-poc-hvv4m\""}` is especially troublesome, and we dug further into it without success.
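
One way to inspect what the controller actually sees (a sketch using the Machine name from the error message):

```bash
# The error suggests the Machine's .status.addresses list is empty.
kubectl -n cluster-api-system get machine observability-cluster-poc-hvv4m \
  -o jsonpath='{.status.addresses}'
```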

Our networking is handled solely by a dedicated OPNsense instance.

All the VMs have DHCP enabled, and DHCP leases are automatically and successfully registered for all of them, including the new ones that fail to bootstrap.

When I create a dummy nettool Pod in the cluster, everything works like a charm, and we can see that /etc/resolv.conf is properly configured with the appropriate search list:

➜ kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot
➜ tmp-shell  ~  dig +search +short observability-cluster-poc-hvv4m
# Output
172.30.110.113

➜ tmp-shell  ~  cat /etc/resolv.conf 
# Output
search cluster-api-system.svc.cluster.local svc.cluster.local cluster.local js.lan
nameserver 10.96.0.10
options ndots:5

I observe the same behavior when attaching a nettool debug container directly to the CACPPT controller:

➜ kubectl debug -it cacppt-controller-manager-5b4b84c466-xrq88 --image=nicolaka/netshoot --target=manager
➜ cacppt-controller-manager-5b4b84c466-xrq88  ~  dig +search +short observability-cluster-poc-hvv4m
# Output
172.30.110.113

➜ cacppt-controller-manager-5b4b84c466-xrq88  ~  cat /etc/resolv.conf 
# Output
search cluster-api-system.svc.cluster.local svc.cluster.local cluster.local js.lan
nameserver 10.96.0.10
options ndots:5

I also attached a debug container to coredns, just in case:

➜ kubectl debug -it coredns-78f679c54d-87pfz --image=nicolaka/netshoot --target=coredns 
➜ coredns-78f679c54d-87pfz  ~  dig +search +short observability-cluster-poc-hvv4m
# Output
172.30.110.113

➜ coredns-78f679c54d-87pfz  ~  cat /etc/resolv.conf 
# Output
search js.lan
nameserver 172.30.110.1

Thanks for your help.

smira commented 9 months ago

This error is about a Machine CRD in your management cluster, not about Talos itself. CACPPT needs addresses to talk to the Talos API, and it is the infrastructure provider's job to provide these addresses.
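
For illustration (not part of the original comment), the addresses in question live in the Machine object's status, which the infrastructure provider is expected to populate:

```yaml
# Illustrative shape only; the values below are made up.
status:
  addresses:
    - type: InternalIP
      address: 172.30.110.113
    - type: Hostname
      address: observability-cluster-poc-hvv4m
```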

julien-sugg commented 9 months ago

Thanks for the update; the error message indeed led me down the wrong path.

The real issue is that I had temporarily commented out the VM tools extra manifests configuration, which is why the IPs were no longer retrieved at the vSphere level. Indeed, the underlying VSphereMachines were stuck in the WaitingForIPAllocation state.
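
The stuck state is visible on the infrastructure objects themselves, e.g.:

```bash
# VSphereMachines report WaitingForIPAllocation until vSphere discovers an IP,
# which requires VM tools running in the guest.
kubectl -n cluster-api-system get vspheremachines
```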

Re-enabling it solved the issue:

```yaml
---
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: TalosControlPlane
metadata:
  ...
spec:
  ...
  controlPlaneConfig:
    controlplane:
      generateType: controlplane
      talosVersion: v1.5.2
      configPatches:
        ...
        - op: replace
          path: /cluster/extraManifests
          value:
            - "https://raw.githubusercontent.com/mologie/talos-vmtoolsd/master/deploy/unstable.yaml"
```