siderolabs / cluster-api-bootstrap-provider-talos

A cluster-api bootstrap provider for deploying Talos clusters.
https://www.talos-systems.com
Mozilla Public License 2.0

Question/issue around Talos bootstrap with Cluster API & vSphere infrastructure (CAPV) #179

Open julien-sugg opened 10 months ago

julien-sugg commented 10 months ago

Greetings,

We've been playing with Talos Linux and Cluster API to automate the management of our clusters, and are currently facing some questions/issues around the bootstrap process using the vSphere infrastructure provider.

Versions / Environment

Description

According to the Talos - VMware documentation, we have to install a custom VMware Tools implementation (talos-vmtoolsd) with some dedicated Talos configuration.
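For reference, that configuration essentially boils down to pulling the talos-vmtoolsd DaemonSet manifest in via `cluster.extraManifests`, roughly like the fragment below (this mirrors the `extraManifests` patch used in the manifests further down; the exact manifest URL/version is whatever the Talos VMware guide recommends):

```yaml
# Machine config fragment (sketch): deploy talos-vmtoolsd as a DaemonSet via extraManifests
cluster:
  extraManifests:
    - https://raw.githubusercontent.com/mologie/talos-vmtoolsd/master/deploy/unstable.yaml
```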

This totally makes sense; however, my concern is the following:

In order to bootstrap the cluster via Cluster API, and especially the CACPPT controller, my CAPV controller needs to retrieve the IP address of each VM via the vCenter API. However, that IP is only reported once VMware Tools has been successfully installed and configured in the guest. Unfortunately, installing the VMTools requires the Talos bootstrap to have already happened, because talos-vmtoolsd is deployed as a DaemonSet. This is a chicken-and-egg problem.

Our current workaround is to manually bootstrap the cluster using the IP addresses handed out by DHCP. However, this is quite a pain, as we want to automate everything via GitOps: we will manage quite a lot of permanent clusters, but also some ephemeral ones.
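Roughly, the manual workaround looks like the sketch below: grab the talosconfig generated by the bootstrap provider and bootstrap etcd on the first control-plane node against its DHCP-assigned address (the secret name and the IP are illustrative placeholders):

```sh
# 1. Extract the talosconfig generated by CABPT (assuming the usual <cluster>-talosconfig secret name)
kubectl -n cluster-api-system get secret observability-cluster-poc-talosconfig \
  -o jsonpath='{.data.talosconfig}' | base64 -d > talosconfig

# 2. Bootstrap etcd on the first control-plane node, using the IP observed on the DHCP server
talosctl --talosconfig talosconfig -e 172.30.11.21 -n 172.30.11.21 bootstrap
```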

Do you have any insights or recommendations on how to achieve this goal within the VMware ecosystem?

Reproduce Steps

The following steps can be performed to easily reproduce the issue:

  1. Create a transient cluster that will be used to spawn the first permanent management cluster via Cluster API.

The cluster can be created either directly on vSphere or with kind/k3d/... (a minimal kind example is sketched below).
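```sh
# Create a throwaway bootstrap/management cluster with kind (cluster name is arbitrary)
kind create cluster --name capi-transient
```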

  2. Initialize the Cluster API components on the transient cluster with clusterctl, installing CAPV, CABPT, and CACPPT:
```sh
clusterctl init \
  --infrastructure vsphere:v1.8.1 \
  --bootstrap talos:v0.6.2 \
  --control-plane talos:v0.5.3 \
  --target-namespace cluster-api-system
```
  3. Create the permanent management cluster with the following minimal manifests:
```yaml
---
apiVersion: v1
kind: Secret
metadata:
  name: observability-cluster-poc
  namespace: cluster-api-system
stringData:
  password: REDACTED
  username: REDACTED
---
apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
kind: TalosConfigTemplate
metadata:
  name: observability-cluster-poc-md-0
  namespace: cluster-api-system
spec:
  template:
    spec:
      configPatches:
        - op: add
          path: /machine/network
          value:
            interfaces:
              - dhcp: true
                dhcpOptions:
                  routeMetric: 1
                interface: eth0
              - dhcp: true
                dhcpOptions:
                  routeMetric: 10
                interface: eth1
        - op: add
          path: /machine/install
          value:
            extraKernelArgs:
              - net.ifnames=0
        - op: add
          path: /cluster/network/cni
          value:
            name: none
        - op: add
          path: /cluster/proxy
          value:
            disabled: true
        - op: add
          path: /machine/features/kubePrism
          value:
            enabled: true
            port: 7445
        - op: replace
          path: /cluster/controlPlane
          value:
            endpoint: https://172.30.11.10:6443
        - op: add
          path: /machine/certSANs
          value:
            - 172.30.11.10
        - op: add
          path: /machine/time
          value:
            disabled: false
            servers:
              - 172.30.110.1
        - op: replace
          path: /cluster/extraManifests
          value:
            - https://raw.githubusercontent.com/mologie/talos-vmtoolsd/master/deploy/unstable.yaml
        - op: add
          path: /machine/kubelet/extraArgs
          value:
            cloud-provider: external
      generateType: worker
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  labels:
    cluster.x-k8s.io/cluster-name: observability-cluster-poc
  name: observability-cluster-poc
  namespace: cluster-api-system
spec:
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
    kind: TalosControlPlane
    name: observability-cluster-poc
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: VSphereCluster
    name: observability-cluster-poc
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  labels:
    cluster.x-k8s.io/cluster-name: observability-cluster-poc
  name: observability-cluster-poc-md-0
  namespace: cluster-api-system
spec:
  clusterName: observability-cluster-poc
  replicas: 3
  selector:
    matchLabels: {}
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      labels:
        cluster.x-k8s.io/cluster-name: observability-cluster-poc
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
          kind: TalosConfigTemplate
          name: observability-cluster-poc-md-0
      clusterName: observability-cluster-poc
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: VSphereMachineTemplate
        name: observability-cluster-poc-worker
      version: v1.27.5
---
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: TalosControlPlane
metadata:
  name: observability-cluster-poc
  namespace: cluster-api-system
spec:
  controlPlaneConfig:
    controlplane:
      configPatches:
        - op: add
          path: /machine/network
          value:
            interfaces:
              - dhcp: true
                dhcpOptions:
                  routeMetric: 1
                interface: eth0
                vip:
                  ip: 172.30.11.10
              - dhcp: true
                dhcpOptions:
                  routeMetric: 10
                interface: eth1
        - op: add
          path: /machine/install
          value:
            extraKernelArgs:
              - net.ifnames=0
        - op: add
          path: /cluster/network/cni
          value:
            name: none
        - op: add
          path: /cluster/proxy
          value:
            disabled: true
        - op: add
          path: /machine/features/kubePrism
          value:
            enabled: true
            port: 7445
        - op: replace
          path: /cluster/controlPlane
          value:
            endpoint: https://172.30.11.10:6443
        - op: add
          path: /machine/certSANs
          value:
            - 172.30.11.10
        - op: add
          path: /cluster/coreDNS
          value:
            disabled: true
        - op: add
          path: /machine/time
          value:
            disabled: false
            servers:
              - 172.30.110.1
        - op: replace
          path: /cluster/extraManifests
          value:
            - https://raw.githubusercontent.com/mologie/talos-vmtoolsd/master/deploy/unstable.yaml
        - op: add
          path: /machine/kubelet/extraArgs
          value:
            cloud-provider: external
      generateType: controlplane
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: VSphereMachineTemplate
    name: observability-cluster-poc
  replicas: 3
  rolloutStrategy:
    rollingUpdate:
      maxSurge: 1
    type: RollingUpdate
  version: v1.27.6
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereCluster
metadata:
  name: observability-cluster-poc
  namespace: cluster-api-system
spec:
  controlPlaneEndpoint:
    host: 172.30.11.10
    port: 6443
  identityRef:
    kind: Secret
    name: observability-cluster-poc
  server: REDACTED
  thumbprint: REDACTED
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
  name: observability-cluster-poc
  namespace: cluster-api-system
spec:
  template:
    spec:
      cloneMode: linkedClone
      customVMXKeys:
        disk.EnableUUID: "true"
      datacenter: REDACTED
      datastore: REDACTED
      diskGiB: 25
      folder: cluster-api-vms
      memoryMiB: 8192
      network:
        devices:
          - dhcp4: true
            dhcp4Overrides:
              routeMetric: 1
            networkName: PLATFORM-PRODUCTION-OBSERVABILITY
          - dhcp4: true
            dhcp4Overrides:
              routeMetric: 10
            networkName: PRODUCTION
      numCPUs: 2
      os: Linux
      powerOffMode: hard
      resourcePool: Cluster-API-POC
      server: REDACTED
      storagePolicyName: ""
      tagIDs:
        - urn:vmomi:InventoryServiceTag:0fe8eb41-7a8f-47b3-a9fe-0d288ec787dd:GLOBAL
        - urn:vmomi:InventoryServiceTag:4495a9ce-727a-4814-b067-682b52130cad:GLOBAL
      template: talos-linux-1.5.2
      thumbprint: REDACTED
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
  name: observability-cluster-poc-worker
  namespace: cluster-api-system
spec:
  template:
    spec:
      cloneMode: linkedClone
      customVMXKeys:
        disk.EnableUUID: "true"
      datacenter: REDACTED
      datastore: REDACTED
      diskGiB: 25
      folder: cluster-api-vms
      memoryMiB: 8192
      network:
        devices:
          - dhcp4: true
            dhcp4Overrides:
              routeMetric: 1
            networkName: PLATFORM-PRODUCTION-OBSERVABILITY
          - dhcp4: true
            dhcp4Overrides:
              routeMetric: 10
            networkName: PRODUCTION
      numCPUs: 2
      os: Linux
      powerOffMode: hard
      resourcePool: Cluster-API-POC
      server: REDACTED
      storagePolicyName: ""
      tagIDs:
        - urn:vmomi:InventoryServiceTag:0fe8eb41-7a8f-47b3-a9fe-0d288ec787dd:GLOBAL
        - urn:vmomi:InventoryServiceTag:4495a9ce-727a-4814-b067-682b52130cad:GLOBAL
      template: talos-linux-1.5.2
      thumbprint: REDACTED
```
  4. Once the VMs are created, confirm that the bootstrap cannot proceed: VMware Tools cannot be installed without a bootstrapped cluster, and the bootstrap cannot happen either because no IP addresses are reported at the vCenter level, so the controllers cannot reach the VMs.
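From the transient cluster this can be observed on the CAPI Machine objects: their addresses stay empty, which matches the "no addresses were found" error in the CACPPT logs below (the column expression is just a sketch):

```sh
# Machines never get .status.addresses because CAPV cannot read guest IPs from vCenter
kubectl -n cluster-api-system get machines \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,ADDRESSES:.status.addresses
```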

Useful outputs/content

Talos console:

(screenshot)

vSphere machine (no IP reported because VMware Tools cannot be installed at this point):

(screenshot)

CACPPT logs:

2023-10-20T06:56:47Z    INFO    reconcile TalosControlPlane     {"controller": "taloscontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "TalosControlPlane", "TalosControlPlane": {"name":"observability-cluster-poc","namespace":"cluster-api-system"}, "namespace": "cluster-api-system", "name": "observability-cluster-poc", "reconcileID": "be96027a-b052-4819-bd53-8215a326733f", "cluster": "observability-cluster-poc"}
2023-10-20T06:56:47Z    INFO    controllers.TalosControlPlane   bootstrap failed, retrying in 20 seconds        {"namespace": "cluster-api-system", "talosControlPlane": "observability-cluster-poc", "error": "no addresses were found for node \"observability-cluster-poc-bzpgr\""}
2023-10-20T06:56:47Z    INFO    controllers.TalosControlPlane   attempting to set control plane status
2023-10-20T06:56:57Z    INFO    controllers.TalosControlPlane   failed to get kubeconfig for the cluster        {"error": "failed to create cluster accessor: error creating client for remote cluster \"cluster-api-system/observability-cluster-poc\": error getting rest mapping: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://172.30.11.10:6443/api/v1?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)", "errorVerbose": "failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://172.30.11.10:6443/api/v1?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\nerror creating client for remote cluster \"cluster-api-system/observability-cluster-poc\": error getting rest mapping\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).createClient\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.5.0/controllers/remote/cluster_cache_tracker.go:396\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).newClusterAccessor\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.5.0/controllers/remote/cluster_cache_tracker.go:299\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).getClusterAccessor\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.5.0/controllers/remote/cluster_cache_tracker.go:273\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).GetClient\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.5.0/controllers/remote/cluster_cache_tracker.go:180\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).updateStatus\n\t/src/controllers/taloscontrolplane_controller.go:562\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).Reconcile.func1\n\t/src/controllers/taloscontrolplane_controller.go:155\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).Reconcile\n\t/src/controllers/taloscontrolplane_controller.go:184\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/toolchain/go/src/runtime/asm_amd64.s:1598\nfailed to create cluster 
accessor\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).getClusterAccessor\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.5.0/controllers/remote/cluster_cache_tracker.go:275\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).GetClient\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.5.0/controllers/remote/cluster_cache_tracker.go:180\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).updateStatus\n\t/src/controllers/taloscontrolplane_controller.go:562\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).Reconcile.func1\n\t/src/controllers/taloscontrolplane_controller.go:155\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).Reconcile\n\t/src/controllers/taloscontrolplane_controller.go:184\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/toolchain/go/src/runtime/asm_amd64.s:1598"}
2023-10-20T06:56:57Z    INFO    controllers.TalosControlPlane   successfully updated control plane status       {"namespace": "cluster-api-system", "talosControlPlane": "observability-cluster-poc", "cluster": "observability-cluster-poc"}
2023-10-20T06:56:57Z    INFO    reconcile TalosControlPlane     {"controller": "taloscontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "TalosControlPlane", "TalosControlPlane": {"name":"observability-cluster-poc","namespace":"cluster-api-system"}, "namespace": "cluster-api-system", "name": "observability-cluster-poc", "reconcileID": "2bb6e4b1-8a51-4c48-b463-eb6b0a915de8", "cluster": "observability-cluster-poc"}
2023-10-20T06:56:57Z    INFO    controllers.TalosControlPlane   bootstrap failed, retrying in 20 seconds        {"namespace": "cluster-api-system", "talosControlPlane": "observability-cluster-poc", "error": "no addresses were found for node \"observability-cluster-poc-bzpgr\""}
2023-10-20T06:56:57Z    INFO    controllers.TalosControlPlane   attempting to set control plane status

Thanks in advance for your help and insights.

smira commented 10 months ago

It was discussed in the community Slack, but it didn't quite go that far.

VMware users need to reimplement vmtoolsd as a Talos system extension (and an extension service); that way it will always run with the machine.
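Roughly, such an extension would ship the vmtoolsd binary in the Talos root filesystem and declare an extension service so it starts on every boot, independently of Kubernetes. A sketch only, not the actual extension; the field names follow the Talos extension services documentation, and the paths/dependencies here are assumptions:

```yaml
# /usr/local/etc/containers/vmtoolsd.yaml inside the extension image (sketch)
name: vmtoolsd
container:
  entrypoint: /usr/local/bin/talos-vmtoolsd
depends:
  - network:
      - addresses
      - connectivity
restart: always
```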

Another option is to make Talos itself report IPs, if we can do that without pulling all the VMware libraries in.

sempex commented 4 months ago

Hi everyone, I'm facing the same problem right now. Are there any updates or instructions to follow to work around this?

amaol-vestas commented 1 week ago

Also interested in seeing a fix for this issue, thanks!

amaol-vestas commented 1 week ago

I found a way to deploy: build a Talos OS image with vmtoolsd installed by default using the Talos Image Factory, and then use that image as the baseline template for the deployment. Please check https://factory.talos.dev/.
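Concretely, this amounts to creating a schematic that includes the VMware guest-agent extension and using the resulting OVA as the vSphere template. A sketch, assuming the official `siderolabs/vmtoolsd-guest-agent` extension and the current Factory endpoints (double-check against factory.talos.dev):

```sh
# schematic.yaml: bake the vmtoolsd guest agent into the Talos image
cat > schematic.yaml <<'EOF'
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/vmtoolsd-guest-agent
EOF

# Upload the schematic; the response contains a schematic ID
curl -s -X POST --data-binary @schematic.yaml https://factory.talos.dev/schematics

# Then download the OVA for that schematic/Talos version and use it as the
# VSphereMachineTemplate "template", e.g. (placeholder ID):
# https://factory.talos.dev/image/<schematic-id>/v1.5.2/vmware-amd64.ova
```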