kubernetes-sigs / cluster-api-provider-azure

Cluster API implementation for Microsoft Azure
https://capz.sigs.k8s.io/
Apache License 2.0

Nodes fail to come up when using custom CA and Kubeconfig #2778

Closed by karansinghneu 1 year ago

karansinghneu commented 1 year ago

/kind bug

Before submitting an issue, have you checked the Troubleshooting Guide? Yes

What steps did you take and what happened:

  1. Created a bootstrap kind cluster
  2. Provisioned a Target Management Cluster that has a VM identity using ServicePrincipal
  3. Bootstrap and pivot to Target Management Cluster
  4. Deleted Kind cluster
  5. Created CA certs and used them to create a kubeconfig
  6. Mounted CA certs and kubeconfig as secrets onto the management cluster
  7. Provisioned a Workload cluster from the management cluster that has a VM identity using UserAssignedManagedIdentity, uses custom ca certs and kubeconfig, and custom frontEndIPs name, ip name and fqdn
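For context on steps 5 and 6: Cluster API consumes a custom CA via a `<cluster-name>-ca` secret with `tls.crt`/`tls.key` keys. A minimal sketch of how such a secret can be produced (the secret name matches this cluster, but the exact commands are an assumption, not the reporter's actual script):

```shell
# Sketch only: generate a self-signed cluster CA and wrap it in the
# <cluster-name>-ca secret format Cluster API looks up (tls.crt/tls.key keys).
openssl req -x509 -new -newkey rsa:2048 -nodes -sha256 -days 3650 \
  -subj "/CN=kubernetes" -keyout tls.key -out tls.crt

# Build the secret manifest offline; apply it to the management cluster
# later with `kubectl apply -f ca-secret.yaml`.
cat > ca-secret.yaml <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: capz-acr-cluster-workload-2-ca
  namespace: default
type: kubernetes.io/tls
data:
  tls.crt: $(base64 -w0 < tls.crt)
  tls.key: $(base64 -w0 < tls.key)
EOF
```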

What did you expect to happen: Workload cluster to provision successfully

Anything else you would like to add:

  1. The control plane node is Provisioned and has a provider ID, but no node name
  2. The worker nodes have neither a provider ID nor a node name
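On the provider ID vs. node name split above: CAPZ sets the provider ID as soon as the Azure VM exists, while the node name only appears once the kubelet registers and the Machine is matched to a Node, so "provider ID but no node name" usually means the VM is up but never joined the cluster. The provider ID itself just encodes the VM's Azure resource path, e.g. (illustrative value, subscription ID shortened as in this report):

```shell
# Illustrative provider ID layout for an AzureMachine
provider_id="azure:///subscriptions/a8a17819/resourceGroups/capz-acr-cluster-workload-2/providers/Microsoft.Compute/virtualMachines/capz-acr-cluster-workload-2-control-plane-wwr6v"

# The last path segment is the VM name the cloud provider matches nodes against
vm_name="${provider_id##*/}"
echo "$vm_name"   # capz-acr-cluster-workload-2-control-plane-wwr6v
```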

Environment:

CecileRobertMichon commented 1 year ago

@karansinghneu can you please upload any logs that you've collected (controller logs and cloud-init logs would be helpful - see https://capz.sigs.k8s.io/topics/troubleshooting.html) and the cluster YAML spec you used for the AzureCluster?

(make sure to redact any secrets)

karansinghneu commented 1 year ago

@CecileRobertMichon Controller logs:

```
I1107 22:57:13.937207 1 azuremachine_controller.go:243] controllers.AzureMachineReconciler.reconcileNormal "msg"="Reconciling AzureMachine" "azureMachine"={"name":"capz-acr-cluster-workload-1-control-plane-6xc4w","namespace":"default"} "controller"="azuremachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureMachine" "name"="capz-acr-cluster-workload-1-control-plane-6xc4w" "namespace"="default" "reconcileID"="60fcfd13-cb41-452a-883e-288cfb3bcbc0" "x-ms-correlation-request-id"="dd9ef87a-942f-4d58-971a-3e39b98684d3"
I1107 22:57:13.938142 1 machine.go:655] scope.MachineScope.GetVMImage "msg"="No image specified for machine, using default Linux Image" "azureMachine"={"name":"capz-acr-cluster-workload-1-control-plane-6xc4w","namespace":"default"} "controller"="azuremachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureMachine" "name"="capz-acr-cluster-workload-1-control-plane-6xc4w" "namespace"="default" "reconcileID"="60fcfd13-cb41-452a-883e-288cfb3bcbc0" "x-ms-correlation-request-id"="dd9ef87a-942f-4d58-971a-3e39b98684d3" "machine"="capz-acr-cluster-workload-1-control-plane-6xc4w"
I1107 22:57:13.938245 1 images.go:124] virtualmachineimages.Service.getSKUAndVersion "msg"="Getting VM image SKU and version" "azureMachine"={"name":"capz-acr-cluster-workload-1-control-plane-6xc4w","namespace":"default"} "controller"="azuremachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureMachine" "name"="capz-acr-cluster-workload-1-control-plane-6xc4w" "namespace"="default" "reconcileID"="60fcfd13-cb41-452a-883e-288cfb3bcbc0" "x-ms-correlation-request-id"="dd9ef87a-942f-4d58-971a-3e39b98684d3" "k8sVersion"="v1.25.0" "location"="australiaeast" "offer"="capi" "osAndVersion"="ubuntu-2004" "publisher"="cncf-upstream"
I1107 22:57:13.938331 1 cache.go:122] virtualmachineimages.Cache.Get "msg"="VM images cache hit" "azureMachine"={"name":"capz-acr-cluster-workload-1-control-plane-6xc4w","namespace":"default"} "controller"="azuremachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureMachine" "name"="capz-acr-cluster-workload-1-control-plane-6xc4w" "namespace"="default" "reconcileID"="60fcfd13-cb41-452a-883e-288cfb3bcbc0" "x-ms-correlation-request-id"="dd9ef87a-942f-4d58-971a-3e39b98684d3" "location"="australiaeast" "offer"="capi" "publisher"="cncf-upstream" "sku"="ubuntu-2004-gen1"
I1107 22:57:13.938403 1 images.go:176] virtualmachineimages.Service.getSKUAndVersion "msg"="Found VM image SKU and version" "azureMachine"={"name":"capz-acr-cluster-workload-1-control-plane-6xc4w","namespace":"default"} "controller"="azuremachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureMachine" "name"="capz-acr-cluster-workload-1-control-plane-6xc4w" "namespace"="default" "reconcileID"="60fcfd13-cb41-452a-883e-288cfb3bcbc0" "x-ms-correlation-request-id"="dd9ef87a-942f-4d58-971a-3e39b98684d3" "location"="australiaeast" "offer"="capi" "publisher"="cncf-upstream" "sku"="ubuntu-2004-gen1" "version"="125.0.20220824"
I1107 22:57:13.939590 1 machine.go:655] scope.MachineScope.GetVMImage "msg"="No image specified for machine, using default Linux Image" "azureMachine"={"name":"capz-acr-cluster-workload-1-control-plane-r5b7j","namespace":"default"} "controller"="azuremachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureMachine" "name"="capz-acr-cluster-workload-1-control-plane-r5b7j" "namespace"="default" "reconcileID"="dfe5729b-701f-47d6-bfb5-2a97d61e150d" "x-ms-correlation-request-id"="37e77e38-10a2-42b5-a9f6-5530c55cff8e" "machine"="capz-acr-cluster-workload-1-control-plane-r5b7j"
I1107 22:57:13.941461 1 azuremachine_controller.go:243] controllers.AzureMachineReconciler.reconcileNormal "msg"="Reconciling AzureMachine" "azureMachine"={"name":"capz-acr-cluster-workload-1-control-plane-d72hp","namespace":"default"} "controller"="azuremachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureMachine" "name"="capz-acr-cluster-workload-1-control-plane-d72hp" "namespace"="default" "reconcileID"="fbe64a85-3b54-4479-a1ca-bda436260e4b" "x-ms-correlation-request-id"="791b2ed9-5e25-4a0f-8635-1a24ac5a0f8c"
I1107 22:57:13.944061 1 azuremachine_controller.go:243] controllers.AzureMachineReconciler.reconcileNormal "msg"="Reconciling AzureMachine" "azureMachine"={"name":"capz-acr-cluster-workload-1-md-0-x59mx","namespace":"default"} "controller"="azuremachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureMachine" "name"="capz-acr-cluster-workload-1-md-0-x59mx" "namespace"="default" "reconcileID"="632b9667-aff5-4f6c-9210-c93d633fd0dc" "x-ms-correlation-request-id"="f046dd0c-e4d2-48e8-b5de-b0036afd0e60"
I1107 22:57:13.949187 1 images.go:124] virtualmachineimages.Service.getSKUAndVersion "msg"="Getting VM image SKU and version" "azureMachine"={"name":"capz-acr-cluster-workload-1-control-plane-r5b7j","namespace":"default"} "controller"="azuremachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureMachine" "name"="capz-acr-cluster-workload-1-control-plane-r5b7j" "namespace"="default" "reconcileID"="dfe5729b-701f-47d6-bfb5-2a97d61e150d" "x-ms-correlation-request-id"="37e77e38-10a2-42b5-a9f6-5530c55cff8e" "k8sVersion"="v1.25.0" "location"="australiaeast" "offer"="capi" "osAndVersion"="ubuntu-2004" "publisher"="cncf-upstream"
I1107 22:57:13.945320 1 azuremachine_controller.go:243] controllers.AzureMachineReconciler.reconcileNormal "msg"="Reconciling AzureMachine" "azureMachine"={"name":"capz-acr-cluster-workload-2-control-plane-wwr6v","namespace":"default"} "controller"="azuremachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureMachine" "name"="capz-acr-cluster-workload-2-control-plane-wwr6v" "namespace"="default" "reconcileID"="9c5fa450-e2ae-471e-befa-1b566a8f60d0" "x-ms-correlation-request-id"="5c6f5b6e-9e9c-4380-aec5-d2e7ca93f816"
I1107 22:57:13.949270 1 cache.go:122] virtualmachineimages.Cache.Get "msg"="VM images cache hit" "azureMachine"={"name":"capz-acr-cluster-workload-1-control-plane-r5b7j","namespace":"default"} "controller"="azuremachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureMachine" "name"="capz-acr-cluster-workload-1-control-plane-r5b7j" "namespace"="default" "reconcileID"="dfe5729b-701f-47d6-bfb5-2a97d61e150d" "x-ms-correlation-request-id"="37e77e38-10a2-42b5-a9f6-5530c55cff8e" "location"="australiaeast" "offer"="capi" "publisher"="cncf-upstream" "sku"="ubuntu-2004-gen1"
I1107 22:57:13.949344 1 images.go:176] virtualmachineimages.Service.getSKUAndVersion "msg"="Found VM image SKU and version" "azureMachine"={"name":"capz-acr-cluster-workload-1-control-plane-r5b7j","namespace":"default"} "controller"="azuremachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureMachine" "name"="capz-acr-cluster-workload-1-control-plane-r5b7j" "namespace"="default" "reconcileID"="dfe5729b-701f-47d6-bfb5-2a97d61e150d" "x-ms-correlation-request-id"="37e77e38-10a2-42b5-a9f6-5530c55cff8e" "location"="australiaeast" "offer"="capi" "publisher"="cncf-upstream" "sku"="ubuntu-2004-gen1" "version"="125.0.20220824"
I1107 22:57:13.957569 1 machine.go:655] scope.MachineScope.GetVMImage "msg"="No image specified for machine, using default Linux Image" "azureMachine"={"name":"capz-acr-cluster-workload-2-control-plane-wwr6v","namespace":"default"} "controller"="azuremachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureMachine" "name"="capz-acr-cluster-workload-2-control-plane-wwr6v" "namespace"="default" "reconcileID"="9c5fa450-e2ae-471e-befa-1b566a8f60d0" "x-ms-correlation-request-id"="5c6f5b6e-9e9c-4380-aec5-d2e7ca93f816" "machine"="capz-acr-cluster-workload-2-control-plane-wwr6v"
I1107 22:57:13.957655 1 images.go:124] virtualmachineimages.Service.getSKUAndVersion "msg"="Getting VM image SKU and version" "azureMachine"={"name":"capz-acr-cluster-workload-2-control-plane-wwr6v","namespace":"default"} "controller"="azuremachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureMachine" "name"="capz-acr-cluster-workload-2-control-plane-wwr6v" "namespace"="default" "reconcileID"="9c5fa450-e2ae-471e-befa-1b566a8f60d0" "x-ms-correlation-request-id"="5c6f5b6e-9e9c-4380-aec5-d2e7ca93f816" "k8sVersion"="v1.25.0" "location"="australiaeast" "offer"="capi" "osAndVersion"="ubuntu-2004" "publisher"="cncf-upstream"
I1107 22:57:13.957685 1 machine.go:655] scope.MachineScope.GetVMImage "msg"="No image specified for machine, using default Linux Image" "azureMachine"={"name":"capz-acr-cluster-workload-1-control-plane-d72hp","namespace":"default"} "controller"="azuremachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureMachine" "name"="capz-acr-cluster-workload-1-control-plane-d72hp" "namespace"="default" "reconcileID"="fbe64a85-3b54-4479-a1ca-bda436260e4b" "x-ms-correlation-request-id"="791b2ed9-5e25-4a0f-8635-1a24ac5a0f8c" "machine"="capz-acr-cluster-workload-1-control-plane-d72hp"
I1107 22:57:13.957729 1 cache.go:122] virtualmachineimages.Cache.Get "msg"="VM images cache hit" "azureMachine"={"name":"capz-acr-cluster-workload-2-control-plane-wwr6v","namespace":"default"} "controller"="azuremachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureMachine" "name"="capz-acr-cluster-workload-2-control-plane-wwr6v" "namespace"="default" "reconcileID"="9c5fa450-e2ae-471e-befa-1b566a8f60d0" "x-ms-correlation-request-id"="5c6f5b6e-9e9c-4380-aec5-d2e7ca93f816" "location"="australiaeast" "offer"="capi" "publisher"="cncf-upstream" "sku"="ubuntu-2004-gen1"
I1107 22:57:13.957787 1 images.go:124] virtualmachineimages.Service.getSKUAndVersion "msg"="Getting VM image SKU and version" "azureMachine"={"name":"capz-acr-cluster-workload-1-control-plane-d72hp","namespace":"default"} "controller"="azuremachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureMachine" "name"="capz-acr-cluster-workload-1-control-plane-d72hp" "namespace"="default" "reconcileID"="fbe64a85-3b54-4479-a1ca-bda436260e4b" "x-ms-correlation-request-id"="791b2ed9-5e25-4a0f-8635-1a24ac5a0f8c" "k8sVersion"="v1.25.0" "location"="australiaeast" "offer"="capi" "osAndVersion"="ubuntu-2004" "publisher"="cncf-upstream"
I1107 22:57:13.957799 1 images.go:176] virtualmachineimages.Service.getSKUAndVersion "msg"="Found VM image SKU and version" "azureMachine"={"name":"capz-acr-cluster-workload-2-control-plane-wwr6v","namespace":"default"} "controller"="azuremachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureMachine" "name"="capz-acr-cluster-workload-2-control-plane-wwr6v" "namespace"="default" "reconcileID"="9c5fa450-e2ae-471e-befa-1b566a8f60d0" "x-ms-correlation-request-id"="5c6f5b6e-9e9c-4380-aec5-d2e7ca93f816" "location"="australiaeast" "offer"="capi" "publisher"="cncf-upstream" "sku"="ubuntu-2004-gen1" "version"="125.0.20220824"
I1107 22:57:13.957875 1 cache.go:122] virtualmachineimages.Cache.Get "msg"="VM images cache hit" "azureMachine"={"name":"capz-acr-cluster-workload-1-control-plane-d72hp","namespace":"default"} "controller"="azuremachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureMachine" "name"="capz-acr-cluster-workload-1-control-plane-d72hp" "namespace"="default" "reconcileID"="fbe64a85-3b54-4479-a1ca-bda436260e4b" "x-ms-correlation-request-id"="791b2ed9-5e25-4a0f-8635-1a24ac5a0f8c" "location"="australiaeast" "offer"="capi" "publisher"="cncf-upstream" "sku"="ubuntu-2004-gen1"
I1107 22:57:13.957961 1 images.go:176] virtualmachineimages.Service.getSKUAndVersion "msg"="Found VM image SKU and version" "azureMachine"={"name":"capz-acr-cluster-workload-1-control-plane-d72hp","namespace":"default"} "controller"="azuremachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureMachine" "name"="capz-acr-cluster-workload-1-control-plane-d72hp" "namespace"="default" "reconcileID"="fbe64a85-3b54-4479-a1ca-bda436260e4b" "x-ms-correlation-request-id"="791b2ed9-5e25-4a0f-8635-1a24ac5a0f8c" "location"="australiaeast" "offer"="capi" "publisher"="cncf-upstream" "sku"="ubuntu-2004-gen1" "version"="125.0.20220824"
I1107 22:57:13.959239 1 machine.go:655] scope.MachineScope.GetVMImage "msg"="No image specified for machine, using default Linux Image" "azureMachine"={"name":"capz-acr-cluster-workload-1-md-0-x59mx","namespace":"default"} "controller"="azuremachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureMachine" "name"="capz-acr-cluster-workload-1-md-0-x59mx" "namespace"="default" "reconcileID"="632b9667-aff5-4f6c-9210-c93d633fd0dc" "x-ms-correlation-request-id"="f046dd0c-e4d2-48e8-b5de-b0036afd0e60" "machine"="capz-acr-cluster-workload-1-md-0-x59mx"
I1107 22:57:13.959538 1 images.go:124] virtualmachineimages.Service.getSKUAndVersion "msg"="Getting VM image SKU and version" "azureMachine"={"name":"capz-acr-cluster-workload-1-md-0-x59mx","namespace":"default"} "controller"="azuremachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureMachine" "name"="capz-acr-cluster-workload-1-md-0-x59mx" "namespace"="default" "reconcileID"="632b9667-aff5-4f6c-9210-c93d633fd0dc" "x-ms-correlation-request-id"="f046dd0c-e4d2-48e8-b5de-b0036afd0e60" "k8sVersion"="v1.25.0" "location"="australiaeast" "offer"="capi" "osAndVersion"="ubuntu-2004" "publisher"="cncf-upstream"
I1107 22:57:13.960282 1 cache.go:122] virtualmachineimages.Cache.Get "msg"="VM images cache hit" "azureMachine"={"name":"capz-acr-cluster-workload-1-md-0-x59mx","namespace":"default"} "controller"="azuremachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureMachine" "name"="capz-acr-cluster-workload-1-md-0-x59mx" "namespace"="default" "reconcileID"="632b9667-aff5-4f6c-9210-c93d633fd0dc" "x-ms-correlation-request-id"="f046dd0c-e4d2-48e8-b5de-b0036afd0e60" "location"="australiaeast" "offer"="capi" "publisher"="cncf-upstream" "sku"="ubuntu-2004-gen1"
I1107 22:57:13.960378 1 images.go:176] virtualmachineimages.Service.getSKUAndVersion "msg"="Found VM image SKU and version" "azureMachine"={"name":"capz-acr-cluster-workload-1-md-0-x59mx","namespace":"default"} "controller"="azuremachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureMachine" "name"="capz-acr-cluster-workload-1-md-0-x59mx" "namespace"="default" "reconcileID"="632b9667-aff5-4f6c-9210-c93d633fd0dc" "x-ms-correlation-request-id"="f046dd0c-e4d2-48e8-b5de-b0036afd0e60" "location"="australiaeast" "offer"="capi" "publisher"="cncf-upstream" "sku"="ubuntu-2004-gen1" "version"="125.0.20220824"
```

```
$ kubectl get azuremachines
capz-acr-cluster-mgmt-control-plane-xrtk9         True    Succeeded
capz-acr-cluster-mgmt-md-0-pqdnb                  True    Succeeded
capz-acr-cluster-workload-1-control-plane-6xc4w   True    Succeeded
capz-acr-cluster-workload-1-control-plane-d72hp   True    Succeeded
capz-acr-cluster-workload-1-control-plane-r5b7j   True    Succeeded
capz-acr-cluster-workload-1-md-0-2t9t8            True    Succeeded
capz-acr-cluster-workload-1-md-0-nldvq            True    Succeeded
capz-acr-cluster-workload-1-md-0-x59mx            True    Succeeded
capz-acr-cluster-workload-2-control-plane-wwr6v   True    Succeeded
capz-acr-cluster-workload-2-md-0-hfd6v            False   WaitingForBootstrapData
```

From the control plane node:

```
$ kubectl get azuremachines
The connection to the server localhost:8080 was refused - did you specify the right host or port?
```
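The localhost:8080 refusal is just kubectl falling back to its no-kubeconfig default; it doesn't indicate a cluster problem by itself. On a kubeadm control-plane node you have to point kubectl at the admin kubeconfig first (and `get azuremachines` only works against the management cluster, where the CAPZ CRDs live):

```shell
# Without a kubeconfig, kubectl falls back to http://localhost:8080 and fails.
# On the control-plane node, use the admin kubeconfig written by kubeadm:
export KUBECONFIG=/etc/kubernetes/admin.conf
# then e.g. `kubectl get nodes` here; run `kubectl get azuremachines`
# from the management cluster instead, which owns the AzureMachine objects.
echo "$KUBECONFIG"
```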

```
$ less /var/log/cloud-init-output.log
[2022-11-04 21:39:42] Generating public/private rsa key pair.
[2022-11-04 21:39:42] Your identification has been saved in /etc/ssh/ssh_host_rsa_key
[2022-11-04 21:39:42] Your public key has been saved in /etc/ssh/ssh_host_rsa_key.pub
[2022-11-04 21:39:42] The key fingerprint is:
[2022-11-04 21:39:42] SHA256: root@capz-acr-cluster-workload-2-control-plane-wwr6v
[2022-11-04 21:39:42] The key's randomart image is:
...
...
[2022-11-04 21:39:42] Generating public/private dsa key pair.
[2022-11-04 21:39:42] Your identification has been saved in /etc/ssh/ssh_host_dsa_key
[2022-11-04 21:39:42] Your public key has been saved in /etc/ssh/ssh_host_dsa_key.pub
[2022-11-04 21:39:42] The key fingerprint is:
[2022-11-04 21:39:42] SHA256: root@capz-acr-cluster-workload-2-control-plane-wwr6v
[2022-11-04 21:39:42] The key's randomart image is:
...
...
[2022-11-04 21:39:42] Generating public/private ecdsa key pair.
[2022-11-04 21:39:42] Your identification has been saved in /etc/ssh/ssh_host_ecdsa_key
[2022-11-04 21:39:42] Your public key has been saved in /etc/ssh/ssh_host_ecdsa_key.pub
[2022-11-04 21:39:42] The key fingerprint is:
[2022-11-04 21:39:42] SHA256: root@capz-acr-cluster-workload-2-control-plane-wwr6v
[2022-11-04 21:39:42] The key's randomart image is:
...
...
[2022-11-04 21:39:42] Generating public/private ed25519 key pair.
[2022-11-04 21:39:42] Your identification has been saved in /etc/ssh/ssh_host_ed25519_key
[2022-11-04 21:39:42] Your public key has been saved in /etc/ssh/ssh_host_ed25519_key.pub
[2022-11-04 21:39:42] The key fingerprint is:
[2022-11-04 21:39:42] SHA256: root@capz-acr-cluster-workload-2-control-plane-wwr6v
[2022-11-04 21:39:42] The key's randomart image is:
...
...
[2022-11-04 21:39:50] Cloud-init v. 22.2-0ubuntu1~20.04.3 running 'modules:config' at Fri, 04 Nov 2022 21:39:49 +0000. Up 26.88 seconds.
[2022-11-04 21:39:55] [init] Using Kubernetes version: v1.25.0
[2022-11-04 21:39:55] [preflight] Running pre-flight checks
[2022-11-04 21:39:59] [preflight] Pulling images required for setting up a Kubernetes cluster
[2022-11-04 21:39:59] [preflight] This might take a minute or two, depending on the speed of your internet connection
[2022-11-04 21:39:59] [preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[2022-11-04 21:39:59] [certs] Using certificateDir folder "/etc/kubernetes/pki"
[2022-11-04 21:39:59] [certs] Using existing ca certificate authority
[2022-11-04 21:39:59] [certs] Generating "apiserver" certificate and key
[2022-11-04 21:39:59] [certs] apiserver serving cert is signed for DNS names [capz-acr-cluster-workload-2-control-plane-wwr6v capz-acr-cluster-workload-2-pdns.australiaeast.cloudapp.azure.com kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 10.0.0.4]
[2022-11-04 21:39:59] [certs] Generating "apiserver-kubelet-client" certificate and key
[2022-11-04 21:39:59] [certs] Using existing front-proxy-ca certificate authority
[2022-11-04 21:39:59] [certs] Generating "front-proxy-client" certificate and key
[2022-11-04 21:39:59] [certs] Using existing etcd/ca certificate authority
[2022-11-04 21:40:00] [certs] Generating "etcd/server" certificate and key
[2022-11-04 21:40:00] [certs] etcd/server serving cert is signed for DNS names [capz-acr-cluster-workload-2-control-plane-wwr6v localhost] and IPs [10.0.0.4 127.0.0.1 ::1]
[2022-11-04 21:40:00] [certs] Generating "etcd/peer" certificate and key
[2022-11-04 21:40:00] [certs] etcd/peer serving cert is signed for DNS names [capz-acr-cluster-workload-2-control-plane-wwr6v localhost] and IPs [10.0.0.4 127.0.0.1 ::1]
[2022-11-04 21:40:00] [certs] Generating "etcd/healthcheck-client" certificate and key
[2022-11-04 21:40:00] [certs] Generating "apiserver-etcd-client" certificate and key
[2022-11-04 21:40:00] [certs] Using the existing "sa" key
[2022-11-04 21:40:00] [kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[2022-11-04 21:40:01] [kubeconfig] Writing "admin.conf" kubeconfig file
[2022-11-04 21:40:01] [kubeconfig] Writing "kubelet.conf" kubeconfig file
[2022-11-04 21:40:01] [kubeconfig] Writing "controller-manager.conf" kubeconfig file
[2022-11-04 21:40:01] [kubeconfig] Writing "scheduler.conf" kubeconfig file
[2022-11-04 21:40:01] [kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[2022-11-04 21:40:01] [kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[2022-11-04 21:40:01] [kubelet-start] Starting the kubelet
[2022-11-04 21:40:02] [control-plane] Using manifest folder "/etc/kubernetes/manifests"
[2022-11-04 21:40:02] [control-plane] Creating static Pod manifest for "kube-apiserver"
[2022-11-04 21:40:02] [control-plane] Creating static Pod manifest for "kube-controller-manager"
[2022-11-04 21:40:02] [control-plane] Creating static Pod manifest for "kube-scheduler"
[2022-11-04 21:40:02] [etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[2022-11-04 21:40:02] [wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 20m0s
[2022-11-04 21:40:35] [apiclient] All control plane components are healthy after 32.567248 seconds
[2022-11-04 21:40:35] [upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[2022-11-04 21:40:35] [kubelet] Creating a ConfigMap "kubelet-config" in namespace kube-system with the configuration for the kubelets in the cluster
[2022-11-04 21:40:35] [upload-certs] Skipping phase. Please see --upload-certs
[2022-11-04 21:40:35] [mark-control-plane] Marking the node capz-acr-cluster-workload-2-control-plane-wwr6v as control-plane by adding the labels: [node-role.kubernetes.io/control-plane node.kubernetes.io/exclude-from-external-load-balancers]
[2022-11-04 21:40:35] [mark-control-plane] Marking the node capz-acr-cluster-workload-2-control-plane-wwr6v as control-plane by adding the taints [node-role.kubernetes.io/control-plane:NoSchedule]
[2022-11-04 21:40:36] [bootstrap-token] Using token: evxv6y.vdzajrctk6p0jn8h
[2022-11-04 21:40:36] [bootstrap-token] Configuring bootstrap tokens, cluster-info ConfigMap, RBAC Roles
[2022-11-04 21:40:36] [bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to get nodes
[2022-11-04 21:40:36] [bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[2022-11-04 21:40:36] [bootstrap-token] Configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[2022-11-04 21:40:36] [bootstrap-token] Configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[2022-11-04 21:40:36] [bootstrap-token] Creating the "cluster-info" ConfigMap in the "kube-public" namespace
[2022-11-04 21:40:36] [kubelet-finalize] Updating "/etc/kubernetes/kubelet.conf" to point to a rotatable kubelet client certificate and key
[2022-11-04 21:40:36] [addons] Applied essential addon: CoreDNS
[2022-11-04 21:40:36] [addons] Applied essential addon: kube-proxy
[2022-11-04 21:40:36]
[2022-11-04 21:40:36] Your Kubernetes control-plane has initialized successfully!
[2022-11-04 21:40:36]
[2022-11-04 21:40:36] To start using your cluster, you need to run the following as a regular user:
[2022-11-04 21:40:36]
[2022-11-04 21:40:36]   mkdir -p $HOME/.kube
[2022-11-04 21:40:36]   sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
[2022-11-04 21:40:36]   sudo chown $(id -u):$(id -g) $HOME/.kube/config
[2022-11-04 21:40:36]
[2022-11-04 21:40:36] Alternatively, if you are the root user, you can run:
[2022-11-04 21:40:36]
[2022-11-04 21:40:36]   export KUBECONFIG=/etc/kubernetes/admin.conf
[2022-11-04 21:40:36]
[2022-11-04 21:40:36] You should now deploy a pod network to the cluster.
[2022-11-04 21:40:36] Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
[2022-11-04 21:40:36]   https://kubernetes.io/docs/concepts/cluster-administration/addons/
[2022-11-04 21:40:36]
[2022-11-04 21:40:36] You can now join any number of control-plane nodes by copying certificate authorities
[2022-11-04 21:40:36] and service account keys on each node and then running the following as root:
[2022-11-04 21:40:36]
[2022-11-04 21:40:36]   kubeadm join capz-acr-cluster-workload-2-pdns.australiaeast.cloudapp.azure.com:6443 --token evxv6y.vdzajrctk6p0jn8h \
[2022-11-04 21:40:36]     --discovery-token-ca-cert-hash sha256:1f408c6bd2e95036ceeca613eaf3340452abd3c376de7b9852d6729148e9ba13 \
[2022-11-04 21:40:36]     --control-plane
[2022-11-04 21:40:36]
[2022-11-04 21:40:36] Then you can join any number of worker nodes by running the following on each as root:
[2022-11-04 21:40:36]
[2022-11-04 21:40:36]   kubeadm join capz-acr-cluster-workload-2-pdns.australiaeast.cloudapp.azure.com:6443 --token evxv6y.vdzajrctk6p0jn8h \
[2022-11-04 21:40:36]     --discovery-token-ca-cert-hash sha256:1f408c6bd2e95036ceeca613eaf3340452abd3c376de7b9852d6729148e9ba13
[2022-11-04 21:40:36] Cloud-init v. 22.2-0ubuntu1~20.04.3 running 'modules:final' at Fri, 04 Nov 2022 21:39:51 +0000. Up 29.01 seconds.
[2022-11-04 21:40:36] Cloud-init v. 22.2-0ubuntu1~20.04.3 finished at Fri, 04 Nov 2022 21:40:36 +0000. Datasource DataSourceAzure [seed=/dev/sr0]. Up 74.55 seconds
```

YAML Spec:

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  labels:
    cni: calico
  name: capz-acr-cluster-workload-2
  namespace: default
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - 192.168.0.0/16
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: capz-acr-cluster-workload-2-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AzureCluster
    name: capz-acr-cluster-workload-2
---

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureCluster
metadata:
  name: capz-acr-cluster-workload-2
  namespace: default
spec:
  identityRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AzureClusterIdentity
    name: dogfood5-acr-custom-script-identity
    namespace: default
  location: australiaeast
  networkSpec:
    apiServerLB:
      type: Public
      frontendIPs:
        - name: capz-acr-cluster-workload-2-public-lb-frontEnd
          publicIP:
            name: pip-capz-acr-cluster-workload-2-apiserver
            dnsName: capz-acr-cluster-workload-2-pdns.australiaeast.cloudapp.azure.com
    subnets:
    - name: control-plane-subnet
      role: control-plane
    - name: node-subnet
      natGateway:
        name: node-natgateway
      role: node
    vnet:
      name: capz-acr-cluster-workload-2-vnet
  resourceGroup: capz-acr-cluster-workload-2
  subscriptionID: a8a17819
---

apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: capz-acr-cluster-workload-2-control-plane
  namespace: default
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      apiServer:
        extraArgs:
          cloud-config: /etc/kubernetes/azure.json
          cloud-provider: azure
        extraVolumes:
        - hostPath: /etc/kubernetes/azure.json
          mountPath: /etc/kubernetes/azure.json
          name: cloud-config
          readOnly: true
        timeoutForControlPlane: 20m
      controllerManager:
        extraArgs:
          allocate-node-cidrs: "false"
          cloud-config: /etc/kubernetes/azure.json
          cloud-provider: azure
          cluster-name: capz-acr-cluster-workload-2
        extraVolumes:
        - hostPath: /etc/kubernetes/azure.json
          mountPath: /etc/kubernetes/azure.json
          name: cloud-config
          readOnly: true
      etcd:
        local:
          dataDir: /var/lib/etcddisk/etcd
          extraArgs:
            quota-backend-bytes: "8589934592"
    diskSetup:
      filesystems:
      - device: /dev/disk/azure/scsi1/lun0
        extraOpts:
        - -E
        - lazy_itable_init=1,lazy_journal_init=1
        filesystem: ext4
        label: etcd_disk
      - device: ephemeral0.1
        filesystem: ext4
        label: ephemeral0
        replaceFS: ntfs
      partitions:
      - device: /dev/disk/azure/scsi1/lun0
        layout: true
        overwrite: false
        tableType: gpt
    files:
    - contentFrom:
        secret:
          key: control-plane-azure.json
          name: capz-acr-cluster-workload-2-control-plane-azure-json
      owner: root:root
      path: /etc/kubernetes/azure.json
      permissions: "0644"
    initConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          azure-container-registry-config: /etc/kubernetes/azure.json
          cloud-config: /etc/kubernetes/azure.json
          cloud-provider: azure
        name: '{{ ds.meta_data["local_hostname"] }}'
    joinConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          azure-container-registry-config: /etc/kubernetes/azure.json
          cloud-config: /etc/kubernetes/azure.json
          cloud-provider: azure
        name: '{{ ds.meta_data["local_hostname"] }}'
    mounts:
    - - LABEL=etcd_disk
      - /var/lib/etcddisk
    postKubeadmCommands: []
    preKubeadmCommands: []
  machineTemplate:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: AzureMachineTemplate
      name: capz-acr-cluster-workload-2-control-plane
  replicas: 1
  version: v1.25.0
---

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureMachineTemplate
metadata:
  name: capz-acr-cluster-workload-2-control-plane
  namespace: default
spec:
  template:
    spec:
      identity: UserAssigned
      dataDisks:
      - diskSizeGB: 256
        lun: 0
        nameSuffix: etcddisk
      osDisk:
        diskSizeGB: 128
        osType: Linux
      sshPublicKey: ""
      userAssignedIdentities:
      - providerID: dogfood5-acr-custom-script-identity
      vmSize: Standard_D2s_v3
---

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: capz-acr-cluster-workload-2-md-0
  namespace: default
spec:
  clusterName: capz-acr-cluster-workload-2
  replicas: 1
  selector:
    matchLabels: null
  template:
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: capz-acr-cluster-workload-2-md-0
      clusterName: capz-acr-cluster-workload-2
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: AzureMachineTemplate
        name: capz-acr-cluster-workload-2-md-0
      version: v1.25.0
---

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureMachineTemplate
metadata:
  name: capz-acr-cluster-workload-2-md-0
  namespace: default
spec:
  template:
    spec:
      identity: UserAssigned
      osDisk:
        diskSizeGB: 128
        osType: Linux
      sshPublicKey: ""
      userAssignedIdentities:
      - providerID: dogfood5-acr-custom-script-identity
      vmSize: Standard_D2s_v3
---

apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
metadata:
  name: capz-acr-cluster-workload-2-md-0
  namespace: default
spec:
  template:
    spec:
      files:
      - contentFrom:
          secret:
            key: worker-node-azure.json
            name: capz-acr-cluster-workload-2-md-0-azure-json
        owner: root:root
        path: /etc/kubernetes/azure.json
        permissions: "0644"
      joinConfiguration:
        nodeRegistration:
          kubeletExtraArgs:
            azure-container-registry-config: /etc/kubernetes/azure.json
            cloud-config: /etc/kubernetes/azure.json
            cloud-provider: azure
          name: '{{ ds.meta_data["local_hostname"] }}'
      preKubeadmCommands: []
---

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureClusterIdentity
metadata:
  labels:
    clusterctl.cluster.x-k8s.io/move-hierarchy: "true"
  name: dogfood5-acr-custom-script-identity
  namespace: default
spec:
  allowedNamespaces: {}
  clientID: cfa59eda-e284-4d05-9582-c540d1379376
  resourceID: "dogfood5-acr-custom-script-identity"
  tenantID: 33e01921-4d64-4f8c-a055-5bdaffd5e33d
  type: UserAssignedMSI

karansinghneu commented 1 year ago

Update: it's most likely the custom FQDN causing the issue. I tried spinning up a cluster by just mounting the certificates as secrets, without the custom FQDN, and everything works fine; as soon as I put in the custom FQDN, things start to fail. Still investigating!

Further investigation: mounting the CA certs as secrets and providing a custom FQDN results in one worker node being unable to join the cluster; everything else comes up normally. When I spin up a workload cluster with 3 control plane nodes and 3 worker nodes, 2 worker nodes come up and 1 doesn't, while the MachineDeployment gets stuck in the WaitingForAvailableMachines state. Similarly, when I spin up a workload cluster with 1 control plane node and 1 worker node, the worker node fails to come up. NOTE: the worker VMs are created successfully; they just fail to join as nodes.

karansinghneu commented 1 year ago

@CecileRobertMichon I think I have reached a point where I am now intermittently hitting this: https://github.com/kubernetes-sigs/cluster-api/issues/6029

CecileRobertMichon commented 1 year ago

@karansinghneu did you ever figure this one out? Is there anything that needs to be fixed in CAPZ and/or CAPI?

karansinghneu commented 1 year ago

As far as I recall, it was a minor mistake on my end: I used an incorrect region name in the subdomain of the FQDN field. I should have closed this earlier, sorry about that.
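For anyone hitting the same symptom: Azure `cloudapp.azure.com` FQDNs embed the region as the label right after the DNS name, so a quick sanity check is to compare that label against the AzureCluster's `location`. A hypothetical check (not something CAPZ validates for you):

```shell
# The cloudapp FQDN has the form <dnsName>.<region>.cloudapp.azure.com;
# the <region> label must match AzureCluster spec.location.
fqdn="capz-acr-cluster-workload-2-pdns.australiaeast.cloudapp.azure.com"
location="australiaeast"

region="${fqdn#*.}"       # strip the leading dnsName label
region="${region%%.*}"    # keep only the region label
if [ "$region" = "$location" ]; then
  echo "region in FQDN matches cluster location"
else
  echo "mismatch: FQDN says '$region', cluster says '$location'" >&2
fi
```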