metal3-io / metal3-dev-env

Metal³ Development Environment
Apache License 2.0

cluster provisioning stuck #895

Closed faisalchishtii closed 2 years ago

faisalchishtii commented 2 years ago

I ran the make command and it completed successfully.

Then I tried to provision a cluster using the scripts below:

./scripts/provision/cluster.sh
./scripts/provision/controlplane.sh
./scripts/provision/worker.sh

These completed too, but the two machines are stuck in the Pending and Provisioning phases.

$ kubectl get machines -A

NAMESPACE   NAME                     CLUSTER   NODENAME   PROVIDERID   PHASE          AGE   VERSION
metal3      test1-6d8cc5965f-jvb69   test1                             Pending        24m   v1.22.3
metal3      test1-8744d              test1                             Provisioning   31m   v1.22.3

$ kubectl describe machine test1-8744d -n metal3

Name:         test1-8744d
Namespace:    metal3
Labels:       cluster.x-k8s.io/cluster-name=test1
              cluster.x-k8s.io/control-plane=
Annotations:  controlplane.cluster.x-k8s.io/kubeadm-cluster-configuration:
                {"etcd":{},"networking":{},"apiServer":{},"controllerManager":{},"scheduler":{},"dns":{}}
API Version:  cluster.x-k8s.io/v1beta1
Kind:         Machine
Metadata:
  Creation Timestamp:  2022-01-04T12:02:27Z
  Finalizers:
    machine.cluster.x-k8s.io
  Generation:  2
  Managed Fields:
    API Version:  cluster.x-k8s.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:controlplane.cluster.x-k8s.io/kubeadm-cluster-configuration:
        f:finalizers:
          .:
          v:"machine.cluster.x-k8s.io":
        f:labels:
          .:
          f:cluster.x-k8s.io/cluster-name:
          f:cluster.x-k8s.io/control-plane:
        f:ownerReferences:
          .:
          k:{"uid":"bbff0b47-e940-4774-a65b-4dcfd790f970"}:
      f:spec:
        .:
        f:bootstrap:
          .:
          f:configRef:
            .:
            f:apiVersion:
            f:kind:
            f:name:
            f:namespace:
            f:uid:
          f:dataSecretName:
        f:clusterName:
        f:infrastructureRef:
          .:
          f:apiVersion:
          f:kind:
          f:name:
          f:namespace:
          f:uid:
        f:nodeDrainTimeout:
        f:version:
    Manager:      manager
    Operation:    Update
    Time:         2022-01-04T12:02:28Z
    API Version:  cluster.x-k8s.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:bootstrapReady:
        f:conditions:
        f:lastUpdated:
        f:observedGeneration:
        f:phase:
    Manager:      manager
    Operation:    Update
    Subresource:  status
    Time:         2022-01-04T12:02:28Z
  Owner References:
    API Version:           controlplane.cluster.x-k8s.io/v1beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  KubeadmControlPlane
    Name:                  test1
    UID:                   bbff0b47-e940-4774-a65b-4dcfd790f970
  Resource Version:        24627
  UID:                     92b7feaa-8f4f-4b6e-a31f-27ccbe4af037
Spec:
  Bootstrap:
    Config Ref:
      API Version:     bootstrap.cluster.x-k8s.io/v1beta1
      Kind:            KubeadmConfig
      Name:            test1-h8dvr
      Namespace:       metal3
      UID:             831ba605-4d5a-4f81-ae43-b732c08b0f6d
    Data Secret Name:  test1-h8dvr
  Cluster Name:        test1
  Infrastructure Ref:
    API Version:  infrastructure.cluster.x-k8s.io/v1beta1
    Kind:         Metal3Machine
    Name:         test1-controlplane-d7hrx
    Namespace:    metal3
    UID:          de3cae1f-698d-4364-9c5b-54e1f3de1cc5
  Node Drain Timeout:  0s
  Version:             v1.22.3
Status:
  Bootstrap Ready:  true
  Conditions:
    Last Transition Time:  2022-01-04T12:05:14Z
    Message:               1 of 2 completed
    Reason:                DrainingFailed
    Severity:              Error
    Status:                False
    Type:                  Ready
    Last Transition Time:  2022-01-04T12:02:28Z
    Status:                True
    Type:                  BootstrapReady
    Last Transition Time:  2022-01-04T12:05:14Z
    Message:               requeue in: 30s
    Reason:                DrainingFailed
    Severity:              Error
    Status:                False
    Type:                  InfrastructureReady
    Last Transition Time:  2022-01-04T12:02:28Z
    Reason:                WaitingForNodeRef
    Severity:              Info
    Status:                False
    Type:                  NodeHealthy
  Last Updated:            2022-01-04T12:02:28Z
  Observed Generation:     2
  Phase:                   Provisioning
Events:

$ kubectl describe machine test1-6d8cc5965f-jvb69 -n metal3

Name:         test1-6d8cc5965f-jvb69
Namespace:    metal3
Labels:       cluster.x-k8s.io/cluster-name=test1
              machine-template-hash=2847715219
              nodepool=nodepool-0
Annotations:
API Version:  cluster.x-k8s.io/v1beta1
Kind:         Machine
Metadata:
  Creation Timestamp:  2022-01-04T12:09:32Z
  Finalizers:
    machine.cluster.x-k8s.io
  Generate Name:  test1-6d8cc5965f-
  Generation:     1
  Managed Fields:
    API Version:  cluster.x-k8s.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"machine.cluster.x-k8s.io":
        f:generateName:
        f:labels:
          .:
          f:cluster.x-k8s.io/cluster-name:
          f:machine-template-hash:
          f:nodepool:
        f:ownerReferences:
          .:
          k:{"uid":"775224f3-b89f-4020-b6e4-c90a272d8836"}:
      f:spec:
        .:
        f:bootstrap:
          .:
          f:configRef:
            .:
            f:apiVersion:
            f:kind:
            f:name:
            f:namespace:
            f:uid:
        f:clusterName:
        f:infrastructureRef:
          .:
          f:apiVersion:
          f:kind:
          f:name:
          f:namespace:
          f:uid:
        f:nodeDrainTimeout:
        f:version:
    Manager:      manager
    Operation:    Update
    Time:         2022-01-04T12:09:32Z
    API Version:  cluster.x-k8s.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:conditions:
        f:lastUpdated:
        f:observedGeneration:
        f:phase:
    Manager:      manager
    Operation:    Update
    Subresource:  status
    Time:         2022-01-04T12:09:32Z
  Owner References:
    API Version:           cluster.x-k8s.io/v1beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  MachineSet
    Name:                  test1-6d8cc5965f
    UID:                   775224f3-b89f-4020-b6e4-c90a272d8836
  Resource Version:        26265
  UID:                     be3cdf3c-86e8-47a2-9901-ca4b787da5f0
Spec:
  Bootstrap:
    Config Ref:
      API Version:  bootstrap.cluster.x-k8s.io/v1beta1
      Kind:         KubeadmConfig
      Name:         test1-workers-4f2wq
      Namespace:    metal3
      UID:          7c26f42c-7201-43a5-b5ea-2ae117edfdb3
  Cluster Name:     test1
  Infrastructure Ref:
    API Version:  infrastructure.cluster.x-k8s.io/v1beta1
    Kind:         Metal3Machine
    Name:         test1-workers-7lf8s
    Namespace:    metal3
    UID:          b909dd29-1a0e-4e04-b331-86af822f8e16
  Node Drain Timeout:  0s
  Version:             v1.22.3
Status:
  Conditions:
    Last Transition Time:  2022-01-04T12:09:32Z
    Message:               0 of 2 completed
    Reason:                WaitingForBootstrapReady
    Severity:              Info
    Status:                False
    Type:                  Ready
    Last Transition Time:  2022-01-04T12:09:32Z
    Reason:                WaitingForControlPlaneAvailable
    Severity:              Info
    Status:                False
    Type:                  BootstrapReady
    Last Transition Time:  2022-01-04T12:09:32Z
    Reason:                WaitingForBootstrapReady
    Severity:              Info
    Status:                False
    Type:                  InfrastructureReady
    Last Transition Time:  2022-01-04T12:09:32Z
    Reason:                WaitingForNodeRef
    Severity:              Info
    Status:                False
    Type:                  NodeHealthy
  Last Updated:            2022-01-04T12:09:32Z
  Observed Generation:     1
  Phase:                   Pending
Events:

furkatgofurov7 commented 2 years ago

@faisalchishtii hi!

Thanks for reporting. I see a "DrainingFailed" error in the logs you provided, but that could happen for different reasons. One possible sequence:

  1. Provisioning of the control plane and worker is triggered.
  2. While they are still provisioning, the deprovisioning scripts (under ./scripts/deprovision) are run without giving the machines enough time to provision, and the cluster is left without being fully deprovisioned.
  3. The same resources are provisioned again.

What I suggest is to run the deprovisioning scripts in the reverse order of the provisioning scripts (deprovision/worker.sh, deprovision/controlplane.sh, then deprovision/cluster.sh), make sure all resources are gone, and then run the provisioning scripts again in the order you mentioned above. Provisioning the control plane and the worker usually takes some time, which is why you see one of them in Provisioning (the control plane) and the other in Pending (the worker); after roughly 5-7 minutes, both should become Provisioned.

If that does not help, please attach the output of listing all pods/resources (kubectl get pods/bmh/m3m/machine -A) and some more logs from the capi-controller-manager and capm3-controller-manager pods.
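The reset flow above can be sketched roughly as follows (the script paths are assumed from the metal3-dev-env layout; each call is guarded with an executable check so the sketch is a no-op where a script is absent):

```shell
#!/usr/bin/env bash
# Order in which the provisioning scripts are normally run.
provision_order=(cluster controlplane worker)

# Deprovision in the reverse order of provisioning: worker, controlplane, cluster.
for ((i = ${#provision_order[@]} - 1; i >= 0; i--)); do
  script="./scripts/deprovision/${provision_order[i]}.sh"
  if [ -x "$script" ]; then "$script"; fi
done

# At this point, wait until the old resources are actually gone
# (e.g. check with: kubectl get machines,bmh -A) before re-provisioning.

# Provision again in the original order: cluster, controlplane, worker.
for step in "${provision_order[@]}"; do
  script="./scripts/provision/${step}.sh"
  if [ -x "$script" ]; then "$script"; fi
done
```

Once the old Machine and Metal3Machine objects are gone, re-running the provisioning scripts in the original order should leave both machines Provisioned after a few minutes.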

furkatgofurov7 commented 2 years ago

priority/awaiting-more-evidence

faisalchishtii commented 2 years ago

@furkatgofurov7 Thanks for the quick response. I tried the above, but I am still facing the same issue.

Listed all machines below

$ kubectl get machines -A

NAMESPACE   NAME                     CLUSTER   NODENAME   PROVIDERID   PHASE          AGE   VERSION
metal3      test1-6d8cc5965f-zstvg   test1                             Pending        22m   v1.22.3
metal3      test1-rbbds              test1                             Provisioning   23m   v1.22.3

Listed all pods below

$ kubectl get pods -A

NAMESPACE                           NAME                                                            READY   STATUS    RESTARTS   AGE
baremetal-operator-system           baremetal-operator-controller-manager-545c6f8596-vxprq          2/2     Running   0          5h30m
capi-kubeadm-bootstrap-system       capi-kubeadm-bootstrap-controller-manager-6799c7bd56-lttvc      1/1     Running   0          5h31m
capi-kubeadm-control-plane-system   capi-kubeadm-control-plane-controller-manager-8d465b66d-l26rw   1/1     Running   0          5h31m
capi-system                         capi-controller-manager-69d4577477-2zjg4                        1/1     Running   0          5h31m
capm3-system                        capm3-controller-manager-dbdc85bbf-5phc5                        1/1     Running   0          5h30m
capm3-system                        ipam-controller-manager-68bb9b98b4-jrshs                        1/1     Running   0          5h30m
cert-manager                        cert-manager-848f547974-9hwkc                                   1/1     Running   0          5h31m
cert-manager                        cert-manager-cainjector-54f4cc6b5-jmtnd                         1/1     Running   0          5h31m
cert-manager                        cert-manager-webhook-7c9588c76-dtcwq                            1/1     Running   0          5h31m
kube-system                         coredns-78fcd69978-7hdt7                                        1/1     Running   0          5h31m
kube-system                         coredns-78fcd69978-jblmx                                        1/1     Running   0          5h31m
kube-system                         etcd-kind-control-plane                                         1/1     Running   0          5h32m
kube-system                         kindnet-vb47l                                                   1/1     Running   0          5h31m
kube-system                         kube-apiserver-kind-control-plane                               1/1     Running   0          5h32m
kube-system                         kube-controller-manager-kind-control-plane                      1/1     Running   0          5h32m
kube-system                         kube-proxy-6f9sl                                                1/1     Running   0          5h31m
kube-system                         kube-scheduler-kind-control-plane                               1/1     Running   0          5h32m
local-path-storage                  local-path-provisioner-85494db59d-jcxzw                         1/1     Running   0          5h31m

Listed all bmh below

$ kubectl get bmh -A

NAMESPACE   NAME     STATE         CONSUMER                   ONLINE   ERROR   AGE
metal3      node-0   available                                true             5h31m
metal3      node-1   available                                true             5h31m
metal3      node-2   available                                false            5h31m
metal3      node-3   available                                true             5h31m
metal3      node-4   available                                true             5h31m
metal3      node-5   available                                true             5h31m
metal3      node-6   available                                true             5h31m
metal3      node-7   provisioned   test1-controlplane-rrspl   true             5h31m
metal3      node-8   available                                true             5h31m

Listed all m3m below

$ kubectl get m3m -A

NAMESPACE   NAME                       AGE   PROVIDERID   READY   CLUSTER   PHASE
metal3      test1-controlplane-rrspl   43m                        test1     
metal3      test1-workers-jwpzx        43m                        test1

See pod logs below.

$ kubectl logs baremetal-operator-controller-manager-545c6f8596-vxprq manager -n baremetal-operator-system

{"level":"info","ts":1641314695.1507816,"logger":"provisioner.ironic","msg":"retrieved BIOS settings for node","host":"metal3~node-3","node":"bd7886b8-8aaf-4b28-881a-52bd2db0e3c4","size":10}
{"level":"info","ts":1641314695.1508417,"logger":"controllers.HostFirmwareSettings","msg":"getting firmwareSchema","hostfirmwaresettings":"metal3/node-3"}
{"level":"info","ts":1641314695.1509423,"logger":"controllers.HostFirmwareSettings","msg":"found existing firmwareSchema resource","hostfirmwaresettings":"metal3/node-3"}
{"level":"info","ts":1641314695.1697648,"logger":"controllers.HostFirmwareSettings","msg":"start","hostfirmwaresettings":"metal3/node-5"}
{"level":"info","ts":1641314695.2236907,"logger":"controllers.HostFirmwareSettings","msg":"retrieving firmware settings and saving to resource","hostfirmwaresettings":"metal3/node-5","node":"a462ad6c-3821-4d87-95b6-42ee7e93e0f4"}
{"level":"info","ts":1641314695.3128333,"logger":"provisioner.ironic","msg":"retrieved BIOS settings for node","host":"metal3~node-5","node":"a462ad6c-3821-4d87-95b6-42ee7e93e0f4","size":10}
{"level":"info","ts":1641314695.3128867,"logger":"controllers.HostFirmwareSettings","msg":"getting firmwareSchema","hostfirmwaresettings":"metal3/node-5"}
{"level":"info","ts":1641314695.3129978,"logger":"controllers.HostFirmwareSettings","msg":"found existing firmwareSchema resource","hostfirmwaresettings":"metal3/node-5"}
{"level":"info","ts":1641314695.3274188,"logger":"controllers.HostFirmwareSettings","msg":"start","hostfirmwaresettings":"metal3/node-0"}
{"level":"info","ts":1641314695.3775115,"logger":"controllers.HostFirmwareSettings","msg":"retrieving firmware settings and saving to resource","hostfirmwaresettings":"metal3/node-0","node":"db03a7f8-91a7-486c-8508-11d7187913e8"}
{"level":"info","ts":1641314695.452342,"logger":"provisioner.ironic","msg":"retrieved BIOS settings for node","host":"metal3~node-0","node":"db03a7f8-91a7-486c-8508-11d7187913e8","size":0}
{"level":"info","ts":1641314695.452386,"logger":"controllers.HostFirmwareSettings","msg":"getting firmwareSchema","hostfirmwaresettings":"metal3/node-0"}
{"level":"info","ts":1641314695.452437,"logger":"controllers.HostFirmwareSettings","msg":"found existing firmwareSchema resource","hostfirmwaresettings":"metal3/node-0"}
{"level":"info","ts":1641314695.4621296,"logger":"controllers.HostFirmwareSettings","msg":"start","hostfirmwaresettings":"metal3/node-6"}
{"level":"info","ts":1641314695.5159986,"logger":"controllers.HostFirmwareSettings","msg":"retrieving firmware settings and saving to resource","hostfirmwaresettings":"metal3/node-6","node":"02da32c3-5300-4657-94ac-24af0a387df2"}
{"level":"info","ts":1641314695.5853393,"logger":"provisioner.ironic","msg":"retrieved BIOS settings for node","host":"metal3~node-6","node":"02da32c3-5300-4657-94ac-24af0a387df2","size":0}
{"level":"info","ts":1641314695.5853763,"logger":"controllers.HostFirmwareSettings","msg":"getting firmwareSchema","hostfirmwaresettings":"metal3/node-6"}
{"level":"info","ts":1641314695.5855024,"logger":"controllers.HostFirmwareSettings","msg":"found existing firmwareSchema resource","hostfirmwaresettings":"metal3/node-6"}
{"level":"info","ts":1641314695.6228101,"logger":"controllers.HostFirmwareSettings","msg":"start","hostfirmwaresettings":"metal3/node-2"}

$ kubectl logs capi-kubeadm-bootstrap-controller-manager-6799c7bd56-lttvc -n capi-kubeadm-bootstrap-system

I0104 11:28:13.321993 1 request.go:665] Waited for 1.027918033s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/cert-manager.io/v1alpha3?timeout=32s
I0104 11:28:13.525732 1 logr.go:249] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"="localhost:8080"
I0104 11:28:13.527293 1 logr.go:249] controller-runtime/builder "msg"="skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called" "GVK"={"Group":"bootstrap.cluster.x-k8s.io","Version":"v1beta1","Kind":"KubeadmConfig"}
I0104 11:28:13.527349 1 logr.go:249] controller-runtime/builder "msg"="Registering a validating webhook" "GVK"={"Group":"bootstrap.cluster.x-k8s.io","Version":"v1beta1","Kind":"KubeadmConfig"} "path"="/validate-bootstrap-cluster-x-k8s-io-v1beta1-kubeadmconfig"
I0104 11:28:13.527501 1 server.go:146] controller-runtime/webhook "msg"="Registering webhook" "path"="/validate-bootstrap-cluster-x-k8s-io-v1beta1-kubeadmconfig"
I0104 11:28:13.527735 1 server.go:146] controller-runtime/webhook "msg"="Registering webhook" "path"="/convert"
I0104 11:28:13.527868 1 logr.go:249] controller-runtime/builder "msg"="Conversion webhook enabled" "GVK"={"Group":"bootstrap.cluster.x-k8s.io","Version":"v1beta1","Kind":"KubeadmConfig"}
I0104 11:28:13.527905 1 logr.go:249] controller-runtime/builder "msg"="skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called" "GVK"={"Group":"bootstrap.cluster.x-k8s.io","Version":"v1beta1","Kind":"KubeadmConfigTemplate"}
I0104 11:28:13.527943 1 logr.go:249] controller-runtime/builder "msg"="Registering a validating webhook" "GVK"={"Group":"bootstrap.cluster.x-k8s.io","Version":"v1beta1","Kind":"KubeadmConfigTemplate"} "path"="/validate-bootstrap-cluster-x-k8s-io-v1beta1-kubeadmconfigtemplate"
I0104 11:28:13.528028 1 server.go:146] controller-runtime/webhook "msg"="Registering webhook" "path"="/validate-bootstrap-cluster-x-k8s-io-v1beta1-kubeadmconfigtemplate"
I0104 11:28:13.528153 1 logr.go:249] controller-runtime/builder "msg"="Conversion webhook enabled" "GVK"={"Group":"bootstrap.cluster.x-k8s.io","Version":"v1beta1","Kind":"KubeadmConfigTemplate"}
I0104 11:28:13.528288 1 logr.go:249] setup "msg"="starting manager" "version"=""
I0104 11:28:13.528404 1 server.go:214] controller-runtime/webhook/webhooks "msg"="Starting webhook server"
I0104 11:28:13.528606 1 internal.go:362] "msg"="Starting server" "addr"={"IP":"::","Port":9440,"Zone":""} "kind"="health probe"
I0104 11:28:13.528631 1 leaderelection.go:248] attempting to acquire leader lease capi-kubeadm-bootstrap-system/kubeadm-bootstrap-manager-leader-election-capi...
I0104 11:28:13.528760 1 logr.go:249] controller-runtime/certwatcher "msg"="Updated current TLS certificate"
I0104 11:28:13.528760 1 internal.go:362] "msg"="Starting server" "addr"={"IP":"127.0.0.1","Port":8080,"Zone":""} "kind"="metrics" "path"="/metrics"
I0104 11:28:13.528958 1 logr.go:249] controller-runtime/webhook "msg"="Serving webhook server" "host"="" "port"=9443
I0104 11:28:13.529020 1 logr.go:249] controller-runtime/certwatcher "msg"="Starting certificate watcher"
I0104 11:28:13.581641 1 leaderelection.go:258] successfully acquired lease capi-kubeadm-bootstrap-system/kubeadm-bootstrap-manager-leader-election-capi
I0104 11:28:13.581907 1 controller.go:178] controller/kubeadmconfig "msg"="Starting EventSource" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "source"="kind source: v1beta1.KubeadmConfig"
I0104 11:28:13.581959 1 controller.go:178] controller/kubeadmconfig "msg"="Starting EventSource" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "source"="kind source: v1beta1.Machine"
I0104 11:28:13.582000 1 controller.go:178] controller/kubeadmconfig "msg"="Starting EventSource" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "source"="kind source: *v1beta1.Cluster"
I0104 11:28:13.582028 1 controller.go:186] controller/kubeadmconfig "msg"="Starting Controller" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig"
I0104 11:28:13.683027 1 controller.go:220] controller/kubeadmconfig "msg"="Starting workers" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "worker count"=10
I0104 12:02:28.034490 1 control_plane_init_mutex.go:99] init-locker "msg"="Attempting to acquire the lock" "cluster-name"="test1" "configmap-name"="test1-lock" "machine-name"="test1-8744d" "namespace"="metal3"
I0104 12:02:28.037992 1 kubeadmconfig_controller.go:380] controller/kubeadmconfig "msg"="Creating BootstrapData for the init control plane" "kind"="Machine" "name"="test1-8744d" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="23492"
I0104 12:02:28.038813 1 kubeadmconfig_controller.go:872] controller/kubeadmconfig "msg"="Altering ClusterConfiguration" "name"="test1-h8dvr" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "ControlPlaneEndpoint"="192.168.111.249:6443"
I0104 12:02:28.038857 1 kubeadmconfig_controller.go:878] controller/kubeadmconfig "msg"="Altering ClusterConfiguration" "name"="test1-h8dvr" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "ClusterName"="test1"
I0104 12:02:28.038885 1 kubeadmconfig_controller.go:891] controller/kubeadmconfig "msg"="Altering ClusterConfiguration" "name"="test1-h8dvr" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "ServiceSubnet"="10.96.0.0/12"
I0104 12:02:28.038916 1 kubeadmconfig_controller.go:897] controller/kubeadmconfig "msg"="Altering ClusterConfiguration" "name"="test1-h8dvr" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "PodSubnet"="192.168.0.0/18"
I0104 12:02:28.038946 1 kubeadmconfig_controller.go:904] controller/kubeadmconfig "msg"="Altering ClusterConfiguration" "name"="test1-h8dvr" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "KubernetesVersion"="v1.22.3"
I0104 12:02:28.379651 1 kubeadmconfig_controller.go:380] controller/kubeadmconfig "msg"="Creating BootstrapData for the init control plane" "kind"="Machine" "name"="test1-8744d" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="23505"
I0104 12:02:28.518949 1 kubeadmconfig_controller.go:943] controller/kubeadmconfig "msg"="bootstrap data secret for KubeadmConfig already exists, updating" "name"="test1-h8dvr" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "KubeadmConfig"="test1-h8dvr" "secret"="test1-h8dvr"
I0104 16:18:32.660485 1 control_plane_init_mutex.go:99] init-locker "msg"="Attempting to acquire the lock" "cluster-name"="test1" "configmap-name"="test1-lock" "machine-name"="test1-rbbds" "namespace"="metal3"
I0104 16:18:32.683273 1 kubeadmconfig_controller.go:380] controller/kubeadmconfig "msg"="Creating BootstrapData for the init control plane" "kind"="Machine" "name"="test1-rbbds" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="190942"
I0104 16:18:32.684792 1 kubeadmconfig_controller.go:872] controller/kubeadmconfig "msg"="Altering ClusterConfiguration" "name"="test1-7cb7s" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "ControlPlaneEndpoint"="192.168.111.249:6443"
I0104 16:18:32.684834 1 kubeadmconfig_controller.go:878] controller/kubeadmconfig "msg"="Altering ClusterConfiguration" "name"="test1-7cb7s" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "ClusterName"="test1"
I0104 16:18:32.684865 1 kubeadmconfig_controller.go:891] controller/kubeadmconfig "msg"="Altering ClusterConfiguration" "name"="test1-7cb7s" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "ServiceSubnet"="10.96.0.0/12"
I0104 16:18:32.685024 1 kubeadmconfig_controller.go:897] controller/kubeadmconfig "msg"="Altering ClusterConfiguration" "name"="test1-7cb7s" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "PodSubnet"="192.168.0.0/18"
I0104 16:18:32.685059 1 kubeadmconfig_controller.go:904] controller/kubeadmconfig "msg"="Altering ClusterConfiguration" "name"="test1-7cb7s" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "KubernetesVersion"="v1.22.3"
I0104 16:18:33.223154 1 kubeadmconfig_controller.go:380] controller/kubeadmconfig "msg"="Creating BootstrapData for the init control plane" "kind"="Machine" "name"="test1-rbbds" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="190958"
I0104 16:18:33.390217 1 kubeadmconfig_controller.go:943] controller/kubeadmconfig "msg"="bootstrap data secret for KubeadmConfig already exists, updating" "name"="test1-7cb7s" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "KubeadmConfig"="test1-7cb7s" "secret"="test1-7cb7s"

$ kubectl logs -n capi-kubeadm-control-plane-system capi-kubeadm-control-plane-controller-manager-8d465b66d-l26rw

E0104 16:40:28.527314 1 controller.go:317] controller/kubeadmcontrolplane "msg"="Reconciler error" "error"="failed to create remote cluster client: error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"metal3/test1\": Get \"https://192.168.111.249:6443/api?timeout=10s\": dial tcp 192.168.111.249:6443: connect: no route to host" "name"="test1" "namespace"="metal3" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"
I0104 16:48:09.407426 1 controller.go:251] controller/kubeadmcontrolplane "msg"="Reconcile KubeadmControlPlane" "cluster"="test1" "name"="test1" "namespace"="metal3" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"
E0104 16:48:13.900424 1 controller.go:188] controller/kubeadmcontrolplane "msg"="Failed to update KubeadmControlPlane Status" "error"="failed to create remote cluster client: error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"metal3/test1\": Get \"https://192.168.111.249:6443/api?timeout=10s\": dial tcp 192.168.111.249:6443: connect: no route to host" "cluster"="test1" "name"="test1" "namespace"="metal3" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"
E0104 16:48:13.905605 1 controller.go:317] controller/kubeadmcontrolplane "msg"="Reconciler error" "error"="failed to create remote cluster client: error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"metal3/test1\": Get \"https://192.168.111.249:6443/api?timeout=10s\": dial tcp 192.168.111.249:6443: connect: no route to host" "name"="test1" "namespace"="metal3" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"
I0104 16:49:32.714146 1 controller.go:251] controller/kubeadmcontrolplane "msg"="Reconcile KubeadmControlPlane" "cluster"="test1" "name"="test1" "namespace"="metal3" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"
E0104 16:49:38.952392 1 controller.go:188] controller/kubeadmcontrolplane "msg"="Failed to update KubeadmControlPlane Status" "error"="failed to create remote cluster client: error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"metal3/test1\": Get \"https://192.168.111.249:6443/api?timeout=10s\": dial tcp 192.168.111.249:6443: connect: no route to host" "cluster"="test1" "name"="test1" "namespace"="metal3" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"
E0104 16:49:38.955223 1 controller.go:317] controller/kubeadmcontrolplane "msg"="Reconciler error" "error"="failed to create remote cluster client: error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"metal3/test1\": Get \"https://192.168.111.249:6443/api?timeout=10s\": dial tcp 192.168.111.249:6443: connect: no route to host" "name"="test1" "namespace"="metal3" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"

$ kubectl logs -n capi-system capi-controller-manager-69d4577477-2zjg4

I0104 16:51:19.098737 1 machine_controller_phases.go:221] controller/machine "msg"="Bootstrap provider is not ready, requeuing" "cluster"="test1" "name"="test1-6d8cc5965f-zstvg" "namespace"="metal3" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
I0104 16:51:19.104282 1 machine_controller_phases.go:283] controller/machine "msg"="Infrastructure provider is not ready, requeuing" "cluster"="test1" "name"="test1-6d8cc5965f-zstvg" "namespace"="metal3" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
I0104 16:51:19.104348 1 machine_controller_noderef.go:49] controller/machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "cluster"="test1" "machine"="test1-6d8cc5965f-zstvg" "name"="test1-6d8cc5965f-zstvg" "namespace"="metal3" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
I0104 16:51:41.990234 1 machine_controller_phases.go:283] controller/machine "msg"="Infrastructure provider is not ready, requeuing" "cluster"="test1" "name"="test1-rbbds" "namespace"="metal3" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
I0104 16:51:41.990308 1 machine_controller_noderef.go:49] controller/machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "cluster"="test1" "machine"="test1-rbbds" "name"="test1-rbbds" "namespace"="metal3" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
I0104 16:51:49.135914 1 machine_controller_phases.go:221] controller/machine "msg"="Bootstrap provider is not ready, requeuing" "cluster"="test1" "name"="test1-6d8cc5965f-zstvg" "namespace"="metal3" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
I0104 16:51:49.154411 1 machine_controller_phases.go:283] controller/machine "msg"="Infrastructure provider is not ready, requeuing" "cluster"="test1" "name"="test1-6d8cc5965f-zstvg" "namespace"="metal3" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
I0104 16:51:49.154478 1 machine_controller_noderef.go:49] controller/machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "cluster"="test1" "machine"="test1-6d8cc5965f-zstvg" "name"="test1-6d8cc5965f-zstvg" "namespace"="metal3" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"

$ kubectl logs -n capm3-system capm3-controller-manager-dbdc85bbf-5phc5

I0104 16:52:24.776285 1 metal3machine_manager.go:1702] controllers/Metal3Machine/Metal3Machine-controller "msg"="error while retrieving nodes with label (metal3.io/uuid=93da9a56-c483-48c3-bdb9-bd8d206c80bc): Get \"https://192.168.111.249:6443/api/v1/nodes?labelSelector=metal3.io%2Fuuid%3D93da9a56-c483-48c3-bdb9-bd8d206c80bc\": dial tcp 192.168.111.249:6443: connect: no route to host" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:52:24.776336 1 metal3machine_manager.go:1295] controllers/Metal3Machine/Metal3Machine-controller "msg"="error retrieving node, requeuing" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:52:31.869251 1 metal3machine_manager.go:682] controllers/Metal3Machine/Metal3Machine-controller "msg"="Updating machine" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:52:31.869835 1 metal3machine_manager.go:1134] controllers/Metal3Machine/Metal3Machine-controller "msg"="Deleting nodeReuseLabelName from host, if any" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:52:31.872542 1 metal3machine_manager.go:741] controllers/Metal3Machine/Metal3Machine-controller "msg"="Finished updating machine" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:52:33.785220 1 metal3labelsync_controller.go:142] controllers/Metal3LabelSync/metal3-label-sync-controller "msg"="Could not find Node Ref on Machine object, will retry" "metal3-label-sync"={"Namespace":"metal3","Name":"node-7"}
I0104 16:52:34.952333 1 metal3machine_manager.go:1702] controllers/Metal3Machine/Metal3Machine-controller "msg"="error while retrieving nodes with label (metal3.io/uuid=93da9a56-c483-48c3-bdb9-bd8d206c80bc): Get \"https://192.168.111.249:6443/api/v1/nodes?labelSelector=metal3.io%2Fuuid%3D93da9a56-c483-48c3-bdb9-bd8d206c80bc\": dial tcp 192.168.111.249:6443: connect: no route to host" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:52:34.952393 1 metal3machine_manager.go:1295] controllers/Metal3Machine/Metal3Machine-controller "msg"="error retrieving node, requeuing" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:52:59.272015 1 metal3machinetemplate_manager.go:65] controllers/Metal3MachineTemplate/Metal3MachineTemplate-controller "msg"="Fetching metal3Machine objects" "metal3-machine-template"={"Namespace":"metal3","Name":"test1-controlplane"}
I0104 16:52:59.272818 1 metal3machinetemplate_manager.go:65] controllers/Metal3MachineTemplate/Metal3MachineTemplate-controller "msg"="Fetching metal3Machine objects" "metal3-machine-template"={"Namespace":"metal3","Name":"test1-workers"}
I0104 16:53:03.788341 1 metal3labelsync_controller.go:142] controllers/Metal3LabelSync/metal3-label-sync-controller "msg"="Could not find Node Ref on Machine object, will retry" "metal3-label-sync"={"Namespace":"metal3","Name":"node-7"}
I0104 16:53:04.956664 1 metal3machine_manager.go:682] controllers/Metal3Machine/Metal3Machine-controller "msg"="Updating machine" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:53:04.957191 1 metal3machine_manager.go:1134] controllers/Metal3Machine/Metal3Machine-controller "msg"="Deleting nodeReuseLabelName from host, if any" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:53:04.958812 1 metal3machine_manager.go:741] controllers/Metal3Machine/Metal3Machine-controller "msg"="Finished updating machine" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:53:08.040312 1 metal3machine_manager.go:1702] controllers/Metal3Machine/Metal3Machine-controller "msg"="error while retrieving nodes with label (metal3.io/uuid=93da9a56-c483-48c3-bdb9-bd8d206c80bc): Get \"https://192.168.111.249:6443/api/v1/nodes?labelSelector=metal3.io%2Fuuid%3D93da9a56-c483-48c3-bdb9-bd8d206c80bc\": dial tcp 192.168.111.249:6443: connect: no route to host" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:53:08.040379 1 metal3machine_manager.go:1295] controllers/Metal3Machine/Metal3Machine-controller "msg"="error retrieving node, requeuing" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:53:21.291599 1 metal3machine_manager.go:682] controllers/Metal3Machine/Metal3Machine-controller "msg"="Updating machine" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:53:21.292213 1 metal3machine_manager.go:1134] controllers/Metal3Machine/Metal3Machine-controller "msg"="Deleting nodeReuseLabelName from host, if any" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:53:21.293917 1 metal3machine_manager.go:741] controllers/Metal3Machine/Metal3Machine-controller "msg"="Finished updating machine" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:53:24.360350 1 metal3machine_manager.go:1702] controllers/Metal3Machine/Metal3Machine-controller "msg"="error while retrieving nodes with label (metal3.io/uuid=93da9a56-c483-48c3-bdb9-bd8d206c80bc): Get \"https://192.168.111.249:6443/api/v1/nodes?labelSelector=metal3.io%2Fuuid%3D93da9a56-c483-48c3-bdb9-bd8d206c80bc\": dial tcp 192.168.111.249:6443: connect: no route to host" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:53:24.360417 1 metal3machine_manager.go:1295] controllers/Metal3Machine/Metal3Machine-controller "msg"="error retrieving node, requeuing" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:53:33.791593 1 metal3labelsync_controller.go:142] controllers/Metal3LabelSync/metal3-label-sync-controller "msg"="Could not find Node Ref on Machine object, will retry" "metal3-label-sync"={"Namespace":"metal3","Name":"node-7"}

$ kubectl logs -n capm3-system ipam-controller-manager-68bb9b98b4-jrshs

I0104 16:49:23.796671 1 ippool_manager.go:106] controllers/IPPool/IPPool-controller "msg"="Fetching IPAddress objects" "metal3-ippool"={"Namespace":"metal3","Name":"provisioning-pool"}
I0104 16:49:23.809212 1 ippool_manager.go:106] controllers/IPPool/IPPool-controller "msg"="Fetching IPAddress objects" "metal3-ippool"={"Namespace":"metal3","Name":"baremetalv4-pool"}
I0104 16:49:23.890712 1 ippool_manager.go:106] controllers/IPPool/IPPool-controller "msg"="Fetching IPAddress objects" "metal3-ippool"={"Namespace":"metal3","Name":"provisioning-pool"}
I0104 16:49:23.892852 1 ippool_manager.go:106] controllers/IPPool/IPPool-controller "msg"="Fetching IPAddress objects" "metal3-ippool"={"Namespace":"metal3","Name":"baremetalv4-pool"}

faisalchishtii commented 2 years ago

OS Details:

# cat /etc/os-release

NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

furkatgofurov7 commented 2 years ago

@faisalchishtii thanks. I noticed errors in the KCP logs:

E0104 16:48:13.900424 1 controller.go:188] controller/kubeadmcontrolplane "msg"="Failed to update KubeadmControlPlane Status" "error"="failed to create remote cluster client: error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"metal3/test1\": Get \"https://192.168.111.249:6443/api?timeout=10s\": dial tcp 192.168.111.249:6443: connect: no route to host" "cluster"="test1" "name"="test1" "namespace"="metal3" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"
E0104 16:48:13.905605 1 controller.go:317] controller/kubeadmcontrolplane "msg"="Reconciler error" "error"="failed to create remote cluster client: error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"metal3/test1\": Get \"https://192.168.111.249:6443/api?timeout=10s\": dial tcp 192.168.111.249:6443: connect: no route to host" "name"="test1" "namespace"="metal3" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"

It looks like the API server is not reachable. Can you please check whether the kubelet is running in the cluster, and share the output of kubectl describe kcp <kcp-name> -n <namespace>?

Also, can you confirm that the host where you are running metal3-dev-env meets the minimum requirements, and that you imported all required environment variables before running make? Since it is an Ubuntu host, you should export:

 export IMAGE_OS=Ubuntu
 export CONTAINER_RUNTIME=docker
 export EPHEMERAL_CLUSTER=kind

as well as the CAPI/CAPM3 API versions you would like to use; otherwise they default to v1beta1 (the CAPI_VERSION and CAPM3_VERSION env vars).
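
For example, a complete export sequence before running make might look like the following sketch (CAPI_VERSION and CAPM3_VERSION are shown at their documented defaults; adjust to the versions you actually want):

```shell
# Environment for metal3-dev-env on an Ubuntu host.
# CAPI_VERSION/CAPM3_VERSION shown at their v1beta1 defaults.
export IMAGE_OS=Ubuntu
export CONTAINER_RUNTIME=docker
export EPHEMERAL_CLUSTER=kind
export CAPI_VERSION=v1beta1
export CAPM3_VERSION=v1beta1
```

These need to be set in the same shell session that runs make (or placed in a file you source first), since make reads them from the environment.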

faisalchishtii commented 2 years ago

Thank you @furkatgofurov7

I cleaned up using make clean and retried after exporting the variables as you suggested, but the API server is still unreachable.

cluster: error creating dynamic rest mapper for remote cluster \"metal3/test1\": Get \"https://192.168.111.249:6443/api?timeout=10s\": dial tcp 192.168.111.249:6443: connect: no route to host" "name"="test1" "namespace"="metal3" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"

The kubelet on the management cluster (kind cluster) is running okay.

$ ps -ef | grep kubelet

root 1058290 1057488 11 09:29 ? 00:00:01 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock --fail-swap-on=false --node-ip=172.20.0.2 --node-labels= --pod-infra-container-image=k8s.gcr.io/pause:3.5 --provider-id=kind://docker/kind/kind-control-plane --fail-swap-on=false --cgroup-root=/kubele

The kubelet on the test1-gpjn6 machine is failing to start:

[metal3@test1-gpjn6 ~]$ sudo tail -f /var/log/messages

Jan  6 08:18:39 test1-gpjn6 kubelet[6761]: E0106 08:18:39.219920    6761 server.go:206] "Failed to load kubelet config file" err="failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file \"/var/lib/kubelet/config.yaml\", error: open /var/lib/kubelet/config.yaml: no such file or directory" path="/var/lib/kubelet/config.yaml"
Jan  6 08:18:39 test1-gpjn6 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Jan  6 08:18:39 test1-gpjn6 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Jan  6 08:18:49 test1-gpjn6 systemd[1]: kubelet.service: Service RestartSec=10s expired, scheduling restart.
Jan  6 08:18:49 test1-gpjn6 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 85.
Jan  6 08:18:49 test1-gpjn6 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Jan  6 08:18:49 test1-gpjn6 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Jan  6 08:18:49 test1-gpjn6 kubelet[6778]: E0106 08:18:49.444436    6778 server.go:206] "Failed to load kubelet config file" err="failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file \"/var/lib/kubelet/config.yaml\", error: open /var/lib/kubelet/config.yaml: no such file or directory" path="/var/lib/kubelet/config.yaml"
Jan  6 08:18:49 test1-gpjn6 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Jan  6 08:18:49 test1-gpjn6 systemd[1]: kubelet.service: Failed with result 'exit-code'.

It looks like it is stuck while running kubeadm init. I checked the cloud-init logs, and it seems to be failing due to timeout errors. See the complete cloud-init log below.

cloud-init-output.log

KCP details below

$ kubectl get kcp -A

NAMESPACE   NAME    CLUSTER   INITIALIZED   API SERVER AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE   VERSION
metal3      test1   test1                                          1                  1         1             11h   v1.22.3

adam@nodek9:~$ kubectl describe kcp test1 -n metal3

Name:         test1
Namespace:    metal3
Labels:       cluster.x-k8s.io/cluster-name=test1
Annotations:  <none>
API Version:  controlplane.cluster.x-k8s.io/v1beta1
Kind:         KubeadmControlPlane
Metadata:
  Creation Timestamp:  2022-01-05T15:45:45Z
  Finalizers:
    kubeadm.controlplane.cluster.x-k8s.io
  Generation:  1
  Managed Fields:
    API Version:  controlplane.cluster.x-k8s.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:kubeadmConfigSpec:
          .:
          f:clusterConfiguration:
          f:files:
          f:initConfiguration:
            .:
            f:nodeRegistration:
              .:
              f:kubeletExtraArgs:
                .:
                f:cgroup-driver:
                f:container-runtime:
                f:container-runtime-endpoint:
                f:feature-gates:
                f:node-labels:
                f:provider-id:
                f:runtime-request-timeout:
              f:name:
          f:joinConfiguration:
            .:
            f:controlPlane:
            f:nodeRegistration:
              .:
              f:kubeletExtraArgs:
                .:
                f:cgroup-driver:
                f:container-runtime:
                f:container-runtime-endpoint:
                f:feature-gates:
                f:node-labels:
                f:provider-id:
                f:runtime-request-timeout:
              f:name:
          f:postKubeadmCommands:
          f:preKubeadmCommands:
          f:users:
        f:machineTemplate:
          .:
          f:infrastructureRef:
            .:
            f:apiVersion:
            f:kind:
            f:name:
          f:nodeDrainTimeout:
        f:replicas:
        f:rolloutStrategy:
          .:
          f:rollingUpdate:
            .:
            f:maxSurge:
        f:version:
    Manager:      kubectl-client-side-apply
    Operation:    Update
    Time:         2022-01-05T15:45:45Z
    API Version:  controlplane.cluster.x-k8s.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"kubeadm.controlplane.cluster.x-k8s.io":
        f:labels:
          .:
          f:cluster.x-k8s.io/cluster-name:
        f:ownerReferences:
          .:
          k:{"uid":"228583f0-855a-4913-bf32-1316b16260fb"}:
    Manager:      manager
    Operation:    Update
    Time:         2022-01-05T15:45:48Z
    API Version:  controlplane.cluster.x-k8s.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:conditions:
        f:observedGeneration:
        f:replicas:
        f:selector:
        f:unavailableReplicas:
        f:updatedReplicas:
    Manager:      manager
    Operation:    Update
    Subresource:  status
    Time:         2022-01-05T15:46:00Z
  Owner References:
    API Version:           cluster.x-k8s.io/v1beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Cluster
    Name:                  test1
    UID:                   228583f0-855a-4913-bf32-1316b16260fb
  Resource Version:        18293
  UID:                     308cdc39-4e60-4c34-9776-c24a61bdfcda
Spec:
  Kubeadm Config Spec:
    Cluster Configuration:
      API Server:
      Controller Manager:
      Dns:
      Etcd:
      Networking:
      Scheduler:
    Files:
      Content:  #!/bin/bash
set -e
url="$1"
dst="$2"
filename="$(basename $url)"
tmpfile="/tmp/$filename"
curl -sSL -w "%{http_code}" "$url" | sed "s:/usr/bin:/usr/local/bin:g" > /tmp/"$filename"
http_status=$(cat "$tmpfile" | tail -n 1)
if [ "$http_status" != "200" ]; then
  echo "Error: unable to retrieve $filename file";
  exit 1;
else
  cat "$tmpfile"| sed '$d' > "$dst";
fi

      Owner:        root:root
      Path:         /usr/local/bin/retrieve.configuration.files.sh
      Permissions:  0755
      Content:      #!/bin/bash
while :; do
  curl -sk https://127.0.0.1:6443/healthz 1>&2 > /dev/null
  isOk=$?
  isActive=$(systemctl show -p ActiveState keepalived.service | cut -d'=' -f2)
  if [ $isOk == "0" ] &&  [ $isActive != "active" ]; then
    logger 'API server is healthy, however keepalived is not running, starting keepalived'
    echo 'API server is healthy, however keepalived is not running, starting keepalived'
    sudo systemctl start keepalived.service
  elif [ $isOk != "0" ] &&  [ $isActive == "active" ]; then
    logger 'API server is not healthy, however keepalived running, stopping keepalived'
    echo 'API server is not healthy, however keepalived running, stopping keepalived'
    sudo systemctl stop keepalived.service
  fi
  sleep 5
done

      Owner:        root:root
      Path:         /usr/local/bin/monitor.keepalived.sh
      Permissions:  0755
      Content:      [Unit]
Description=Monitors keepalived adjusts status with that of API server
After=syslog.target network-online.target
[Service]
Type=simple
Restart=always
ExecStart=/usr/local/bin/monitor.keepalived.sh
[Install]
WantedBy=multi-user.target

      Owner:    root:root
      Path:     /lib/systemd/system/monitor.keepalived.service
      Content:  ! Configuration File for keepalived
global_defs {
    notification_email {
    sysadmin@example.com
    support@example.com
    }
    notification_email_from lb@example.com
    smtp_server localhost
    smtp_connect_timeout 30
}
vrrp_instance VI_1 {
    state MASTER
    interface eth1
    virtual_router_id 1
    priority 101
    advert_int 1
    virtual_ipaddress {
        192.168.111.249
    }
}

      Path:     /etc/keepalived/keepalived.conf
      Content:  BOOTPROTO=none
DEVICE=eth0
ONBOOT=yes
TYPE=Ethernet
USERCTL=no
BRIDGE=ironicendpoint

      Owner:        root:root
      Path:         /etc/sysconfig/network-scripts/ifcfg-eth0
      Permissions:  0644
      Content:      TYPE=Bridge
DEVICE=ironicendpoint
ONBOOT=yes
USERCTL=no
BOOTPROTO="static"
IPADDR={{ ds.meta_data.provisioningIP }}
PREFIX={{ ds.meta_data.provisioningCIDR }}

      Owner:        root:root
      Path:         /etc/sysconfig/network-scripts/ifcfg-ironicendpoint
      Permissions:  0644
      Content:      [kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=0
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg

      Owner:        root:root
      Path:         /etc/yum.repos.d/kubernetes.repo
      Permissions:  0644
      Content:      [registries.search]
registries = ['docker.io']

[registries.insecure]
registries = ['192.168.111.1:5000']

      Path:  /etc/containers/registries.conf
    Init Configuration:
      Local API Endpoint:
      Node Registration:
        Kubelet Extra Args:
          Cgroup - Driver:                 systemd
          Container - Runtime:             remote
          Container - Runtime - Endpoint:  unix:///var/run/crio/crio.sock
          Feature - Gates:                 AllAlpha=false
          Node - Labels:                   metal3.io/uuid={{ ds.meta_data.uuid }}
          Provider - Id:                   metal3://{{ ds.meta_data.uuid }}
          Runtime - Request - Timeout:     5m
        Name:                              {{ ds.meta_data.name }}
    Join Configuration:
      Control Plane:
        Local API Endpoint:
      Discovery:
      Node Registration:
        Kubelet Extra Args:
          Cgroup - Driver:                 systemd
          Container - Runtime:             remote
          Container - Runtime - Endpoint:  unix:///var/run/crio/crio.sock
          Feature - Gates:                 AllAlpha=false
          Node - Labels:                   metal3.io/uuid={{ ds.meta_data.uuid }}
          Provider - Id:                   metal3://{{ ds.meta_data.uuid }}
          Runtime - Request - Timeout:     5m
        Name:                              {{ ds.meta_data.name }}
    Post Kubeadm Commands:
      mkdir -p /home/metal3/.kube
      cp /etc/kubernetes/admin.conf /home/metal3/.kube/config
      chown metal3:metal3 /home/metal3/.kube/config
    Pre Kubeadm Commands:
      systemctl restart NetworkManager.service
      ifup eth0
      systemctl enable --now crio keepalived kubelet
      systemctl link /lib/systemd/system/monitor.keepalived.service
      systemctl enable monitor.keepalived.service
      systemctl start monitor.keepalived.service
    Users:
      Name:  metal3
      Ssh Authorized Keys:
        ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDgLACiOeo0CyvVCbVVlL971th+w3uKvMskhSsPIkkBwQl+g+/QnfSxZitK/Ihzpiom5jkOA5T7GDm7mm+xvk+55lDa7ahm90sutD4/SaDw7ylLEUQc4EF20MtddckfpkazDKRZx0Yt6PJxN54INsDxDz3D9bklXEGzjh1OBFw08ayWQC+f4JZ9FUvIMaemAWRzCdBgeiECzIZZhB4d/fgFSGMG/RUG+sXAAHRXY0/oTGmlVvCzJsPy4yoR3rKxONzmyOK1qfHRHqr+pCzKD9abV7A9iyIiDwlTUclNYRCHE2laiYcGjje9IZAr/sPzmMYwwu+yvyxcLEPtXrHo/nyyGAc0TBu676tioMc1NnjLMcE/bRhI+zeCw0qhqFR6XTx0oMxrf8ofhJjC4t4RDvmqRDxV9gmcRBvLRJe82YAlHAX8GRB3hfLtpnxYnrFT/SqX+lyJspyinauNoU9OWB2ejCbEf0V4Ot8j/AlN7Fkr4A1PNV3qV001YEXDi0j4nR0= adam@nodek9
      Sudo:  ALL=(ALL) NOPASSWD:ALL
  Machine Template:
    Infrastructure Ref:
      API Version:  infrastructure.cluster.x-k8s.io/v1beta1
      Kind:         Metal3MachineTemplate
      Name:         test1-controlplane
      Namespace:    metal3
    Metadata:
    Node Drain Timeout:  0s
  Replicas:              1
  Rollout Strategy:
    Rolling Update:
      Max Surge:  1
    Type:         RollingUpdate
  Version:        v1.22.3
Status:
  Conditions:
    Last Transition Time:  2022-01-05T15:48:16Z
    Message:               1 of 2 completed
    Reason:                DrainingFailed @ Machine/test1-qgkdh
    Severity:              Error
    Status:                False
    Type:                  Ready
    Last Transition Time:  2022-01-05T15:45:50Z
    Reason:                WaitingForKubeadmInit
    Severity:              Info
    Status:                False
    Type:                  Available
    Last Transition Time:  2022-01-05T15:45:50Z
    Status:                True
    Type:                  CertificatesAvailable
    Last Transition Time:  2022-01-05T15:45:57Z
    Status:                True
    Type:                  MachinesCreated
    Last Transition Time:  2022-01-05T15:48:16Z
    Message:               1 of 2 completed
    Reason:                DrainingFailed @ Machine/test1-qgkdh
    Severity:              Error
    Status:                False
    Type:                  MachinesReady
    Last Transition Time:  2022-01-05T15:46:00Z
    Status:                True
    Type:                  Resized
  Observed Generation:     1
  Replicas:                1
  Selector:                cluster.x-k8s.io/cluster-name=test1,cluster.x-k8s.io/control-plane
  Unavailable Replicas:    1
  Updated Replicas:        1
Events:                    <none>
furkatgofurov7 commented 2 years ago

> It looks like to be stuck when running the kubeadm init thingy. I checked the cloud-init logs and seems like it is failing due to timeout errors. See complete cloud-init log below.
>
> cloud-init-output.log

@faisalchishtii this is helpful, thanks. As suspected, the kubelet is not running properly on that machine, and in the cloud-init logs you attached I found this:

error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-apiserver:v1.22.3: output: time="2022-01-06T09:44:13Z" level=fatal msg="pulling image: rpc error: code = Unknown desc = pinging container registry k8s.gcr.io: Get \"https://k8s.gcr.io/v2/\": dial tcp: lookup k8s.gcr.io on 8.8.4.4:53: read udp 192.168.111.100:60557->8.8.4.4:53: i/o timeout" , error: exit status 1
[ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-controller-manager:v1.22.3: output: time="2022-01-06T09:49:13Z" level=fatal msg="pulling image: rpc error: code = Unknown desc = pinging container registry k8s.gcr.io: Get \"https://k8s.gcr.io/v2/\": dial tcp: lookup k8s.gcr.io on 8.8.4.4:53: read udp 192.168.111.100:56166->8.8.4.4:53: i/o timeout" , error: exit status 1
[ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-scheduler:v1.22.3: output: time="2022-01-06T09:54:13Z" level=fatal msg="pulling image: rpc error: code = Unknown desc = pinging container registry k8s.gcr.io: Get \"https://k8s.gcr.io/v2/\": dial tcp: lookup k8s.gcr.io on 8.8.4.4:53: read udp 192.168.111.100:32973->8.8.4.4:53: i/o timeout" , error: exit status 1
[ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-proxy:v1.22.3: output: time="2022-01-06T09:59:13Z" level=fatal msg="pulling image: rpc error: code = Unknown desc = pinging container registry k8s.gcr.io: Get \"https://k8s.gcr.io/v2/\": dial tcp: lookup k8s.gcr.io on 8.8.4.4:53: read udp 192.168.111.100:50290->8.8.4.4:53: i/o timeout" , error: exit status 1
[ERROR ImagePull]: failed to pull image k8s.gcr.io/pause:3.5: output: time="2022-01-06T10:04:14Z" level=fatal msg="pulling image: rpc error: code = Unknown desc = pinging container registry k8s.gcr.io: Get \"https://k8s.gcr.io/v2/\": dial tcp: lookup k8s.gcr.io on 8.8.4.4:53: read udp 192.168.111.100:36564->8.8.4.4:53: i/o timeout" , error: exit status 1
[ERROR ImagePull]: failed to pull image k8s.gcr.io/etcd:3.5.0-0: output: time="2022-01-06T10:09:14Z" level=fatal msg="pulling image: rpc error: code = Unknown desc = pinging container registry k8s.gcr.io: Get \"https://k8s.gcr.io/v2/\": dial tcp: lookup k8s.gcr.io on 8.8.4.4:53: read udp 192.168.111.100:43208->8.8.4.4:53: i/o timeout" , error: exit status 1
[ERROR ImagePull]: failed to pull image k8s.gcr.io/coredns/coredns:v1.8.4: output: time="2022-01-06T10:14:14Z" level=fatal msg="pulling image: rpc error: code = Unknown desc = pinging container registry k8s.gcr.io: Get \"https://k8s.gcr.io/v2/\": dial tcp: lookup k8s.gcr.io on 8.8.4.4:53: read udp 192.168.111.100:46906->8.8.4.4:53: i/o timeout" , error: exit status 1

While running kubeadm init on the machine, it failed to pull images from the k8s container registry, and that caused the kubeadm failure. The kubelet error Failed to load kubelet config file" err="failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file \"/var/lib/kubelet/config.yaml\", error: open /var/lib/kubelet/config.yaml: no such file or directory" is a well-known symptom, reported several times in upstream Kubernetes repositories (e.g. https://github.com/kubernetes/kubernetes/issues/73779, https://github.com/kubernetes/kubernetes/issues/65863). When kubeadm init/join completes, the /var/lib/kubelet/config.yaml file is created and the kubelet starts properly. Without that file the kubelet keeps failing, and the likely reason the file was never created is that images could not be pulled from the k8s container registry.
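
A quick way to check on the node how far kubeadm got is to look for the files a successful init writes; a minimal sketch (the `present` helper is just for illustration, the paths are the standard kubeadm layout):

```shell
#!/usr/bin/env bash
# Sketch: report which artifacts of a successful `kubeadm init` exist.
# Their absence suggests init never completed (e.g. image pulls failed).

present() {
  # Returns 0 if the given path exists as a regular file.
  [ -f "$1" ]
}

for f in /var/lib/kubelet/config.yaml /etc/kubernetes/admin.conf; do
  if present "$f"; then
    echo "present: $f"
  else
    echo "missing: $f"
  fi
done
```

If both files are missing, the kubelet restart loop in the logs above is expected, and the real problem is upstream of the kubelet.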

That said, this does not appear to be a metal3 issue but rather a problem with the environment where you are running make. Next question: are you by any chance behind a corporate proxy? Could you try the solutions suggested on this thread to fix image pulling during kubeadm init, and once we are sure images can be pulled properly, re-run metal3-dev-env?
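
To confirm the DNS theory from inside the target machine, a minimal check like the following could be used (the host list is only an example; `getent` queries the system resolver configured via /etc/resolv.conf):

```shell
#!/usr/bin/env bash
# Sketch: check whether the registries kubeadm needs are resolvable
# from this machine. Host names here are examples, not an exhaustive list.

check_dns() {
  # Returns 0 if the system resolver can resolve the given name.
  getent hosts "$1" > /dev/null
}

for host in k8s.gcr.io quay.io; do
  if check_dns "$host"; then
    echo "OK: $host resolves"
  else
    echo "FAIL: $host does not resolve - image pulls will time out"
  fi
done
```

If these names fail to resolve, the kubeadm ImagePull errors above follow directly, regardless of how the rest of the cluster is configured.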

furkatgofurov7 commented 2 years ago

Also, you can verify that all needed images can be pulled successfully by running: kubeadm config images pull

faisalchishtii commented 2 years ago

It turned out that DNS resolution was failing on the target machines/BMH hosts (CentOS VMs created by libvirt) even though it was configured correctly. While I couldn't find the root cause, I was able to work around it temporarily by manually adding IPs for k8s.gcr.io and a few other domains to /etc/hosts.

This is how my /etc/hosts looks to make it work temporarily:

$ sudo cat /etc/hosts 
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4 airship-ci-centos-node-img-a85794d843
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

142.251.10.82     k8s.gcr.io
34.107.204.206    dl.k8s.io
142.250.193.132   googleapis.com
172.217.160.144   storage.googleapis.com
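
A small sketch to audit which names are pinned this way (the `pinned` helper and the `HOSTS_FILE` override are hypothetical, just for illustration):

```shell
#!/usr/bin/env bash
# Sketch: report which registry names have a static entry in the hosts
# file, so temporary DNS workarounds like the one above are easy to audit.
HOSTS_FILE="${HOSTS_FILE:-/etc/hosts}"

pinned() {
  # Returns 0 if $1 appears as a hostname on a non-comment line.
  grep -Eq "^[^#]*[[:space:]]$1([[:space:]]|\$)" "$HOSTS_FILE"
}

for host in k8s.gcr.io dl.k8s.io storage.googleapis.com; do
  if pinned "$host"; then
    echo "pinned: $host"
  else
    echo "not pinned: $host (resolved via DNS)"
  fi
done
```

Worth noting as a caveat: the pinned IPs can go stale when the registries move, so this is a stopgap until the underlying resolver problem is fixed.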

Thanks for all the help @furkatgofurov7

furkatgofurov7 commented 2 years ago

> It turned out that the dns resolution was failing on the target machines/bmh hosts (centos vms created by libvirt) even though it is configured correctly. While I couldn't understand the root cause, I was able to temporarily fix it by manually adding IP's for k8s.gcr.io and few other domains in /etc/hosts.
>
> This is how my /etc/hosts looks like to make it to work temporarily:
>
> $ sudo cat /etc/hosts
> 127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4 airship-ci-centos-node-img-a85794d843
> ::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
>
> 142.251.10.82     k8s.gcr.io
> 34.107.204.206    dl.k8s.io
> 142.250.193.132   googleapis.com
> 172.217.160.144   storage.googleapis.com
>
> Thanks for all the help @furkatgofurov7

No problem. Closing this since it turned out to be an environment problem; please feel free to reopen it if you want to discuss this issue further.

/close

metal3-io-bot commented 2 years ago

@furkatgofurov7: Closing this issue.

In response to [this](https://github.com/metal3-io/metal3-dev-env/issues/895#issuecomment-1007797540). Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.