Closed: faisalchishtii closed this issue 2 years ago
@faisalchishtii hi!
Thanks for reporting. I see a "Drain failed" error in the logs provided, but that could be due to a number of different reasons.
What I suggest is to run the deprovisioning scripts in the reverse order of the provisioning scripts (deprovision/worker.sh, deprovision/controlplane.sh, then deprovision/cluster.sh) and, once we make sure all resources are gone, run the provisioning scripts again in the same order you mentioned above. Provisioning the control plane and the worker usually takes some time, which is why you see one of them in the Provisioning phase (control plane) and the other in Pending (worker), but after roughly 5-7 minutes they should both become Provisioned.
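For reference, that teardown-then-reprovision cycle can be sketched as a small script. This is only a sketch: the deprovision script names are the ones mentioned above, while the provision/ paths are assumed to mirror them; substitute the actual paths from your checkout.

```shell
#!/usr/bin/env bash
# Deprovision in the reverse order of provisioning, verify resources are gone,
# then provision again in the original order. Paths are assumed to be relative
# to the dev-env checkout; adjust as needed.
set -euo pipefail

teardown=(deprovision/worker.sh deprovision/controlplane.sh deprovision/cluster.sh)
provision=(provision/cluster.sh provision/controlplane.sh provision/worker.sh)

for s in "${teardown[@]}"; do
  echo "deprovisioning step: $s"
  # bash "$s"   # uncomment to actually execute each script
done

# At this point, verify all resources are gone, e.g.:
#   kubectl get machines,m3m,bmh -A

for s in "${provision[@]}"; do
  echo "provisioning step: $s"
  # bash "$s"
done
```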
If that does not help, please attach the output of listing all pods/resources (kubectl get pods/bmh/m3m/machine -A) and some more logs from the capi-controller-manager and capm3-controller-manager pods.
priority/awaiting-more-evidence
@furkatgofurov7 Thanks for the quick response. I tried the above but am still facing the same issue.
$ kubectl get machines -A
NAMESPACE   NAME                     CLUSTER   NODENAME   PROVIDERID   PHASE          AGE   VERSION
metal3      test1-6d8cc5965f-zstvg   test1                             Pending        22m   v1.22.3
metal3      test1-rbbds              test1                             Provisioning   23m   v1.22.3
$ kubectl get pods -A
NAMESPACE                           NAME                                                            READY   STATUS    RESTARTS   AGE
baremetal-operator-system           baremetal-operator-controller-manager-545c6f8596-vxprq          2/2     Running   0          5h30m
capi-kubeadm-bootstrap-system       capi-kubeadm-bootstrap-controller-manager-6799c7bd56-lttvc      1/1     Running   0          5h31m
capi-kubeadm-control-plane-system   capi-kubeadm-control-plane-controller-manager-8d465b66d-l26rw   1/1     Running   0          5h31m
capi-system                         capi-controller-manager-69d4577477-2zjg4                        1/1     Running   0          5h31m
capm3-system                        capm3-controller-manager-dbdc85bbf-5phc5                        1/1     Running   0          5h30m
capm3-system                        ipam-controller-manager-68bb9b98b4-jrshs                        1/1     Running   0          5h30m
cert-manager                        cert-manager-848f547974-9hwkc                                   1/1     Running   0          5h31m
cert-manager                        cert-manager-cainjector-54f4cc6b5-jmtnd                         1/1     Running   0          5h31m
cert-manager                        cert-manager-webhook-7c9588c76-dtcwq                            1/1     Running   0          5h31m
kube-system                         coredns-78fcd69978-7hdt7                                        1/1     Running   0          5h31m
kube-system                         coredns-78fcd69978-jblmx                                        1/1     Running   0          5h31m
kube-system                         etcd-kind-control-plane                                         1/1     Running   0          5h32m
kube-system                         kindnet-vb47l                                                   1/1     Running   0          5h31m
kube-system                         kube-apiserver-kind-control-plane                               1/1     Running   0          5h32m
kube-system                         kube-controller-manager-kind-control-plane                      1/1     Running   0          5h32m
kube-system                         kube-proxy-6f9sl                                                1/1     Running   0          5h31m
kube-system                         kube-scheduler-kind-control-plane                               1/1     Running   0          5h32m
local-path-storage                  local-path-provisioner-85494db59d-jcxzw                         1/1     Running   0          5h31m
$ kubectl get bmh -A
NAMESPACE   NAME     STATE         CONSUMER                   ONLINE   ERROR   AGE
metal3      node-0   available                                true             5h31m
metal3      node-1   available                                true             5h31m
metal3      node-2   available                                false            5h31m
metal3      node-3   available                                true             5h31m
metal3      node-4   available                                true             5h31m
metal3      node-5   available                                true             5h31m
metal3      node-6   available                                true             5h31m
metal3      node-7   provisioned   test1-controlplane-rrspl   true             5h31m
metal3      node-8   available                                true             5h31m
$ kubectl get m3m -A
NAMESPACE   NAME                       AGE   PROVIDERID   READY   CLUSTER   PHASE
metal3      test1-controlplane-rrspl   43m                        test1
metal3      test1-workers-jwpzx        43m                        test1
$ kubectl logs baremetal-operator-controller-manager-545c6f8596-vxprq manager -n baremetal-operator-system
{"level":"info","ts":1641314695.1507816,"logger":"provisioner.ironic","msg":"retrieved BIOS settings for node","host":"metal3~node-3","node":"bd7886b8-8aaf-4b28-881a-52bd2db0e3c4","size":10}
{"level":"info","ts":1641314695.1508417,"logger":"controllers.HostFirmwareSettings","msg":"getting firmwareSchema","hostfirmwaresettings":"metal3/node-3"}
{"level":"info","ts":1641314695.1509423,"logger":"controllers.HostFirmwareSettings","msg":"found existing firmwareSchema resource","hostfirmwaresettings":"metal3/node-3"}
{"level":"info","ts":1641314695.1697648,"logger":"controllers.HostFirmwareSettings","msg":"start","hostfirmwaresettings":"metal3/node-5"}
{"level":"info","ts":1641314695.2236907,"logger":"controllers.HostFirmwareSettings","msg":"retrieving firmware settings and saving to resource","hostfirmwaresettings":"metal3/node-5","node":"a462ad6c-3821-4d87-95b6-42ee7e93e0f4"}
{"level":"info","ts":1641314695.3128333,"logger":"provisioner.ironic","msg":"retrieved BIOS settings for node","host":"metal3~node-5","node":"a462ad6c-3821-4d87-95b6-42ee7e93e0f4","size":10}
{"level":"info","ts":1641314695.3128867,"logger":"controllers.HostFirmwareSettings","msg":"getting firmwareSchema","hostfirmwaresettings":"metal3/node-5"}
{"level":"info","ts":1641314695.3129978,"logger":"controllers.HostFirmwareSettings","msg":"found existing firmwareSchema resource","hostfirmwaresettings":"metal3/node-5"}
{"level":"info","ts":1641314695.3274188,"logger":"controllers.HostFirmwareSettings","msg":"start","hostfirmwaresettings":"metal3/node-0"}
{"level":"info","ts":1641314695.3775115,"logger":"controllers.HostFirmwareSettings","msg":"retrieving firmware settings and saving to resource","hostfirmwaresettings":"metal3/node-0","node":"db03a7f8-91a7-486c-8508-11d7187913e8"}
{"level":"info","ts":1641314695.452342,"logger":"provisioner.ironic","msg":"retrieved BIOS settings for node","host":"metal3~node-0","node":"db03a7f8-91a7-486c-8508-11d7187913e8","size":0}
{"level":"info","ts":1641314695.452386,"logger":"controllers.HostFirmwareSettings","msg":"getting firmwareSchema","hostfirmwaresettings":"metal3/node-0"}
{"level":"info","ts":1641314695.452437,"logger":"controllers.HostFirmwareSettings","msg":"found existing firmwareSchema resource","hostfirmwaresettings":"metal3/node-0"}
{"level":"info","ts":1641314695.4621296,"logger":"controllers.HostFirmwareSettings","msg":"start","hostfirmwaresettings":"metal3/node-6"}
{"level":"info","ts":1641314695.5159986,"logger":"controllers.HostFirmwareSettings","msg":"retrieving firmware settings and saving to resource","hostfirmwaresettings":"metal3/node-6","node":"02da32c3-5300-4657-94ac-24af0a387df2"}
{"level":"info","ts":1641314695.5853393,"logger":"provisioner.ironic","msg":"retrieved BIOS settings for node","host":"metal3~node-6","node":"02da32c3-5300-4657-94ac-24af0a387df2","size":0}
{"level":"info","ts":1641314695.5853763,"logger":"controllers.HostFirmwareSettings","msg":"getting firmwareSchema","hostfirmwaresettings":"metal3/node-6"}
{"level":"info","ts":1641314695.5855024,"logger":"controllers.HostFirmwareSettings","msg":"found existing firmwareSchema resource","hostfirmwaresettings":"metal3/node-6"}
{"level":"info","ts":1641314695.6228101,"logger":"controllers.HostFirmwareSettings","msg":"start","hostfirmwaresettings":"metal3/node-2"}
$ kubectl logs capi-kubeadm-bootstrap-controller-manager-6799c7bd56-lttvc -n capi-kubeadm-bootstrap-system
I0104 11:28:13.321993 1 request.go:665] Waited for 1.027918033s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/cert-manager.io/v1alpha3?timeout=32s
I0104 11:28:13.525732 1 logr.go:249] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"="localhost:8080"
I0104 11:28:13.527293 1 logr.go:249] controller-runtime/builder "msg"="skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called" "GVK"={"Group":"bootstrap.cluster.x-k8s.io","Version":"v1beta1","Kind":"KubeadmConfig"}
I0104 11:28:13.527349 1 logr.go:249] controller-runtime/builder "msg"="Registering a validating webhook" "GVK"={"Group":"bootstrap.cluster.x-k8s.io","Version":"v1beta1","Kind":"KubeadmConfig"} "path"="/validate-bootstrap-cluster-x-k8s-io-v1beta1-kubeadmconfig"
I0104 11:28:13.527501 1 server.go:146] controller-runtime/webhook "msg"="Registering webhook" "path"="/validate-bootstrap-cluster-x-k8s-io-v1beta1-kubeadmconfig"
I0104 11:28:13.527735 1 server.go:146] controller-runtime/webhook "msg"="Registering webhook" "path"="/convert"
I0104 11:28:13.527868 1 logr.go:249] controller-runtime/builder "msg"="Conversion webhook enabled" "GVK"={"Group":"bootstrap.cluster.x-k8s.io","Version":"v1beta1","Kind":"KubeadmConfig"}
I0104 11:28:13.527905 1 logr.go:249] controller-runtime/builder "msg"="skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called" "GVK"={"Group":"bootstrap.cluster.x-k8s.io","Version":"v1beta1","Kind":"KubeadmConfigTemplate"}
I0104 11:28:13.527943 1 logr.go:249] controller-runtime/builder "msg"="Registering a validating webhook" "GVK"={"Group":"bootstrap.cluster.x-k8s.io","Version":"v1beta1","Kind":"KubeadmConfigTemplate"} "path"="/validate-bootstrap-cluster-x-k8s-io-v1beta1-kubeadmconfigtemplate"
I0104 11:28:13.528028 1 server.go:146] controller-runtime/webhook "msg"="Registering webhook" "path"="/validate-bootstrap-cluster-x-k8s-io-v1beta1-kubeadmconfigtemplate"
I0104 11:28:13.528153 1 logr.go:249] controller-runtime/builder "msg"="Conversion webhook enabled" "GVK"={"Group":"bootstrap.cluster.x-k8s.io","Version":"v1beta1","Kind":"KubeadmConfigTemplate"}
I0104 11:28:13.528288 1 logr.go:249] setup "msg"="starting manager" "version"=""
I0104 11:28:13.528404 1 server.go:214] controller-runtime/webhook/webhooks "msg"="Starting webhook server"
I0104 11:28:13.528606 1 internal.go:362] "msg"="Starting server" "addr"={"IP":"::","Port":9440,"Zone":""} "kind"="health probe"
I0104 11:28:13.528631 1 leaderelection.go:248] attempting to acquire leader lease capi-kubeadm-bootstrap-system/kubeadm-bootstrap-manager-leader-election-capi...
I0104 11:28:13.528760 1 logr.go:249] controller-runtime/certwatcher "msg"="Updated current TLS certificate"
I0104 11:28:13.528760 1 internal.go:362] "msg"="Starting server" "addr"={"IP":"127.0.0.1","Port":8080,"Zone":""} "kind"="metrics" "path"="/metrics"
I0104 11:28:13.528958 1 logr.go:249] controller-runtime/webhook "msg"="Serving webhook server" "host"="" "port"=9443
I0104 11:28:13.529020 1 logr.go:249] controller-runtime/certwatcher "msg"="Starting certificate watcher"
I0104 11:28:13.581641 1 leaderelection.go:258] successfully acquired lease capi-kubeadm-bootstrap-system/kubeadm-bootstrap-manager-leader-election-capi
I0104 11:28:13.581907 1 controller.go:178] controller/kubeadmconfig "msg"="Starting EventSource" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "source"="kind source: v1beta1.KubeadmConfig"
I0104 11:28:13.581959 1 controller.go:178] controller/kubeadmconfig "msg"="Starting EventSource" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "source"="kind source: v1beta1.Machine"
I0104 11:28:13.582000 1 controller.go:178] controller/kubeadmconfig "msg"="Starting EventSource" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "source"="kind source: *v1beta1.Cluster"
I0104 11:28:13.582028 1 controller.go:186] controller/kubeadmconfig "msg"="Starting Controller" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig"
I0104 11:28:13.683027 1 controller.go:220] controller/kubeadmconfig "msg"="Starting workers" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "worker count"=10
I0104 12:02:28.034490 1 control_plane_init_mutex.go:99] init-locker "msg"="Attempting to acquire the lock" "cluster-name"="test1" "configmap-name"="test1-lock" "machine-name"="test1-8744d" "namespace"="metal3"
I0104 12:02:28.037992 1 kubeadmconfig_controller.go:380] controller/kubeadmconfig "msg"="Creating BootstrapData for the init control plane" "kind"="Machine" "name"="test1-8744d" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="23492"
I0104 12:02:28.038813 1 kubeadmconfig_controller.go:872] controller/kubeadmconfig "msg"="Altering ClusterConfiguration" "name"="test1-h8dvr" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "ControlPlaneEndpoint"="192.168.111.249:6443"
I0104 12:02:28.038857 1 kubeadmconfig_controller.go:878] controller/kubeadmconfig "msg"="Altering ClusterConfiguration" "name"="test1-h8dvr" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "ClusterName"="test1"
I0104 12:02:28.038885 1 kubeadmconfig_controller.go:891] controller/kubeadmconfig "msg"="Altering ClusterConfiguration" "name"="test1-h8dvr" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "ServiceSubnet"="10.96.0.0/12"
I0104 12:02:28.038916 1 kubeadmconfig_controller.go:897] controller/kubeadmconfig "msg"="Altering ClusterConfiguration" "name"="test1-h8dvr" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "PodSubnet"="192.168.0.0/18"
I0104 12:02:28.038946 1 kubeadmconfig_controller.go:904] controller/kubeadmconfig "msg"="Altering ClusterConfiguration" "name"="test1-h8dvr" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "KubernetesVersion"="v1.22.3"
I0104 12:02:28.379651 1 kubeadmconfig_controller.go:380] controller/kubeadmconfig "msg"="Creating BootstrapData for the init control plane" "kind"="Machine" "name"="test1-8744d" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="23505"
I0104 12:02:28.518949 1 kubeadmconfig_controller.go:943] controller/kubeadmconfig "msg"="bootstrap data secret for KubeadmConfig already exists, updating" "name"="test1-h8dvr" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "KubeadmConfig"="test1-h8dvr" "secret"="test1-h8dvr"
I0104 16:18:32.660485 1 control_plane_init_mutex.go:99] init-locker "msg"="Attempting to acquire the lock" "cluster-name"="test1" "configmap-name"="test1-lock" "machine-name"="test1-rbbds" "namespace"="metal3"
I0104 16:18:32.683273 1 kubeadmconfig_controller.go:380] controller/kubeadmconfig "msg"="Creating BootstrapData for the init control plane" "kind"="Machine" "name"="test1-rbbds" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="190942"
I0104 16:18:32.684792 1 kubeadmconfig_controller.go:872] controller/kubeadmconfig "msg"="Altering ClusterConfiguration" "name"="test1-7cb7s" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "ControlPlaneEndpoint"="192.168.111.249:6443"
I0104 16:18:32.684834 1 kubeadmconfig_controller.go:878] controller/kubeadmconfig "msg"="Altering ClusterConfiguration" "name"="test1-7cb7s" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "ClusterName"="test1"
I0104 16:18:32.684865 1 kubeadmconfig_controller.go:891] controller/kubeadmconfig "msg"="Altering ClusterConfiguration" "name"="test1-7cb7s" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "ServiceSubnet"="10.96.0.0/12"
I0104 16:18:32.685024 1 kubeadmconfig_controller.go:897] controller/kubeadmconfig "msg"="Altering ClusterConfiguration" "name"="test1-7cb7s" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "PodSubnet"="192.168.0.0/18"
I0104 16:18:32.685059 1 kubeadmconfig_controller.go:904] controller/kubeadmconfig "msg"="Altering ClusterConfiguration" "name"="test1-7cb7s" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "KubernetesVersion"="v1.22.3"
I0104 16:18:33.223154 1 kubeadmconfig_controller.go:380] controller/kubeadmconfig "msg"="Creating BootstrapData for the init control plane" "kind"="Machine" "name"="test1-rbbds" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="190958"
I0104 16:18:33.390217 1 kubeadmconfig_controller.go:943] controller/kubeadmconfig "msg"="bootstrap data secret for KubeadmConfig already exists, updating" "name"="test1-7cb7s" "namespace"="metal3" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "KubeadmConfig"="test1-7cb7s" "secret"="test1-7cb7s"
$ kubectl logs -n capi-kubeadm-control-plane-system capi-kubeadm-control-plane-controller-manager-8d465b66d-l26rw
E0104 16:40:28.527314 1 controller.go:317] controller/kubeadmcontrolplane "msg"="Reconciler error" "error"="failed to create remote cluster client: error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"metal3/test1\": Get \"https://192.168.111.249:6443/api?timeout=10s\": dial tcp 192.168.111.249:6443: connect: no route to host" "name"="test1" "namespace"="metal3" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"
I0104 16:48:09.407426 1 controller.go:251] controller/kubeadmcontrolplane "msg"="Reconcile KubeadmControlPlane" "cluster"="test1" "name"="test1" "namespace"="metal3" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"
E0104 16:48:13.900424 1 controller.go:188] controller/kubeadmcontrolplane "msg"="Failed to update KubeadmControlPlane Status" "error"="failed to create remote cluster client: error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"metal3/test1\": Get \"https://192.168.111.249:6443/api?timeout=10s\": dial tcp 192.168.111.249:6443: connect: no route to host" "cluster"="test1" "name"="test1" "namespace"="metal3" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"
E0104 16:48:13.905605 1 controller.go:317] controller/kubeadmcontrolplane "msg"="Reconciler error" "error"="failed to create remote cluster client: error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"metal3/test1\": Get \"https://192.168.111.249:6443/api?timeout=10s\": dial tcp 192.168.111.249:6443: connect: no route to host" "name"="test1" "namespace"="metal3" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"
I0104 16:49:32.714146 1 controller.go:251] controller/kubeadmcontrolplane "msg"="Reconcile KubeadmControlPlane" "cluster"="test1" "name"="test1" "namespace"="metal3" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"
E0104 16:49:38.952392 1 controller.go:188] controller/kubeadmcontrolplane "msg"="Failed to update KubeadmControlPlane Status" "error"="failed to create remote cluster client: error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"metal3/test1\": Get \"https://192.168.111.249:6443/api?timeout=10s\": dial tcp 192.168.111.249:6443: connect: no route to host" "cluster"="test1" "name"="test1" "namespace"="metal3" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"
E0104 16:49:38.955223 1 controller.go:317] controller/kubeadmcontrolplane "msg"="Reconciler error" "error"="failed to create remote cluster client: error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"metal3/test1\": Get \"https://192.168.111.249:6443/api?timeout=10s\": dial tcp 192.168.111.249:6443: connect: no route to host" "name"="test1" "namespace"="metal3" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"
$ kubectl logs -n capi-system capi-controller-manager-69d4577477-2zjg4
I0104 16:51:19.098737 1 machine_controller_phases.go:221] controller/machine "msg"="Bootstrap provider is not ready, requeuing" "cluster"="test1" "name"="test1-6d8cc5965f-zstvg" "namespace"="metal3" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
I0104 16:51:19.104282 1 machine_controller_phases.go:283] controller/machine "msg"="Infrastructure provider is not ready, requeuing" "cluster"="test1" "name"="test1-6d8cc5965f-zstvg" "namespace"="metal3" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
I0104 16:51:19.104348 1 machine_controller_noderef.go:49] controller/machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "cluster"="test1" "machine"="test1-6d8cc5965f-zstvg" "name"="test1-6d8cc5965f-zstvg" "namespace"="metal3" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
I0104 16:51:41.990234 1 machine_controller_phases.go:283] controller/machine "msg"="Infrastructure provider is not ready, requeuing" "cluster"="test1" "name"="test1-rbbds" "namespace"="metal3" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
I0104 16:51:41.990308 1 machine_controller_noderef.go:49] controller/machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "cluster"="test1" "machine"="test1-rbbds" "name"="test1-rbbds" "namespace"="metal3" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
I0104 16:51:49.135914 1 machine_controller_phases.go:221] controller/machine "msg"="Bootstrap provider is not ready, requeuing" "cluster"="test1" "name"="test1-6d8cc5965f-zstvg" "namespace"="metal3" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
I0104 16:51:49.154411 1 machine_controller_phases.go:283] controller/machine "msg"="Infrastructure provider is not ready, requeuing" "cluster"="test1" "name"="test1-6d8cc5965f-zstvg" "namespace"="metal3" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
I0104 16:51:49.154478 1 machine_controller_noderef.go:49] controller/machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "cluster"="test1" "machine"="test1-6d8cc5965f-zstvg" "name"="test1-6d8cc5965f-zstvg" "namespace"="metal3" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
$ kubectl logs -n capm3-system capm3-controller-manager-dbdc85bbf-5phc5
I0104 16:52:24.776285 1 metal3machine_manager.go:1702] controllers/Metal3Machine/Metal3Machine-controller "msg"="error while retrieving nodes with label (metal3.io/uuid=93da9a56-c483-48c3-bdb9-bd8d206c80bc): Get \"https://192.168.111.249:6443/api/v1/nodes?labelSelector=metal3.io%2Fuuid%3D93da9a56-c483-48c3-bdb9-bd8d206c80bc\": dial tcp 192.168.111.249:6443: connect: no route to host" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:52:24.776336 1 metal3machine_manager.go:1295] controllers/Metal3Machine/Metal3Machine-controller "msg"="error retrieving node, requeuing" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:52:31.869251 1 metal3machine_manager.go:682] controllers/Metal3Machine/Metal3Machine-controller "msg"="Updating machine" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:52:31.869835 1 metal3machine_manager.go:1134] controllers/Metal3Machine/Metal3Machine-controller "msg"="Deleting nodeReuseLabelName from host, if any" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:52:31.872542 1 metal3machine_manager.go:741] controllers/Metal3Machine/Metal3Machine-controller "msg"="Finished updating machine" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:52:33.785220 1 metal3labelsync_controller.go:142] controllers/Metal3LabelSync/metal3-label-sync-controller "msg"="Could not find Node Ref on Machine object, will retry" "metal3-label-sync"={"Namespace":"metal3","Name":"node-7"}
I0104 16:52:34.952333 1 metal3machine_manager.go:1702] controllers/Metal3Machine/Metal3Machine-controller "msg"="error while retrieving nodes with label (metal3.io/uuid=93da9a56-c483-48c3-bdb9-bd8d206c80bc): Get \"https://192.168.111.249:6443/api/v1/nodes?labelSelector=metal3.io%2Fuuid%3D93da9a56-c483-48c3-bdb9-bd8d206c80bc\": dial tcp 192.168.111.249:6443: connect: no route to host" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:52:34.952393 1 metal3machine_manager.go:1295] controllers/Metal3Machine/Metal3Machine-controller "msg"="error retrieving node, requeuing" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:52:59.272015 1 metal3machinetemplate_manager.go:65] controllers/Metal3MachineTemplate/Metal3MachineTemplate-controller "msg"="Fetching metal3Machine objects" "metal3-machine-template"={"Namespace":"metal3","Name":"test1-controlplane"}
I0104 16:52:59.272818 1 metal3machinetemplate_manager.go:65] controllers/Metal3MachineTemplate/Metal3MachineTemplate-controller "msg"="Fetching metal3Machine objects" "metal3-machine-template"={"Namespace":"metal3","Name":"test1-workers"}
I0104 16:53:03.788341 1 metal3labelsync_controller.go:142] controllers/Metal3LabelSync/metal3-label-sync-controller "msg"="Could not find Node Ref on Machine object, will retry" "metal3-label-sync"={"Namespace":"metal3","Name":"node-7"}
I0104 16:53:04.956664 1 metal3machine_manager.go:682] controllers/Metal3Machine/Metal3Machine-controller "msg"="Updating machine" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:53:04.957191 1 metal3machine_manager.go:1134] controllers/Metal3Machine/Metal3Machine-controller "msg"="Deleting nodeReuseLabelName from host, if any" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:53:04.958812 1 metal3machine_manager.go:741] controllers/Metal3Machine/Metal3Machine-controller "msg"="Finished updating machine" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:53:08.040312 1 metal3machine_manager.go:1702] controllers/Metal3Machine/Metal3Machine-controller "msg"="error while retrieving nodes with label (metal3.io/uuid=93da9a56-c483-48c3-bdb9-bd8d206c80bc): Get \"https://192.168.111.249:6443/api/v1/nodes?labelSelector=metal3.io%2Fuuid%3D93da9a56-c483-48c3-bdb9-bd8d206c80bc\": dial tcp 192.168.111.249:6443: connect: no route to host" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:53:08.040379 1 metal3machine_manager.go:1295] controllers/Metal3Machine/Metal3Machine-controller "msg"="error retrieving node, requeuing" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:53:21.291599 1 metal3machine_manager.go:682] controllers/Metal3Machine/Metal3Machine-controller "msg"="Updating machine" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:53:21.292213 1 metal3machine_manager.go:1134] controllers/Metal3Machine/Metal3Machine-controller "msg"="Deleting nodeReuseLabelName from host, if any" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:53:21.293917 1 metal3machine_manager.go:741] controllers/Metal3Machine/Metal3Machine-controller "msg"="Finished updating machine" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:53:24.360350 1 metal3machine_manager.go:1702] controllers/Metal3Machine/Metal3Machine-controller "msg"="error while retrieving nodes with label (metal3.io/uuid=93da9a56-c483-48c3-bdb9-bd8d206c80bc): Get \"https://192.168.111.249:6443/api/v1/nodes?labelSelector=metal3.io%2Fuuid%3D93da9a56-c483-48c3-bdb9-bd8d206c80bc\": dial tcp 192.168.111.249:6443: connect: no route to host" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:53:24.360417 1 metal3machine_manager.go:1295] controllers/Metal3Machine/Metal3Machine-controller "msg"="error retrieving node, requeuing" "cluster"="test1" "machine"="test1-rbbds" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-rrspl"}
I0104 16:53:33.791593 1 metal3labelsync_controller.go:142] controllers/Metal3LabelSync/metal3-label-sync-controller "msg"="Could not find Node Ref on Machine object, will retry" "metal3-label-sync"={"Namespace":"metal3","Name":"node-7"}
$ kubectl logs -n capm3-system ipam-controller-manager-68bb9b98b4-jrshs
I0104 16:49:23.796671 1 ippool_manager.go:106] controllers/IPPool/IPPool-controller "msg"="Fetching IPAddress objects" "metal3-ippool"={"Namespace":"metal3","Name":"provisioning-pool"}
I0104 16:49:23.809212 1 ippool_manager.go:106] controllers/IPPool/IPPool-controller "msg"="Fetching IPAddress objects" "metal3-ippool"={"Namespace":"metal3","Name":"baremetalv4-pool"}
I0104 16:49:23.890712 1 ippool_manager.go:106] controllers/IPPool/IPPool-controller "msg"="Fetching IPAddress objects" "metal3-ippool"={"Namespace":"metal3","Name":"provisioning-pool"}
I0104 16:49:23.892852 1 ippool_manager.go:106] controllers/IPPool/IPPool-controller "msg"="Fetching IPAddress objects" "metal3-ippool"={"Namespace":"metal3","Name":"baremetalv4-pool"}
# cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
@faisalchishtii thanks.
I noticed errors in KCP logs:
E0104 16:48:13.900424 1 controller.go:188] controller/kubeadmcontrolplane "msg"="Failed to update KubeadmControlPlane Status" "error"="failed to create remote cluster client: error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"metal3/test1\": Get \"https://192.168.111.249:6443/api?timeout=10s\": dial tcp 192.168.111.249:6443: connect: no route to host" "cluster"="test1" "name"="test1" "namespace"="metal3" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"
E0104 16:48:13.905605 1 controller.go:317] controller/kubeadmcontrolplane "msg"="Reconciler error" "error"="failed to create remote cluster client: error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"metal3/test1\": Get \"https://192.168.111.249:6443/api?timeout=10s\": dial tcp 192.168.111.249:6443: connect: no route to host" "name"="test1" "namespace"="metal3" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"
Looks like the API server is not reachable. Can you please check whether the kubelet is running in the cluster, and share the output of kubectl describe kcp <kcp-name> -n <namespace>?
Also, can you confirm that you meet the minimum requirements for the host where you are running m3-dev-env, and that you have exported all required environment variables before running make?
Since it is an Ubuntu host, you should export:
export IMAGE_OS=Ubuntu
export CONTAINER_RUNTIME=docker
export EPHEMERAL_CLUSTER=kind
as well as the CAPI/CAPM3 API versions you would like to use; otherwise they default to v1beta1 (env vars CAPI_VERSION and CAPM3_VERSION).
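Putting those together, a complete environment for this setup might look like the following sketch (the two version variables are shown with their defaults mentioned above; only override them if you need a different API version):

```shell
# Environment for running the dev env on an Ubuntu host.
export IMAGE_OS=Ubuntu            # target host image OS
export CONTAINER_RUNTIME=docker   # container runtime used by the dev env
export EPHEMERAL_CLUSTER=kind     # flavor of the ephemeral/management cluster
# CAPI/CAPM3 API versions; v1beta1 is the default for both:
export CAPI_VERSION=v1beta1
export CAPM3_VERSION=v1beta1
```

These need to be exported in the same shell session that later runs make.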
Thank you @furkatgofurov7
I cleaned up using make clean and retried after exporting the variables as you suggested, but the API server is still unreachable.
cluster: error creating dynamic rest mapper for remote cluster \"metal3/test1\": Get \"https://192.168.111.249:6443/api?timeout=10s\": dial tcp 192.168.111.249:6443: connect: no route to host" "name"="test1" "namespace"="metal3" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"
The kubelet on the management cluster (kind cluster) is running okay.
$ps -ef | grep kubelet
root 1058290 1057488 11 09:29 ? 00:00:01 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock --fail-swap-on=false --node-ip=172.20.0.2 --node-labels= --pod-infra-container-image=k8s.gcr.io/pause:3.5 --provider-id=kind://docker/kind/kind-control-plane --fail-swap-on=false --cgroup-root=/kubele
The kubelet on the test1-gpjn6 machine is failing to start:
[metal3@test1-gpjn6 ~]$ sudo tail -f /var/log/messages
Jan 6 08:18:39 test1-gpjn6 kubelet[6761]: E0106 08:18:39.219920 6761 server.go:206] "Failed to load kubelet config file" err="failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file \"/var/lib/kubelet/config.yaml\", error: open /var/lib/kubelet/config.yaml: no such file or directory" path="/var/lib/kubelet/config.yaml"
Jan 6 08:18:39 test1-gpjn6 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Jan 6 08:18:39 test1-gpjn6 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Jan 6 08:18:49 test1-gpjn6 systemd[1]: kubelet.service: Service RestartSec=10s expired, scheduling restart.
Jan 6 08:18:49 test1-gpjn6 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 85.
Jan 6 08:18:49 test1-gpjn6 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Jan 6 08:18:49 test1-gpjn6 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Jan 6 08:18:49 test1-gpjn6 kubelet[6778]: E0106 08:18:49.444436 6778 server.go:206] "Failed to load kubelet config file" err="failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file \"/var/lib/kubelet/config.yaml\", error: open /var/lib/kubelet/config.yaml: no such file or directory" path="/var/lib/kubelet/config.yaml"
Jan 6 08:18:49 test1-gpjn6 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Jan 6 08:18:49 test1-gpjn6 systemd[1]: kubelet.service: Failed with result 'exit-code'.
It looks like it is stuck while running kubeadm init. I checked the cloud-init logs and it seems to be failing due to timeout errors. See the complete cloud-init log below.
$ kubectl get kcp -A
NAMESPACE NAME CLUSTER INITIALIZED API SERVER AVAILABLE REPLICAS READY UPDATED UNAVAILABLE AGE VERSION
metal3 test1 test1 1 1 1 11h v1.22.3
adam@nodek9:~$ kubectl describe kcp test1 -n metal3
Name: test1
Namespace: metal3
Labels: cluster.x-k8s.io/cluster-name=test1
Annotations: <none>
API Version: controlplane.cluster.x-k8s.io/v1beta1
Kind: KubeadmControlPlane
Metadata:
Creation Timestamp: 2022-01-05T15:45:45Z
Finalizers:
kubeadm.controlplane.cluster.x-k8s.io
Generation: 1
Managed Fields:
API Version: controlplane.cluster.x-k8s.io/v1beta1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:kubectl.kubernetes.io/last-applied-configuration:
f:spec:
.:
f:kubeadmConfigSpec:
.:
f:clusterConfiguration:
f:files:
f:initConfiguration:
.:
f:nodeRegistration:
.:
f:kubeletExtraArgs:
.:
f:cgroup-driver:
f:container-runtime:
f:container-runtime-endpoint:
f:feature-gates:
f:node-labels:
f:provider-id:
f:runtime-request-timeout:
f:name:
f:joinConfiguration:
.:
f:controlPlane:
f:nodeRegistration:
.:
f:kubeletExtraArgs:
.:
f:cgroup-driver:
f:container-runtime:
f:container-runtime-endpoint:
f:feature-gates:
f:node-labels:
f:provider-id:
f:runtime-request-timeout:
f:name:
f:postKubeadmCommands:
f:preKubeadmCommands:
f:users:
f:machineTemplate:
.:
f:infrastructureRef:
.:
f:apiVersion:
f:kind:
f:name:
f:nodeDrainTimeout:
f:replicas:
f:rolloutStrategy:
.:
f:rollingUpdate:
.:
f:maxSurge:
f:version:
Manager: kubectl-client-side-apply
Operation: Update
Time: 2022-01-05T15:45:45Z
API Version: controlplane.cluster.x-k8s.io/v1beta1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:finalizers:
.:
v:"kubeadm.controlplane.cluster.x-k8s.io":
f:labels:
.:
f:cluster.x-k8s.io/cluster-name:
f:ownerReferences:
.:
k:{"uid":"228583f0-855a-4913-bf32-1316b16260fb"}:
Manager: manager
Operation: Update
Time: 2022-01-05T15:45:48Z
API Version: controlplane.cluster.x-k8s.io/v1beta1
Fields Type: FieldsV1
fieldsV1:
f:status:
.:
f:conditions:
f:observedGeneration:
f:replicas:
f:selector:
f:unavailableReplicas:
f:updatedReplicas:
Manager: manager
Operation: Update
Subresource: status
Time: 2022-01-05T15:46:00Z
Owner References:
API Version: cluster.x-k8s.io/v1beta1
Block Owner Deletion: true
Controller: true
Kind: Cluster
Name: test1
UID: 228583f0-855a-4913-bf32-1316b16260fb
Resource Version: 18293
UID: 308cdc39-4e60-4c34-9776-c24a61bdfcda
Spec:
Kubeadm Config Spec:
Cluster Configuration:
API Server:
Controller Manager:
Dns:
Etcd:
Networking:
Scheduler:
Files:
Content: #!/bin/bash
set -e
url="$1"
dst="$2"
filename="$(basename $url)"
tmpfile="/tmp/$filename"
curl -sSL -w "%{http_code}" "$url" | sed "s:/usr/bin:/usr/local/bin:g" > /tmp/"$filename"
http_status=$(cat "$tmpfile" | tail -n 1)
if [ "$http_status" != "200" ]; then
echo "Error: unable to retrieve $filename file";
exit 1;
else
cat "$tmpfile"| sed '$d' > "$dst";
fi
Owner: root:root
Path: /usr/local/bin/retrieve.configuration.files.sh
Permissions: 0755
Content: #!/bin/bash
while :; do
curl -sk https://127.0.0.1:6443/healthz 1>&2 > /dev/null
isOk=$?
isActive=$(systemctl show -p ActiveState keepalived.service | cut -d'=' -f2)
if [ $isOk == "0" ] && [ $isActive != "active" ]; then
logger 'API server is healthy, however keepalived is not running, starting keepalived'
echo 'API server is healthy, however keepalived is not running, starting keepalived'
sudo systemctl start keepalived.service
elif [ $isOk != "0" ] && [ $isActive == "active" ]; then
logger 'API server is not healthy, however keepalived running, stopping keepalived'
echo 'API server is not healthy, however keepalived running, stopping keepalived'
sudo systemctl stop keepalived.service
fi
sleep 5
done
Owner: root:root
Path: /usr/local/bin/monitor.keepalived.sh
Permissions: 0755
Content: [Unit]
Description=Monitors keepalived adjusts status with that of API server
After=syslog.target network-online.target
[Service]
Type=simple
Restart=always
ExecStart=/usr/local/bin/monitor.keepalived.sh
[Install]
WantedBy=multi-user.target
Owner: root:root
Path: /lib/systemd/system/monitor.keepalived.service
Content: ! Configuration File for keepalived
global_defs {
notification_email {
sysadmin@example.com
support@example.com
}
notification_email_from lb@example.com
smtp_server localhost
smtp_connect_timeout 30
}
vrrp_instance VI_1 {
state MASTER
interface eth1
virtual_router_id 1
priority 101
advert_int 1
virtual_ipaddress {
192.168.111.249
}
}
Path: /etc/keepalived/keepalived.conf
Content: BOOTPROTO=none
DEVICE=eth0
ONBOOT=yes
TYPE=Ethernet
USERCTL=no
BRIDGE=ironicendpoint
Owner: root:root
Path: /etc/sysconfig/network-scripts/ifcfg-eth0
Permissions: 0644
Content: TYPE=Bridge
DEVICE=ironicendpoint
ONBOOT=yes
USERCTL=no
BOOTPROTO="static"
IPADDR={{ ds.meta_data.provisioningIP }}
PREFIX={{ ds.meta_data.provisioningCIDR }}
Owner: root:root
Path: /etc/sysconfig/network-scripts/ifcfg-ironicendpoint
Permissions: 0644
Content: [kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=0
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
Owner: root:root
Path: /etc/yum.repos.d/kubernetes.repo
Permissions: 0644
Content: [registries.search]
registries = ['docker.io']
[registries.insecure]
registries = ['192.168.111.1:5000']
Path: /etc/containers/registries.conf
Init Configuration:
Local API Endpoint:
Node Registration:
Kubelet Extra Args:
Cgroup - Driver: systemd
Container - Runtime: remote
Container - Runtime - Endpoint: unix:///var/run/crio/crio.sock
Feature - Gates: AllAlpha=false
Node - Labels: metal3.io/uuid={{ ds.meta_data.uuid }}
Provider - Id: metal3://{{ ds.meta_data.uuid }}
Runtime - Request - Timeout: 5m
Name: {{ ds.meta_data.name }}
Join Configuration:
Control Plane:
Local API Endpoint:
Discovery:
Node Registration:
Kubelet Extra Args:
Cgroup - Driver: systemd
Container - Runtime: remote
Container - Runtime - Endpoint: unix:///var/run/crio/crio.sock
Feature - Gates: AllAlpha=false
Node - Labels: metal3.io/uuid={{ ds.meta_data.uuid }}
Provider - Id: metal3://{{ ds.meta_data.uuid }}
Runtime - Request - Timeout: 5m
Name: {{ ds.meta_data.name }}
Post Kubeadm Commands:
mkdir -p /home/metal3/.kube
cp /etc/kubernetes/admin.conf /home/metal3/.kube/config
chown metal3:metal3 /home/metal3/.kube/config
Pre Kubeadm Commands:
systemctl restart NetworkManager.service
ifup eth0
systemctl enable --now crio keepalived kubelet
systemctl link /lib/systemd/system/monitor.keepalived.service
systemctl enable monitor.keepalived.service
systemctl start monitor.keepalived.service
Users:
Name: metal3
Ssh Authorized Keys:
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDgLACiOeo0CyvVCbVVlL971th+w3uKvMskhSsPIkkBwQl+g+/QnfSxZitK/Ihzpiom5jkOA5T7GDm7mm+xvk+55lDa7ahm90sutD4/SaDw7ylLEUQc4EF20MtddckfpkazDKRZx0Yt6PJxN54INsDxDz3D9bklXEGzjh1OBFw08ayWQC+f4JZ9FUvIMaemAWRzCdBgeiECzIZZhB4d/fgFSGMG/RUG+sXAAHRXY0/oTGmlVvCzJsPy4yoR3rKxONzmyOK1qfHRHqr+pCzKD9abV7A9iyIiDwlTUclNYRCHE2laiYcGjje9IZAr/sPzmMYwwu+yvyxcLEPtXrHo/nyyGAc0TBu676tioMc1NnjLMcE/bRhI+zeCw0qhqFR6XTx0oMxrf8ofhJjC4t4RDvmqRDxV9gmcRBvLRJe82YAlHAX8GRB3hfLtpnxYnrFT/SqX+lyJspyinauNoU9OWB2ejCbEf0V4Ot8j/AlN7Fkr4A1PNV3qV001YEXDi0j4nR0= adam@nodek9
Sudo: ALL=(ALL) NOPASSWD:ALL
Machine Template:
Infrastructure Ref:
API Version: infrastructure.cluster.x-k8s.io/v1beta1
Kind: Metal3MachineTemplate
Name: test1-controlplane
Namespace: metal3
Metadata:
Node Drain Timeout: 0s
Replicas: 1
Rollout Strategy:
Rolling Update:
Max Surge: 1
Type: RollingUpdate
Version: v1.22.3
Status:
Conditions:
Last Transition Time: 2022-01-05T15:48:16Z
Message: 1 of 2 completed
Reason: DrainingFailed @ Machine/test1-qgkdh
Severity: Error
Status: False
Type: Ready
Last Transition Time: 2022-01-05T15:45:50Z
Reason: WaitingForKubeadmInit
Severity: Info
Status: False
Type: Available
Last Transition Time: 2022-01-05T15:45:50Z
Status: True
Type: CertificatesAvailable
Last Transition Time: 2022-01-05T15:45:57Z
Status: True
Type: MachinesCreated
Last Transition Time: 2022-01-05T15:48:16Z
Message: 1 of 2 completed
Reason: DrainingFailed @ Machine/test1-qgkdh
Severity: Error
Status: False
Type: MachinesReady
Last Transition Time: 2022-01-05T15:46:00Z
Status: True
Type: Resized
Observed Generation: 1
Replicas: 1
Selector: cluster.x-k8s.io/cluster-name=test1,cluster.x-k8s.io/control-plane
Unavailable Replicas: 1
Updated Replicas: 1
Events: <none>
@faisalchishtii this is helpful, thanks. As predicted, for some reason, kubelet is not running properly on that machine and from the cloud-init logs you have attached I found this:
error execution phase preflight: [preflight] Some fatal errors occurred: [ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-apiserver:v1.22.3: output: time="2022-01-06T09:44:13Z" level=fatal msg="pulling image: rpc error: code = Unknown desc = pinging container registry k8s.gcr.io: Get \"https://k8s.gcr.io/v2/\": dial tcp: lookup k8s.gcr.io on 8.8.4.4:53: read udp 192.168.111.100:60557->8.8.4.4:53: i/o timeout" , error: exit status 1 [ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-controller-manager:v1.22.3: output: time="2022-01-06T09:49:13Z" level=fatal msg="pulling image: rpc error: code = Unknown desc = pinging container registry k8s.gcr.io: Get \"https://k8s.gcr.io/v2/\": dial tcp: lookup k8s.gcr.io on 8.8.4.4:53: read udp 192.168.111.100:56166->8.8.4.4:53: i/o timeout" , error: exit status 1 [ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-scheduler:v1.22.3: output: time="2022-01-06T09:54:13Z" level=fatal msg="pulling image: rpc error: code = Unknown desc = pinging container registry k8s.gcr.io: Get \"https://k8s.gcr.io/v2/\": dial tcp: lookup k8s.gcr.io on 8.8.4.4:53: read udp 192.168.111.100:32973->8.8.4.4:53: i/o timeout" , error: exit status 1 [ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-proxy:v1.22.3: output: time="2022-01-06T09:59:13Z" level=fatal msg="pulling image: rpc error: code = Unknown desc = pinging container registry k8s.gcr.io: Get \"https://k8s.gcr.io/v2/\": dial tcp: lookup k8s.gcr.io on 8.8.4.4:53: read udp 192.168.111.100:50290->8.8.4.4:53: i/o timeout" , error: exit status 1 [ERROR ImagePull]: failed to pull image k8s.gcr.io/pause:3.5: output: time="2022-01-06T10:04:14Z" level=fatal msg="pulling image: rpc error: code = Unknown desc = pinging container registry k8s.gcr.io: Get \"https://k8s.gcr.io/v2/\": dial tcp: lookup k8s.gcr.io on 8.8.4.4:53: read udp 192.168.111.100:36564->8.8.4.4:53: i/o timeout" , error: exit status 1 [ERROR ImagePull]: failed to pull image k8s.gcr.io/etcd:3.5.0-0: output: 
time="2022-01-06T10:09:14Z" level=fatal msg="pulling image: rpc error: code = Unknown desc = pinging container registry k8s.gcr.io: Get \"https://k8s.gcr.io/v2/\": dial tcp: lookup k8s.gcr.io on 8.8.4.4:53: read udp 192.168.111.100:43208->8.8.4.4:53: i/o timeout" , error: exit status 1 [ERROR ImagePull]: failed to pull image k8s.gcr.io/coredns/coredns:v1.8.4: output: time="2022-01-06T10:14:14Z" level=fatal msg="pulling image: rpc error: code = Unknown desc = pinging container registry k8s.gcr.io: Get \"https://k8s.gcr.io/v2/\": dial tcp: lookup k8s.gcr.io on 8.8.4.4:53: read udp 192.168.111.100:46906->8.8.4.4:53: i/o timeout" , error: exit status 1
While doing kubeadm init on the machine, it failed to pull images from the k8s container registry, and that caused the kubeadm failure. Also, the error message that the kubelet logs are complaining with:
Failed to load kubelet config file" err="failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file \"/var/lib/kubelet/config.yaml\", error: open /var/lib/kubelet/config.yaml: no such file or directory" path="/var/lib/kubelet/config.yaml"
is a well-known issue reported several times in upstream k8s repositories, e.g. https://github.com/kubernetes/kubernetes/issues/73779 and https://github.com/kubernetes/kubernetes/issues/65863. When kubeadm init/join is called, the /var/lib/kubelet/config.yaml file should be created and the kubelet should start properly. Without that file the kubelet keeps failing, and the reason the file is never created could be the inability to pull images from the k8s container registry.
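As a quick triage step on the failing machine, a check like the following shows whether kubeadm init got far enough to write that file. The check_kubelet_cfg helper is hypothetical (not part of kubeadm or metal3), just a sketch:

```shell
# Report whether the kubelet config that kubeadm init should create exists.
check_kubelet_cfg() {
  if [ -f "$1" ]; then
    echo "present"
  else
    echo "missing"
  fi
}

# "missing" here means kubeadm init likely never completed its kubelet phase.
check_kubelet_cfg /var/lib/kubelet/config.yaml
```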
That being said, it does not seem to be a metal3 issue, but rather an issue with the environment where you are running make. The next question is: are you by any chance behind a corporate proxy? Could you try the solutions suggested in this thread to fix pulling images during kubeadm init, and once we make sure it can pull images properly, re-run metal3-dev-env?
Also, you can verify if all needed images can be pulled successfully by running: kubeadm config images pull
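Since the pulls fail at the DNS lookup step (8.8.4.4 in the errors above), it can also help to list which nameservers the node is actually using before retrying the pull. A minimal sketch with a hypothetical list_nameservers helper that reads resolv.conf:

```shell
# Print the nameserver IPs a node will use for registry lookups,
# taken from a resolv.conf-style file passed as the first argument.
list_nameservers() {
  awk '/^nameserver/ {print $2}' "$1"
}

list_nameservers /etc/resolv.conf
```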
It turned out that DNS resolution was failing on the target machines/BMH hosts (CentOS VMs created by libvirt) even though it was configured correctly. While I couldn't find the root cause, I was able to fix it temporarily by manually adding IPs for k8s.gcr.io and a few other domains to /etc/hosts.
This is how my /etc/hosts looks to make it work temporarily:
$ sudo cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 airship-ci-centos-node-img-a85794d843
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
142.251.10.82 k8s.gcr.io
34.107.204.206 dl.k8s.io
142.250.193.132 googleapis.com
172.217.160.144 storage.googleapis.com
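The manual edits above can also be scripted idempotently, so re-running provisioning does not duplicate entries. A sketch using a scratch file; the add_host helper is hypothetical, and for real use you would point HOSTS_FILE at /etc/hosts (with sudo):

```shell
# Scratch file stands in for /etc/hosts so the sketch is safe to run anywhere.
HOSTS_FILE=$(mktemp)

# Append "IP hostname" to the hosts file only when the hostname is not present.
add_host() {
  grep -q "[[:space:]]$2\$" "$HOSTS_FILE" || printf '%s %s\n' "$1" "$2" >> "$HOSTS_FILE"
}

add_host 142.251.10.82 k8s.gcr.io
add_host 142.251.10.82 k8s.gcr.io   # second call is a no-op
grep -c 'k8s.gcr.io' "$HOSTS_FILE"  # prints 1
```

Note these hardcoded IPs are a stopgap; registry IPs rotate, so fixing the underlying resolver is still the real solution.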
Thanks for all the help @furkatgofurov7
No problem. Closing this since it has been a problem with an environment, please feel free to reopen it if you want to discuss more on this issue.
/close
@furkatgofurov7: Closing this issue.
I ran the make command and it completed okay. Then I tried to provision a cluster using the below scripts:
These completed too, but the two machines are stuck in Pending and Provisioning state.