Zarquan closed this issue 4 months ago.
Where are you deploying from, and how? I can review the deployment log files and the OpenStack logs on the system. It would be useful to have the timestamps of an attempted deployment.
However, given the timing, it is likely that the OpenStack upgrade to Antelope is a confounding factor, if not the direct cause. We're still awaiting confirmation that the upgrade is complete.
I can run the deploy script whenever is convenient for you; the full delete-all and create-all sequence takes about 10 minutes. Shall we coordinate tests via Slack?
The bootstrap node, control plane, and worker nodes all deploy successfully and are responsive in Horizon. Console logs suggest the Kubernetes cluster initialised on the control plane and was joined by the worker nodes successfully. It looks like a communication error between the bootstrap node and the other nodes is blocking the healthcheck pings.
[ 111.857437] cloud-init[1022]: [2024-01-17 15:09:49] [bootstrap-token] Configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[ 111.857679] cloud-init[1022]: [2024-01-17 15:09:50] [bootstrap-token] Creating the "cluster-info" ConfigMap in the "kube-public" namespace
[ 111.857898] cloud-init[1022]: [2024-01-17 15:09:50] [kubelet-finalize] Updating "/etc/kubernetes/kubelet.conf" to point to a rotatable kubelet client certificate and key
[ 111.858123] cloud-init[1022]: [2024-01-17 15:09:53] [addons] Applied essential addon: CoreDNS
[ 111.858357] cloud-init[1022]: [2024-01-17 15:09:54] [addons] Applied essential addon: kube-proxy
[ 111.858636] cloud-init[1022]: [2024-01-17 15:09:54]
[ 111.858813] cloud-init[1022]: [2024-01-17 15:09:54] Your Kubernetes control-plane has initialized successfully!
[ 111.859046] cloud-init[1022]: [2024-01-17 15:09:54]
[ 111.859302] cloud-init[1022]: [2024-01-17 15:09:54] To start using your cluster, you need to run the following as a regular user:
[ 111.859548] cloud-init[1022]: [2024-01-17 15:09:54]
[ 111.859779] cloud-init[1022]: [2024-01-17 15:09:54] mkdir -p $HOME/.kube
[ 111.860024] cloud-init[1022]: [2024-01-17 15:09:54] sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
[ 111.860242] cloud-init[1022]: [2024-01-17 15:09:54] sudo chown $(id -u):$(id -g) $HOME/.kube/config
[ 111.860625] cloud-init[1022]: [2024-01-17 15:09:54]
[ 111.860923] cloud-init[1022]: [2024-01-17 15:09:54] Alternatively, if you are the root user, you can run:
[ 111.861159] cloud-init[1022]: [2024-01-17 15:09:54]
[ 111.861382] cloud-init[1022]: [2024-01-17 15:09:54] export KUBECONFIG=/etc/kubernetes/admin.conf
[ 111.861632] cloud-init[1022]: [2024-01-17 15:09:54]
[ 111.861870] cloud-init[1022]: [2024-01-17 15:09:54] You should now deploy a pod network to the cluster.
[ 111.862123] cloud-init[1022]: [2024-01-17 15:09:54] Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
[ 111.862347] cloud-init[1022]: [2024-01-17 15:09:54] https://kubernetes.io/docs/concepts/cluster-administration/addons/
[ 111.862591] cloud-init[1022]: [2024-01-17 15:09:54]
[ 111.862811] cloud-init[1022]: [2024-01-17 15:09:54] You can now join any number of control-plane nodes by copying certificate authorities
[ 111.863053] cloud-init[1022]: [2024-01-17 15:09:54] and service account keys on each node and then running the following as root:
[ 111.863270] cloud-init[1022]: [2024-01-17 15:09:54]
[ 111.863492] cloud-init[1022]: [2024-01-17 15:09:54] kubeadm join 192.41.122.22:6443 --token fjcs87.4n0vh6ffzqcdly3p \
[ 111.863968] cloud-init[1022]: [2024-01-17 15:09:54] --discovery-token-ca-cert-hash sha256:36de6c58705d63ee7218d8d1826afa03902384207357cbcef26e2c71960d9ccb \
[ 111.934370] cloud-init[1022]: [2024-01-17 15:09:54] --control-plane
[ 111.934537] cloud-init[1022]: [2024-01-17 15:09:54]
[ 111.934762] cloud-init[1022]: [2024-01-17 15:09:54] Then you can join any number of worker nodes by running the following on each as root:
[ 111.935001] cloud-init[1022]: [2024-01-17 15:09:54]
[ 111.935279] cloud-init[1022]: [2024-01-17 15:09:54] kubeadm join 192.41.122.22:6443 --token fjcs87.4n0vh6ffzqcdly3p \
[ 111.935557] cloud-init[1022]: [2024-01-17 15:09:54] --discovery-token-ca-cert-hash sha256:36de6c58705d63ee7218d8d1826afa03902384207357cbcef26e2c71960d9ccb
[ 111.935811] cloud-init[1022]: [2024-01-17 15:09:54] Cloud-init v. 23.3.1-0ubuntu1~22.04.1 finished at Wed, 17 Jan 2024 15:09:54 +0000. Datasource DataSourceOpenStackLocal [net,ver=2]. Up 111.66 seconds
ci-info: | ssh-rsa | d8:69:0c:45:8a:d4:e8:ab:36:1a:a3:ad:54:33:aa:73:99:ea:e6:80:58:45:88:60:de:7e:63:cd:c9:9d:2e:c2 | - | nch@roe.ac.uk |
ci-info: | ssh-rsa | 09:6d:38:28:2e:3a:58:8e:76:9a:c8:af:5b:f5:b7:c6:08:46:aa:90:2c:8c:5f:32:83:fa:93:fc:d5:c5:78:63 | - | a.krause@epcc.ed.ac.uk |
ci-info: +---------+-------------------------------------------------------------------------------------------------+---------+------------------------+
<14>Jan 17 15:31:26 cloud-init: #############################################################
<14>Jan 17 15:31:26 cloud-init: -----BEGIN SSH HOST KEY FINGERPRINTS-----
<14>Jan 17 15:31:26 cloud-init: 1024 SHA256:iGX/l5kVE3yD0apUO8DpBKaOqBMMEWe4bB02FowacZI root@somerville-jade-20240117-work-md-0-c03ed516-xxqqs (DSA)
<14>Jan 17 15:31:26 cloud-init: 256 SHA256:/ctN/C6gOHzTfP1pwq7K2qg12NbXwXJWix+HyfmUQ3Q root@somerville-jade-20240117-work-md-0-c03ed516-xxqqs (ECDSA)
<14>Jan 17 15:31:26 cloud-init: 256 SHA256:PLurhIEdcxOZDj0lxlLNn1+/uEn6Vvl7Cl43d+n5IBk root@somerville-jade-20240117-work-md-0-c03ed516-xxqqs (ED25519)
<14>Jan 17 15:31:26 cloud-init: 3072 SHA256:7Trt6N15vqfqBr6UrktTSvxDm8l0VcBpeblUuv2mXCw root@somerville-jade-20240117-work-md-0-c03ed516-xxqqs (RSA)
<14>Jan 17 15:31:26 cloud-init: -----END SSH HOST KEY FINGERPRINTS-----
<14>Jan 17 15:31:26 cloud-init: #############################################################
-----BEGIN SSH HOST KEY KEYS-----
ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBIrGTfmOdreqNILWQfjI9SoY5Y9Ysc7GHzBmEqpIloOXdU+98EI6DpsueVJgKTSu9HF5ePYXEZeMfLEUfIv9U3g= root@somerville-jade-20240117-work-md-0-c03ed516-xxqqs
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFig6RPd9Pm/ggLMy0zEYPhTkx7UM2ysnYfgbl2Qu/5N root@somerville-jade-20240117-work-md-0-c03ed516-xxqqs
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCzAtVoUfbVYfcYFYiNanrB1Ra3TBwZg3wthPz8zoMM8JjksEeGHUJQSADEwSQfc1vevyXXgcLHBkslw8T46R1XWpXyf9glF/JJkTuKkz3iFvE4SozccJYqejal7GkfXFrl3TVbcX+eq+s0kjyby5oGT+gaCki6OUnFYIkoBXosKiQsGMTqUev8PK7bRqLu/FO9RQvveO6vhRonb/lUU+vfuVMnYTJszdJbrqibdTkf74Xb06Af8YYdSdq5VmHZLTnJeJWEiyLR8qF6nwPhgjqwxa7x8+pyOJd21eKojryWOVW/15DKtdjDTpoPEgAabeftItwMghxR1K7jhJvmWr7vZruAQWyc5sU1A3ZjwLc/KDygnVWUrqZRABbsPwgRpM3ho3N3RLAxkEkZYBtliLLjfYKD0waCXcazz+lo+b6nb3gQ2Asja4GmC/3meSoONTmYFgBeMffvyVpAe2fVSibOhpeX/rM/04LNeIDSwl5AKVPcqTxXO7IECBBiJ9t3xmM= root@somerville-jade-20240117-work-md-0-c03ed516-xxqqs
-----END SSH HOST KEY KEYS-----
[[0;32m OK [0m] Finished [0;1;39mExecute cloud user/final scripts[0m.
[[0;32m OK [0m] Reached target [0;1;39mCloud-init target[0m.
[ 63.501943] cloud-init[1023]: [2024-01-17 15:31:06] Cloud-init v. 23.3.1-0ubuntu1~22.04.1 running 'modules:final' at Wed, 17 Jan 2024 15:31:06 +0000. Up 42.81 seconds.
[ 63.518593] cloud-init[1023]: [2024-01-17 15:31:09] [preflight] Running pre-flight checks
[ 63.520202] cloud-init[1023]: [2024-01-17 15:31:10] [preflight] Reading configuration from the cluster...
[ 63.521839] cloud-init[1023]: [2024-01-17 15:31:10] [preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[ 63.524296] cloud-init[1023]: [2024-01-17 15:31:10] [kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[ 63.526315] cloud-init[1023]: [2024-01-17 15:31:10] [kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[ 63.528634] cloud-init[1023]: [2024-01-17 15:31:10] [kubelet-start] Starting the kubelet
[ 63.530045] cloud-init[1023]: [2024-01-17 15:31:11] [kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
[ 63.531878] cloud-init[1023]: [2024-01-17 15:31:26]
[ 63.532863] cloud-init[1023]: [2024-01-17 15:31:26] This node has joined the cluster:
[ 63.534259] cloud-init[1023]: [2024-01-17 15:31:26] * Certificate signing request was sent to apiserver and a response was received.
[ 63.536392] cloud-init[1023]: [2024-01-17 15:31:26] * The Kubelet was informed of the new secure connection details.
[ 63.538174] cloud-init[1023]: [2024-01-17 15:31:26]
[ 63.539146] cloud-init[1023]: [2024-01-17 15:31:26] Run 'kubectl get nodes' on the control-plane to see this node join the cluster.
[ 63.541210] cloud-init[1023]: [2024-01-17 15:31:26]
[ 63.542135] cloud-init[1023]: [2024-01-17 15:31:27] Cloud-init v. 23.3.1-0ubuntu1~22.04.1 finished at Wed, 17 Jan 2024 15:31:26 +0000. Datasource DataSourceOpenStackLocal [net,ver=2]. Up 63.40 seconds
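If the suspicion is a communication error between the bootstrap node and the other nodes, the join endpoint quoted in the log above (192.41.122.22:6443) can be probed directly from the bootstrap node. A minimal sketch, assuming `curl` is available on the node; on default kubeadm clusters `/healthz` is served to unauthenticated clients:

```shell
# Probe the control-plane endpoint from the kubeadm join command above.
# The address is taken from the quoted log; adjust for the current deployment.
ENDPOINT="192.41.122.22:6443"

if ! command -v curl >/dev/null 2>&1; then
    result="skipped: curl not available"
elif curl -ksf --connect-timeout 5 "https://${ENDPOINT}/healthz" >/dev/null; then
    # -k: the API server certificate is self-signed; -f: fail on HTTP errors.
    result="reachable"
else
    result="unreachable"
fi
echo "${ENDPOINT}: ${result}"
```

If this reports unreachable from a worker node but reachable from the bootstrap node (or vice versa), that would narrow the problem down to network path rather than the cluster software.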
It was suggested that we regenerate the application credentials in use and retry.
Not sure I understand the mechanism here. Broken or out-of-date credentials would cause errors when creating nodes, not communication errors between nodes. Anyway, I created a new set of app credentials and ran through the full delete-all, create-all sequence.
Similar results:
NAME READY SEVERITY REASON SINCE MESSAGE
Cluster/somerville-jade-20240118-work False Warning ScalingUp 27m Scaling up control plane to 3 replicas (actual 1)
├─ClusterInfrastructure - OpenStackCluster/somerville-jade-20240118-work
├─ControlPlane - KubeadmControlPlane/somerville-jade-20240118-work-control-plane False Warning ScalingUp 27m Scaling up control plane to 3 replicas (actual 1)
│ └─Machine/somerville-jade-20240118-work-control-plane-pfrpl False Warning NodeStartupTimeout 15m Node failed to report startup in 10m0s
└─Workers
└─MachineDeployment/somerville-jade-20240118-work-md-0 False Warning WaitingForAvailableMachines 29m Minimum availability requires 2 replicas, current 0 available
└─3 Machines... True 4m57s See somerville-jade-20240118-work-md-0-7nx46-4vlvv, somerville-jade-20240118-work-md-0-7nx46-8h4sh, ...
....
I0118 06:52:29.947730 1 machinehealthcheck_controller.go:433] "Target has failed health check, marking for remediation" controller="machinehealthcheck" controllerGroup="cluster.x-k8s.io" controllerKind="MachineHealthCheck" MachineHealthCheck="default/somerville-jade-20240118-work-control-plane" namespace="default" name="somerville-jade-20240118-work-control-plane" reconcileID="2950f9d6-b0ca-43a5-9f2b-8b3d259f5cbd" Cluster="default/somerville-jade-20240118-work" target="default/somerville-jade-20240118-work-control-plane/somerville-jade-20240118-work-control-plane-pfrpl/" reason="NodeStartupTimeout" message="Node failed to report startup in 10m0s"
I0118 06:52:29.954896 1 recorder.go:104] "events: Machine default/somerville-jade-20240118-work-control-plane/somerville-jade-20240118-work-control-plane-pfrpl/ has been marked as unhealthy" type="Normal" object={"kind":"Machine","namespace":"default","name":"somerville-jade-20240118-work-control-plane-pfrpl","uid":"c898e1a1-4b62-4e16-8e9a-bbd9b7de022b","apiVersion":"cluster.x-k8s.io/v1beta1","resourceVersion":"4829"} reason="MachineMarkedUnhealthy"
....
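To see why the MachineHealthCheck condemned the machine, the Machine conditions can be inspected on the bootstrap cluster. A sketch, assuming the `kindclusterconf` kubeconfig variable used elsewhere in this thread; the fallback path in the script is hypothetical:

```shell
# Inspect CAPI machines and the health check that marked one unhealthy.
KUBECONFIG_FILE="${kindclusterconf:-/opt/aglais/bootstrap-kind.yml}"  # fallback path is hypothetical
MACHINE="somerville-jade-20240118-work-control-plane-pfrpl"           # name from the log above

if command -v kubectl >/dev/null 2>&1 && [ -f "${KUBECONFIG_FILE}" ]; then
    kubectl --kubeconfig "${KUBECONFIG_FILE}" get machines -A -o wide
    kubectl --kubeconfig "${KUBECONFIG_FILE}" get machinehealthcheck -A
    # The Conditions section shows NodeStartupTimeout and any related detail.
    kubectl --kubeconfig "${KUBECONFIG_FILE}" describe machine -n default "${MACHINE}"
    status="inspected"
else
    echo "skipping: kubectl or bootstrap kubeconfig not available"
    status="skipped"
fi
```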
Is this the kind of issue that would be useful to discuss on a Kubernetes users' channel on Slack, or similar?
Problem self-resolved, possibly as a result of Docker Hub rate limits.
If the issue recurs, check whether the cni-calico Helm release is stuck in a pending
state. Check the pods on the tenant cluster, `kubectl describe` any that are failing, and look for rate limits as the reason.
@Zarquan to retry the test with larger flavours.
Back to not working again :confused:
kubectl \
--kubeconfig "${kindclusterconf:?}" \
get helmrelease -A
NAMESPACE NAME CLUSTER BOOTSTRAP TARGET NAMESPACE RELEASE NAME PHASE REVISION CHART NAME CHART VERSION AGE
default somerville-jade-20240123-work-ccm-openstack somerville-jade-20240123-work true openstack-system ccm-openstack Deployed 1 openstack-cloud-controller-manager 1.3.0 16m
default somerville-jade-20240123-work-cni-calico somerville-jade-20240123-work true tigera-operator cni-calico Deployed 1 tigera-operator v3.26.0 16m
default somerville-jade-20240123-work-csi-cinder somerville-jade-20240123-work true openstack-system csi-cinder Installing openstack-cinder-csi 2.2.0 16m
default somerville-jade-20240123-work-kubernetes-dashboard somerville-jade-20240123-work true kubernetes-dashboard kubernetes-dashboard Deployed 1 kubernetes-dashboard 5.10.0 16m
default somerville-jade-20240123-work-mellanox-network-operator somerville-jade-20240123-work true network-operator mellanox-network-operator Installing network-operator 1.3.0 16m
default somerville-jade-20240123-work-metrics-server somerville-jade-20240123-work true kube-system metrics-server Installing metrics-server 3.8.2 16m
default somerville-jade-20240123-work-node-feature-discovery somerville-jade-20240123-work true node-feature-discovery node-feature-discovery Installing node-feature-discovery 0.11.2 16m
default somerville-jade-20240123-work-nvidia-gpu-operator somerville-jade-20240123-work true gpu-operator nvidia-gpu-operator Installing gpu-operator v1.11.1 16m
kubectl \
--kubeconfig "${kindclusterconf:?}" \
get pods \
--all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
capi-kubeadm-bootstrap-system capi-kubeadm-bootstrap-controller-manager-7db568c844-kmbmt 1/1 Running 0 29m
capi-kubeadm-control-plane-system capi-kubeadm-control-plane-controller-manager-7f9b558f5c-5r2mm 1/1 Running 0 29m
capi-system capi-controller-manager-76955c46b9-drqph 1/1 Running 0 29m
capo-system capo-controller-manager-544cb69b9d-njccn 1/1 Running 0 29m
cert-manager cert-manager-66d9545484-5mf9c 1/1 Running 0 31m
cert-manager cert-manager-cainjector-7d8b6bd6fb-p89lw 1/1 Running 0 31m
cert-manager cert-manager-webhook-669b96dcfd-wq5zt 1/1 Running 0 31m
default cluster-api-addon-provider-66cc76bbbf-jmlq2 1/1 Running 0 29m
default somerville-jade-20240123-work-autoscaler-59658d94b6-jxx6d 0/1 CrashLoopBackOff 9 (2m15s ago) 27m
kube-system coredns-5d78c9869d-hrp6n 1/1 Running 0 31m
kube-system coredns-5d78c9869d-nt956 1/1 Running 0 31m
kube-system etcd-somerville-jade-20240123-kind-control-plane 1/1 Running 0 32m
kube-system kindnet-s2f52 1/1 Running 0 31m
kube-system kube-apiserver-somerville-jade-20240123-kind-control-plane 1/1 Running 0 31m
kube-system kube-controller-manager-somerville-jade-20240123-kind-control-plane 1/1 Running 0 32m
kube-system kube-proxy-nz7xd 1/1 Running 0 31m
kube-system kube-scheduler-somerville-jade-20240123-kind-control-plane 1/1 Running 0 32m
local-path-storage local-path-provisioner-6bc4bddd6b-58fs2 1/1 Running 0 31m
kubectl \
--kubeconfig "${workclusterconf:?}" \
get pods \
--all-namespaces
E0123 12:04:43.708937 14021 memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0123 12:04:43.744442 14021 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0123 12:04:43.749007 14021 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0123 12:04:43.753613 14021 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
NAMESPACE NAME READY STATUS RESTARTS AGE
calico-system calico-kube-controllers-6f994d9b59-z7fhs 0/1 Pending 0 23m
calico-system calico-node-5jdwc 1/1 Running 0 3m14s
calico-system calico-node-h7gcn 1/1 Running 0 2m59s
calico-system calico-node-mqgjj 1/1 Running 0 22m
calico-system calico-node-mvtbk 0/1 CrashLoopBackOff 6 (6m2s ago) 13m
calico-system calico-node-nkmtb 1/1 Running 0 3m15s
calico-system calico-node-p5vc8 1/1 Running 0 11m
calico-system calico-node-pzbgq 1/1 Running 0 23m
calico-system calico-node-qkdjb 1/1 Running 0 22m
calico-system calico-node-rj7jk 1/1 Running 0 11m
calico-system calico-node-z8fvd 1/1 Running 0 22m
calico-system calico-typha-588d9ffd8c-6dfw8 1/1 Running 0 22m
calico-system calico-typha-588d9ffd8c-b5mmg 1/1 Running 0 13m
calico-system calico-typha-588d9ffd8c-btgbm 1/1 Running 0 23m
calico-system csi-node-driver-8kxmh 2/2 Running 0 22m
calico-system csi-node-driver-8xmbj 2/2 Running 0 3m14s
calico-system csi-node-driver-9dmxv 0/2 ContainerCreating 0 13m
calico-system csi-node-driver-cztgj 2/2 Running 0 3m15s
calico-system csi-node-driver-ktngs 2/2 Running 0 22m
calico-system csi-node-driver-pg72j 2/2 Running 0 22m
calico-system csi-node-driver-rhx88 2/2 Running 0 11m
calico-system csi-node-driver-rp8l6 2/2 Running 0 11m
calico-system csi-node-driver-vs6nf 2/2 Running 0 2m59s
calico-system csi-node-driver-zchr5 2/2 Running 0 23m
gpu-operator gpu-operator-6c8649c88c-x4pxp 0/1 Pending 0 22m
kube-system coredns-787d4945fb-fvbzn 0/1 Pending 0 24m
kube-system coredns-787d4945fb-lcvkz 0/1 Pending 0 24m
kube-system etcd-somerville-jade-20240123-work-control-plane-aa1b69e0-s5ncb 1/1 Running 0 24m
kube-system kube-apiserver-somerville-jade-20240123-work-control-plane-aa1b69e0-s5ncb 1/1 Running 0 24m
kube-system kube-controller-manager-somerville-jade-20240123-work-control-plane-aa1b69e0-s5ncb 1/1 Running 0 24m
kube-system kube-proxy-2gv9n 1/1 Running 0 3m14s
kube-system kube-proxy-8g2kg 1/1 Running 0 22m
kube-system kube-proxy-dvcnv 1/1 Running 0 3m14s
kube-system kube-proxy-gsklp 1/1 Running 0 11m
kube-system kube-proxy-jmmf5 1/1 Running 0 22m
kube-system kube-proxy-mzb58 1/1 Running 0 13m
kube-system kube-proxy-pcnfv 1/1 Running 0 2m59s
kube-system kube-proxy-sm294 1/1 Running 0 11m
kube-system kube-proxy-wr85g 1/1 Running 0 24m
kube-system kube-proxy-xvb9n 1/1 Running 0 22m
kube-system kube-scheduler-somerville-jade-20240123-work-control-plane-aa1b69e0-s5ncb 1/1 Running 0 24m
kube-system metrics-server-65cccfc7bb-lxcx6 0/1 Pending 0 24m
kubernetes-dashboard kubernetes-dashboard-85d67585b8-hfdk7 0/2 Pending 0 24m
network-operator mellanox-network-operator-5f7b6b766c-4cxnj 0/1 Pending 0 22m
node-feature-discovery node-feature-discovery-master-75c9d78d5f-f97c8 0/1 Pending 0 24m
node-feature-discovery node-feature-discovery-worker-25wcz 0/1 CrashLoopBackOff 7 (30s ago) 24m
node-feature-discovery node-feature-discovery-worker-48nzw 1/1 Running 1 (60s ago) 3m14s
node-feature-discovery node-feature-discovery-worker-48zgq 0/1 ContainerCreating 0 13m
node-feature-discovery node-feature-discovery-worker-8qdt7 1/1 Running 4 (15m ago) 22m
node-feature-discovery node-feature-discovery-worker-cgww4 1/1 Running 3 (15m ago) 22m
node-feature-discovery node-feature-discovery-worker-gml9h 1/1 Running 3 (15m ago) 22m
node-feature-discovery node-feature-discovery-worker-nh9fd 1/1 Running 4 (5m14s ago) 11m
node-feature-discovery node-feature-discovery-worker-s6lbz 1/1 Running 2 (31s ago) 3m14s
node-feature-discovery node-feature-discovery-worker-svhbl 1/1 Running 1 (25s ago) 2m59s
node-feature-discovery node-feature-discovery-worker-wfd77 0/1 Error 4 (5m53s ago) 11m
openstack-system openstack-cinder-csi-controllerplugin-b5bb59498-6hnp8 0/6 Pending 0 24m
openstack-system openstack-cinder-csi-nodeplugin-7t5mk 0/3 ContainerCreating 0 22m
openstack-system openstack-cinder-csi-nodeplugin-bj7c7 0/3 ContainerCreating 0 11m
openstack-system openstack-cinder-csi-nodeplugin-cnlrv 0/3 ContainerCreating 0 3m14s
openstack-system openstack-cinder-csi-nodeplugin-hwft8 0/3 ContainerCreating 0 3m14s
openstack-system openstack-cinder-csi-nodeplugin-j6czn 0/3 ContainerCreating 0 2m59s
openstack-system openstack-cinder-csi-nodeplugin-lpqd8 0/3 ContainerCreating 0 13m
openstack-system openstack-cinder-csi-nodeplugin-n46lq 0/3 ContainerCreating 0 11m
openstack-system openstack-cinder-csi-nodeplugin-rdqgl 0/3 ContainerCreating 0 22m
openstack-system openstack-cinder-csi-nodeplugin-rj9k8 0/3 ContainerCreating 0 22m
openstack-system openstack-cinder-csi-nodeplugin-tq5jc 0/3 ContainerCreating 0 24m
openstack-system openstack-cloud-controller-manager-wmhqr 0/1 ContainerCreating 0 21m
tigera-operator tigera-operator-7d4cfffc6-bv7gq 1/1 Running 0 23m
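A quick way to triage a listing like the one above is to filter out the pods whose READY column is not full. A minimal sketch (the helper name `flag_unready` is my own; the sample rows are copied from the 20240123 listing above):

```shell
# Print "namespace/name status" for every pod whose READY column "x/y"
# has x != y. Reads `kubectl get pods --all-namespaces` output on stdin.
flag_unready() {
    awk 'NR > 1 {
        split($3, r, "/")
        if (r[1] != r[2]) print $1 "/" $2 " " $4
    }'
}

# Demo on rows copied from the failed 20240123 deployment above:
flag_unready <<'EOF'
NAMESPACE       NAME                                       READY   STATUS             RESTARTS      AGE
calico-system   calico-kube-controllers-6f994d9b59-z7fhs   0/1     Pending            0             23m
calico-system   calico-node-5jdwc                          1/1     Running            0             3m14s
calico-system   calico-node-mvtbk                          0/1     CrashLoopBackOff   6 (6m2s ago)  13m
EOF
# → calico-system/calico-kube-controllers-6f994d9b59-z7fhs Pending
# → calico-system/calico-node-mvtbk CrashLoopBackOff
```

In real use this would be piped from the tenant cluster, e.g. `kubectl --kubeconfig "${workclusterconf:?}" get pods --all-namespaces | flag_unready`.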
This is exactly the same configuration that worked on Thursday 18th Jan.
A deployment with 3 control nodes and 6 workers seems to be working, but we don't know why: https://github.com/Zarquan/gaia-dmp/blob/master/notes/zrq/20240125-01-jade-debug.txt
I'd like to keep this open while we run some more tests.
At this point in time, with the inconsistency of the test results, I'm leaning towards thinking this is less likely to be a Somerville system issue, and more likely something in the configuration of the project, the instances, or Cluster API. I've not used Cluster API to deploy production instances; however, I have in the past had difficulty deploying Kubernetes that presented as a CNI failure. In your comment here:
https://github.com/lsst-uk/somerville-operations/issues/144#issuecomment-1905901841
I would look first at the failed CNI pod, calico-node-mvtbk, with `kubectl describe`, and see what it is complaining about.
I'm not sure whether it's similar to my own experience, but I'd expect the CNI pods to be deployed first with few dependencies, so if they're not working you can focus on them without first having to eliminate the rest of the deployment as a possible confounding factor.
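A sketch of that first step, reusing the `workclusterconf` kubeconfig variable from the listings above; the `--previous` logs are usually where the CrashLoopBackOff reason shows up (the fallback kubeconfig path is hypothetical):

```shell
# Describe the failing CNI pod and pull logs from its last crashed run.
KUBECONFIG_FILE="${workclusterconf:-/tmp/work-cluster.yml}"  # fallback path is hypothetical
POD="calico-node-mvtbk"                                      # name from the 20240123 listing

if command -v kubectl >/dev/null 2>&1 && [ -f "${KUBECONFIG_FILE}" ]; then
    # The Events section at the end of describe usually names the failure.
    kubectl --kubeconfig "${KUBECONFIG_FILE}" -n calico-system describe pod "${POD}"
    # Logs from the previous (crashed) container instance.
    kubectl --kubeconfig "${KUBECONFIG_FILE}" -n calico-system \
        logs "${POD}" -c calico-node --previous
    status="inspected"
else
    echo "skipping: kubectl or tenant kubeconfig not available"
    status="skipped"
fi
```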
The same codebase works on Cambridge Arcus.
The deployment code is the same every time. The deployment procedure is the same every time. The only things we have been tweaking are the DNS server address, the images, the flavors, and the node counts.
Test results show that an explicit DNS server address is not needed. Test results show that the new images and flavors are working. Test results show that multiple control nodes may make the deployment more prone to fail.
As of this morning we have the full set of 3 control nodes and 6 worker nodes, all using the new flavors and images, but we don't know why it works now.
20240110 Working deployment on Somerville https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240110-01-somerville.txt
20240111 Working deployment on Cambridge Arcus https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240111-01-arcus-k8s.txt
20240112 Working deployment on Cambridge Arcus https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240112-03-arcus-dns.txt
20240112 Working deployment on Somerville https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240112-05-jade-dns.txt
20240115 Working deployment on Cambridge Arcus https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240115-03-arcus-dns.txt
20240116 Failed deployment on Somerville https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240116-01-jade-dns.txt
20240116 Failed deployment on Somerville https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240116-02-jade-dns.txt
20240116 Failed deployment on Somerville https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240116-03-jade-dns.txt
20240117 Failed deployment on Somerville - using known good configuration https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240117-01-jade-dns.txt
20240118 Failed deployment on Somerville - using new app credentials https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240118-01-credentials.txt
20240118 Working deployment on Somerville - as if by magic, no changes to the config https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240118-02-jade-magic.txt
20240118 Working deployment on Somerville https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240118-03-jade-dns.txt
20240119 Working deployment on Somerville https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240119-03-jade-dns.txt
20240123 Failed deployment on Somerville - same configuration as 20240119 https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240123-01-jade-debug.txt
20240124 Working deployment on Somerville - same configuration as 20240119 https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240124-01-jade-debug.txt
20240124 Working and failed deployment on Somerville - 1 control node works, 3 control nodes fails https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240124-02-jade-flavors.txt
20240125 Working deployment on Somerville - 1 control node works, 3 control nodes works https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240125-01-jade-debug.txt
On two occasions we have explicitly re-tested a previous configuration and got a different result. https://github.com/wfau/gaia-dmp/blob/ed7e8b8475f3cc51eabb10efd0fc86ab5651f702/notes/zrq/20240117-01-jade-dns.txt#L33-L43 https://github.com/wfau/gaia-dmp/blob/ed7e8b8475f3cc51eabb10efd0fc86ab5651f702/notes/zrq/20240123-01-jade-debug.txt#L33-L41
Something somewhere is flaky. I'll check the status of the CNI pods the next time it fails.
Personally, I suspect this might be due to issues with the LoadBalancers in the Somerville system. However it is a non-trivial task to try to narrow this down and identify the cause.
What we need are some debug tools, provided by StackHPC, that can be pointed at a kubectl endpoint to run some basic unit tests and health checks, to check for obvious issues with the underlying platform.
I've created an issue in the StackHPC capi-helm-charts GitHub project. If you think these tools would be useful for you please add a comment to boost the priority of the issue.
The StackHPC capi-helm-charts are supposed to make this kind of deployment quick and easy. If they work, they are fine. If the deploy doesn't work, the user is left wandering for days in a complex maze of things they didn't know they needed.
Deployment working on Monday 29th Jan. Collected some information about the load balancer, members and pools, for comparison if/when it fails.
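For reference, the kind of load balancer snapshot taken can be sketched with the standard Octavia CLI (this assumes `python-openstackclient` with the octavia plugin and a configured cloud; it is an illustration of the comparison data, not the exact commands used):

```shell
# Snapshot Octavia load balancers, pools, and member health for comparison.
if command -v openstack >/dev/null 2>&1; then
    openstack loadbalancer list
    openstack loadbalancer pool list
    # Member operating_status is the first thing to compare when it fails.
    for pool in $(openstack loadbalancer pool list -f value -c id); do
        openstack loadbalancer member list "${pool}"
    done
    status="captured"
else
    echo "skipping: openstack CLI not available"
    status="skipped"
fi
```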
It has stopped working again, with exactly the same configuration as the previous attempts.
20240217 FAIL - same configuration as above.
20240218 FAIL - same configuration as above.
Tried running the same deployment configuration this morning from an EIDF VM, seeing the same failures as reported by Dave.
Every 2.0s: clusterctl --kubeconfig /opt/aglais/somerville-jade-20240229-kind.yml describe cluster somerville-... somerville-jade-20240229-bootstrap-node.novalocal: Thu Feb 29 11:16:48 2024
NAME READY SEVERITY REASON SINCE MESSAGE
Cluster/somerville-jade-20240229-work False Warning ScalingUp 19m Scaling up control plane to 3 replicas (actual 1)
├─ClusterInfrastructure - OpenStackCluster/somerville-jade-20240229-work
├─ControlPlane - KubeadmControlPlane/somerville-jade-20240229-work-control-plane False Warning ScalingUp 19m Scaling up control plane to 3 replicas (actual 1)
│ └─Machine/somerville-jade-20240229-work-control-plane-rfqqm False Warning NodeStartupTimeout 7m40s Node failed to report startup in 10m0s
└─Workers
└─MachineDeployment/somerville-jade-20240229-work-md-0 False Warning WaitingForAvailableMachines 21m Minimum availability requires 5 replicas, current 0 available
└─6 Machines... True 7m11s See somerville-jade-20240229-work-md-0-bbgwv-27lm2, somerville-jade-20240229-work-md-0-bbgwv-5g576, ...
@millingw Can you run the deployment again and let us know if this fails?
This is based on my code for running the capi Helm charts, which could quite easily be wrong. If the CAPI/MagnumAPI interface is reliably creating Kubernetes clusters, then no need to chase this issue.
I don't know how to report this, other than to say that a Kubernetes deployment script that worked on Jan 10th now no longer works.
All the components get created correctly - network, subnet, router, load balancer, and at least some of the virtual machines - but the virtual machines fail to connect to each other and get marked as 'unhealthy', blocking the creation of the Kubernetes cluster.
A second pair of eyes to help with figuring out why this is happening would be appreciated.