lsst-uk / somerville-operations

User issue reporting and tracking for the Somerville Cloud

Unknown issue blocking Kubernetes cluster #144

Closed. Zarquan closed this issue 4 months ago.

Zarquan commented 10 months ago

I don't know how to report this, other than to say that a Kubernetes deployment script that worked on Jan 10th now no longer works.

All the components get created correctly (network, subnet, router, load balancer, and at least some of the virtual machines), but the virtual machines fail to connect to each other and get marked as 'unhealthy', blocking the creation of the Kubernetes cluster.

A second pair of eyes to help with figuring out why this is happening would be appreciated.

GregBlow commented 10 months ago

Where are you deploying from, and how? I can review log files from the deployment and OpenStack logs on the system. It would be useful to have the timestamps of an attempted deployment.

However, given the timing, it is likely that the OpenStack upgrade to Antelope is a confounding factor, or at least a causative one. We're still awaiting confirmation of its completion.

Zarquan commented 10 months ago

I can run the deploy script whenever is convenient for you. It takes about 10 minutes to run the full delete-all and create-all sequence. We can coordinate tests via Slack?

GregBlow commented 10 months ago

The bootstrap node, control plane, and worker nodes all deploy successfully and are responsive in Horizon. Console logs suggest the Kubernetes cluster is initialised on the control plane and joined by the worker nodes successfully. It looks like a communication error between the bootstrap node and the other nodes may be blocking the health-check pings (a sketch of how to inspect the machine status from the bootstrap cluster follows the log below).

[  111.857437] cloud-init[1022]: [2024-01-17 15:09:49] [bootstrap-token] Configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[  111.857679] cloud-init[1022]: [2024-01-17 15:09:50] [bootstrap-token] Creating the "cluster-info" ConfigMap in the "kube-public" namespace
[  111.857898] cloud-init[1022]: [2024-01-17 15:09:50] [kubelet-finalize] Updating "/etc/kubernetes/kubelet.conf" to point to a rotatable kubelet client certificate and key
[  111.858123] cloud-init[1022]: [2024-01-17 15:09:53] [addons] Applied essential addon: CoreDNS
[  111.858357] cloud-init[1022]: [2024-01-17 15:09:54] [addons] Applied essential addon: kube-proxy
[  111.858636] cloud-init[1022]: [2024-01-17 15:09:54]
[  111.858813] cloud-init[1022]: [2024-01-17 15:09:54] Your Kubernetes control-plane has initialized successfully!
[  111.859046] cloud-init[1022]: [2024-01-17 15:09:54]
[  111.859302] cloud-init[1022]: [2024-01-17 15:09:54] To start using your cluster, you need to run the following as a regular user:
[  111.859548] cloud-init[1022]: [2024-01-17 15:09:54]
[  111.859779] cloud-init[1022]: [2024-01-17 15:09:54]   mkdir -p $HOME/.kube
[  111.860024] cloud-init[1022]: [2024-01-17 15:09:54]   sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
[  111.860242] cloud-init[1022]: [2024-01-17 15:09:54]   sudo chown $(id -u):$(id -g) $HOME/.kube/config
[  111.860625] cloud-init[1022]: [2024-01-17 15:09:54]
[  111.860923] cloud-init[1022]: [2024-01-17 15:09:54] Alternatively, if you are the root user, you can run:
[  111.861159] cloud-init[1022]: [2024-01-17 15:09:54]
[  111.861382] cloud-init[1022]: [2024-01-17 15:09:54]   export KUBECONFIG=/etc/kubernetes/admin.conf
[  111.861632] cloud-init[1022]: [2024-01-17 15:09:54]
[  111.861870] cloud-init[1022]: [2024-01-17 15:09:54] You should now deploy a pod network to the cluster.
[  111.862123] cloud-init[1022]: [2024-01-17 15:09:54] Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
[  111.862347] cloud-init[1022]: [2024-01-17 15:09:54]   https://kubernetes.io/docs/concepts/cluster-administration/addons/
[  111.862591] cloud-init[1022]: [2024-01-17 15:09:54]
[  111.862811] cloud-init[1022]: [2024-01-17 15:09:54] You can now join any number of control-plane nodes by copying certificate authorities
[  111.863053] cloud-init[1022]: [2024-01-17 15:09:54] and service account keys on each node and then running the following as root:
[  111.863270] cloud-init[1022]: [2024-01-17 15:09:54]
[  111.863492] cloud-init[1022]: [2024-01-17 15:09:54]   kubeadm join 192.41.122.22:6443 --token fjcs87.4n0vh6ffzqcdly3p \
[  111.863968] cloud-init[1022]: [2024-01-17 15:09:54]  --discovery-token-ca-cert-hash sha256:36de6c58705d63ee7218d8d1826afa03902384207357cbcef26e2c71960d9ccb \
[  111.934370] cloud-init[1022]: [2024-01-17 15:09:54]  --control-plane
[  111.934537] cloud-init[1022]: [2024-01-17 15:09:54]
[  111.934762] cloud-init[1022]: [2024-01-17 15:09:54] Then you can join any number of worker nodes by running the following on each as root:
[  111.935001] cloud-init[1022]: [2024-01-17 15:09:54]
[  111.935279] cloud-init[1022]: [2024-01-17 15:09:54] kubeadm join 192.41.122.22:6443 --token fjcs87.4n0vh6ffzqcdly3p \
[  111.935557] cloud-init[1022]: [2024-01-17 15:09:54]  --discovery-token-ca-cert-hash sha256:36de6c58705d63ee7218d8d1826afa03902384207357cbcef26e2c71960d9ccb
[  111.935811] cloud-init[1022]: [2024-01-17 15:09:54] Cloud-init v. 23.3.1-0ubuntu1~22.04.1 finished at Wed, 17 Jan 2024 15:09:54 +0000. Datasource DataSourceOpenStackLocal [net,ver=2].  Up 111.66 seconds
ci-info: | ssh-rsa | d8:69:0c:45:8a:d4:e8:ab:36:1a:a3:ad:54:33:aa:73:99:ea:e6:80:58:45:88:60:de:7e:63:cd:c9:9d:2e:c2 |    -    |     nch@roe.ac.uk      |
ci-info: | ssh-rsa | 09:6d:38:28:2e:3a:58:8e:76:9a:c8:af:5b:f5:b7:c6:08:46:aa:90:2c:8c:5f:32:83:fa:93:fc:d5:c5:78:63 |    -    | a.krause@epcc.ed.ac.uk |
ci-info: +---------+-------------------------------------------------------------------------------------------------+---------+------------------------+
<14>Jan 17 15:31:26 cloud-init: #############################################################
<14>Jan 17 15:31:26 cloud-init: -----BEGIN SSH HOST KEY FINGERPRINTS-----
<14>Jan 17 15:31:26 cloud-init: 1024 SHA256:iGX/l5kVE3yD0apUO8DpBKaOqBMMEWe4bB02FowacZI root@somerville-jade-20240117-work-md-0-c03ed516-xxqqs (DSA)
<14>Jan 17 15:31:26 cloud-init: 256 SHA256:/ctN/C6gOHzTfP1pwq7K2qg12NbXwXJWix+HyfmUQ3Q root@somerville-jade-20240117-work-md-0-c03ed516-xxqqs (ECDSA)
<14>Jan 17 15:31:26 cloud-init: 256 SHA256:PLurhIEdcxOZDj0lxlLNn1+/uEn6Vvl7Cl43d+n5IBk root@somerville-jade-20240117-work-md-0-c03ed516-xxqqs (ED25519)
<14>Jan 17 15:31:26 cloud-init: 3072 SHA256:7Trt6N15vqfqBr6UrktTSvxDm8l0VcBpeblUuv2mXCw root@somerville-jade-20240117-work-md-0-c03ed516-xxqqs (RSA)
<14>Jan 17 15:31:26 cloud-init: -----END SSH HOST KEY FINGERPRINTS-----
<14>Jan 17 15:31:26 cloud-init: #############################################################
-----BEGIN SSH HOST KEY KEYS-----
ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBIrGTfmOdreqNILWQfjI9SoY5Y9Ysc7GHzBmEqpIloOXdU+98EI6DpsueVJgKTSu9HF5ePYXEZeMfLEUfIv9U3g= root@somerville-jade-20240117-work-md-0-c03ed516-xxqqs
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFig6RPd9Pm/ggLMy0zEYPhTkx7UM2ysnYfgbl2Qu/5N root@somerville-jade-20240117-work-md-0-c03ed516-xxqqs
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCzAtVoUfbVYfcYFYiNanrB1Ra3TBwZg3wthPz8zoMM8JjksEeGHUJQSADEwSQfc1vevyXXgcLHBkslw8T46R1XWpXyf9glF/JJkTuKkz3iFvE4SozccJYqejal7GkfXFrl3TVbcX+eq+s0kjyby5oGT+gaCki6OUnFYIkoBXosKiQsGMTqUev8PK7bRqLu/FO9RQvveO6vhRonb/lUU+vfuVMnYTJszdJbrqibdTkf74Xb06Af8YYdSdq5VmHZLTnJeJWEiyLR8qF6nwPhgjqwxa7x8+pyOJd21eKojryWOVW/15DKtdjDTpoPEgAabeftItwMghxR1K7jhJvmWr7vZruAQWyc5sU1A3ZjwLc/KDygnVWUrqZRABbsPwgRpM3ho3N3RLAxkEkZYBtliLLjfYKD0waCXcazz+lo+b6nb3gQ2Asja4GmC/3meSoONTmYFgBeMffvyVpAe2fVSibOhpeX/rM/04LNeIDSwl5AKVPcqTxXO7IECBBiJ9t3xmM= root@somerville-jade-20240117-work-md-0-c03ed516-xxqqs
-----END SSH HOST KEY KEYS-----
[  OK  ] Finished Execute cloud user/final scripts.
[  OK  ] Reached target Cloud-init target.
[   63.501943] cloud-init[1023]: [2024-01-17 15:31:06] Cloud-init v. 23.3.1-0ubuntu1~22.04.1 running 'modules:final' at Wed, 17 Jan 2024 15:31:06 +0000. Up 42.81 seconds.
[   63.518593] cloud-init[1023]: [2024-01-17 15:31:09] [preflight] Running pre-flight checks
[   63.520202] cloud-init[1023]: [2024-01-17 15:31:10] [preflight] Reading configuration from the cluster...
[   63.521839] cloud-init[1023]: [2024-01-17 15:31:10] [preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[   63.524296] cloud-init[1023]: [2024-01-17 15:31:10] [kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[   63.526315] cloud-init[1023]: [2024-01-17 15:31:10] [kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[   63.528634] cloud-init[1023]: [2024-01-17 15:31:10] [kubelet-start] Starting the kubelet
[   63.530045] cloud-init[1023]: [2024-01-17 15:31:11] [kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
[   63.531878] cloud-init[1023]: [2024-01-17 15:31:26]
[   63.532863] cloud-init[1023]: [2024-01-17 15:31:26] This node has joined the cluster:
[   63.534259] cloud-init[1023]: [2024-01-17 15:31:26] * Certificate signing request was sent to apiserver and a response was received.
[   63.536392] cloud-init[1023]: [2024-01-17 15:31:26] * The Kubelet was informed of the new secure connection details.
[   63.538174] cloud-init[1023]: [2024-01-17 15:31:26]
[   63.539146] cloud-init[1023]: [2024-01-17 15:31:26] Run 'kubectl get nodes' on the control-plane to see this node join the cluster.
[   63.541210] cloud-init[1023]: [2024-01-17 15:31:26]
[   63.542135] cloud-init[1023]: [2024-01-17 15:31:27] Cloud-init v. 23.3.1-0ubuntu1~22.04.1 finished at Wed, 17 Jan 2024 15:31:26 +0000. Datasource DataSourceOpenStackLocal [net,ver=2].  Up 63.40 seconds
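
A rough sketch of how the Cluster API machine and health-check status could be inspected from the bootstrap cluster while the deployment is stuck; the kubeconfig variable follows the convention used elsewhere in this thread, and the machine name is a placeholder:

# List the Cluster API machines and health checks on the bootstrap (kind) cluster.
kubectl \
    --kubeconfig "${kindclusterconf:?}" \
    get machines \
        --all-namespaces

kubectl \
    --kubeconfig "${kindclusterconf:?}" \
    get machinehealthchecks \
        --all-namespaces

# Describe a machine that is failing its health check to see the recorded events.
kubectl \
    --kubeconfig "${kindclusterconf:?}" \
    describe machine <machine-name>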
GregBlow commented 10 months ago

Suggested regenerating the application credentials used by the deployment and retrying.
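
For reference, a rough sketch of regenerating application credentials with the OpenStack CLI; the credential name is a placeholder, and copying the new id and secret into the cloud configuration read by the deployment scripts is an assumption here:

# Delete the old application credential and create a replacement.
openstack application credential delete <credential-name>
openstack application credential create <credential-name>

# The new id and secret then need to be copied into the cloud configuration
# (for example a clouds.yaml) that the deployment scripts read.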

Zarquan commented 9 months ago

I'm not sure I understand the mechanism here. Broken or out-of-date credentials would cause errors when creating nodes, not communication errors between nodes. Anyway, I created a new set of application credentials and ran through the full delete-all, create-all sequence.

Similar results:

NAME                                                                              READY  SEVERITY  REASON                       SINCE  MESSAGE
Cluster/somerville-jade-20240118-work                                             False  Warning   ScalingUp                    27m    Scaling up control plane to 3 replicas (actual 1)
├─ClusterInfrastructure - OpenStackCluster/somerville-jade-20240118-work
├─ControlPlane - KubeadmControlPlane/somerville-jade-20240118-work-control-plane  False  Warning   ScalingUp                    27m    Scaling up control plane to 3 replicas (actual 1)
│ └─Machine/somerville-jade-20240118-work-control-plane-pfrpl                     False  Warning   NodeStartupTimeout           15m    Node failed to report startup in 10m0s
└─Workers
  └─MachineDeployment/somerville-jade-20240118-work-md-0                          False  Warning   WaitingForAvailableMachines  29m    Minimum availability requires 2 replicas, current 0 available
    └─3 Machines...                                                               True                                          4m57s  See somerville-jade-20240118-work-md-0-7nx46-4vlvv, somerville-jade-20240118-work-md-0-7nx46-8h4sh, ...
....
I0118 06:52:29.947730       1 machinehealthcheck_controller.go:433] "Target has failed health check, marking for remediation" controller="machinehealthcheck" controllerGroup="cluster.x-k8s.io" controllerKind="MachineHealthCheck" MachineHealthCheck="default/somerville-jade-20240118-work-control-plane" namespace="default" name="somerville-jade-20240118-work-control-plane" reconcileID="2950f9d6-b0ca-43a5-9f2b-8b3d259f5cbd" Cluster="default/somerville-jade-20240118-work" target="default/somerville-jade-20240118-work-control-plane/somerville-jade-20240118-work-control-plane-pfrpl/" reason="NodeStartupTimeout" message="Node failed to report startup in 10m0s"
I0118 06:52:29.954896       1 recorder.go:104] "events: Machine default/somerville-jade-20240118-work-control-plane/somerville-jade-20240118-work-control-plane-pfrpl/ has been marked as unhealthy" type="Normal" object={"kind":"Machine","namespace":"default","name":"somerville-jade-20240118-work-control-plane-pfrpl","uid":"c898e1a1-4b62-4e16-8e9a-bbd9b7de022b","apiVersion":"cluster.x-k8s.io/v1beta1","resourceVersion":"4829"} reason="MachineMarkedUnhealthy"
....
markgbeckett commented 9 months ago

Is this the kind of issue that would be useful to discuss on a Kubernetes users' channel on Slack, or similar?

GregBlow commented 9 months ago

The problem self-resolved, possibly as a result of Docker rate limits.

If the issue recurs, check the cni-calico Helm release to see if it is stuck in a pending state. Check the pods on the tenant cluster, kubectl describe any that are failing, and check whether the reason is rate limits (a sketch of these checks follows below).

@Zarquan to retry the test with larger flavours.
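
A rough sketch of those checks, using the kubeconfig variables that appear later in this thread (the pod name is a placeholder):

# On the bootstrap (kind) cluster: check whether the cni-calico release is stuck
# in a pending or installing phase.
kubectl \
    --kubeconfig "${kindclusterconf:?}" \
    get helmrelease \
        --all-namespaces

# On the tenant (work) cluster: list the pods and describe any that are failing,
# looking for image-pull or rate-limit errors in the events.
kubectl \
    --kubeconfig "${workclusterconf:?}" \
    get pods \
        --all-namespaces

kubectl \
    --kubeconfig "${workclusterconf:?}" \
    describe pod \
        --namespace calico-system \
        <failing-pod-name>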

Zarquan commented 9 months ago

Back to not working again :confused:

Zarquan commented 9 months ago
kubectl \
    --kubeconfig "${kindclusterconf:?}" \
    get helmrelease -A

NAMESPACE   NAME                                                      CLUSTER                         BOOTSTRAP   TARGET NAMESPACE         RELEASE NAME                PHASE        REVISION   CHART NAME                           CHART VERSION   AGE
default     somerville-jade-20240123-work-ccm-openstack               somerville-jade-20240123-work   true        openstack-system         ccm-openstack               Deployed     1          openstack-cloud-controller-manager   1.3.0           16m
default     somerville-jade-20240123-work-cni-calico                  somerville-jade-20240123-work   true        tigera-operator          cni-calico                  Deployed     1          tigera-operator                      v3.26.0         16m
default     somerville-jade-20240123-work-csi-cinder                  somerville-jade-20240123-work   true        openstack-system         csi-cinder                  Installing              openstack-cinder-csi                 2.2.0           16m
default     somerville-jade-20240123-work-kubernetes-dashboard        somerville-jade-20240123-work   true        kubernetes-dashboard     kubernetes-dashboard        Deployed     1          kubernetes-dashboard                 5.10.0          16m
default     somerville-jade-20240123-work-mellanox-network-operator   somerville-jade-20240123-work   true        network-operator         mellanox-network-operator   Installing              network-operator                     1.3.0           16m
default     somerville-jade-20240123-work-metrics-server              somerville-jade-20240123-work   true        kube-system              metrics-server              Installing              metrics-server                       3.8.2           16m
default     somerville-jade-20240123-work-node-feature-discovery      somerville-jade-20240123-work   true        node-feature-discovery   node-feature-discovery      Installing              node-feature-discovery               0.11.2          16m
default     somerville-jade-20240123-work-nvidia-gpu-operator         somerville-jade-20240123-work   true        gpu-operator             nvidia-gpu-operator         Installing              gpu-operator                         v1.11.1         16m
Zarquan commented 9 months ago
kubectl \
    --kubeconfig "${kindclusterconf:?}" \
    get pods \
        --all-namespaces

NAMESPACE                           NAME                                                                  READY   STATUS             RESTARTS        AGE
capi-kubeadm-bootstrap-system       capi-kubeadm-bootstrap-controller-manager-7db568c844-kmbmt            1/1     Running            0               29m
capi-kubeadm-control-plane-system   capi-kubeadm-control-plane-controller-manager-7f9b558f5c-5r2mm        1/1     Running            0               29m
capi-system                         capi-controller-manager-76955c46b9-drqph                              1/1     Running            0               29m
capo-system                         capo-controller-manager-544cb69b9d-njccn                              1/1     Running            0               29m
cert-manager                        cert-manager-66d9545484-5mf9c                                         1/1     Running            0               31m
cert-manager                        cert-manager-cainjector-7d8b6bd6fb-p89lw                              1/1     Running            0               31m
cert-manager                        cert-manager-webhook-669b96dcfd-wq5zt                                 1/1     Running            0               31m
default                             cluster-api-addon-provider-66cc76bbbf-jmlq2                           1/1     Running            0               29m
default                             somerville-jade-20240123-work-autoscaler-59658d94b6-jxx6d             0/1     CrashLoopBackOff   9 (2m15s ago)   27m
kube-system                         coredns-5d78c9869d-hrp6n                                              1/1     Running            0               31m
kube-system                         coredns-5d78c9869d-nt956                                              1/1     Running            0               31m
kube-system                         etcd-somerville-jade-20240123-kind-control-plane                      1/1     Running            0               32m
kube-system                         kindnet-s2f52                                                         1/1     Running            0               31m
kube-system                         kube-apiserver-somerville-jade-20240123-kind-control-plane            1/1     Running            0               31m
kube-system                         kube-controller-manager-somerville-jade-20240123-kind-control-plane   1/1     Running            0               32m
kube-system                         kube-proxy-nz7xd                                                      1/1     Running            0               31m
kube-system                         kube-scheduler-somerville-jade-20240123-kind-control-plane            1/1     Running            0               32m
local-path-storage                  local-path-provisioner-6bc4bddd6b-58fs2                               1/1     Running            0               31m
Zarquan commented 9 months ago
kubectl \
    --kubeconfig "${workclusterconf:?}" \
    get pods \
        --all-namespaces

E0123 12:04:43.708937   14021 memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0123 12:04:43.744442   14021 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0123 12:04:43.749007   14021 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0123 12:04:43.753613   14021 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
NAMESPACE                NAME                                                                                 READY   STATUS              RESTARTS        AGE
calico-system            calico-kube-controllers-6f994d9b59-z7fhs                                             0/1     Pending             0               23m
calico-system            calico-node-5jdwc                                                                    1/1     Running             0               3m14s
calico-system            calico-node-h7gcn                                                                    1/1     Running             0               2m59s
calico-system            calico-node-mqgjj                                                                    1/1     Running             0               22m
calico-system            calico-node-mvtbk                                                                    0/1     CrashLoopBackOff    6 (6m2s ago)    13m
calico-system            calico-node-nkmtb                                                                    1/1     Running             0               3m15s
calico-system            calico-node-p5vc8                                                                    1/1     Running             0               11m
calico-system            calico-node-pzbgq                                                                    1/1     Running             0               23m
calico-system            calico-node-qkdjb                                                                    1/1     Running             0               22m
calico-system            calico-node-rj7jk                                                                    1/1     Running             0               11m
calico-system            calico-node-z8fvd                                                                    1/1     Running             0               22m
calico-system            calico-typha-588d9ffd8c-6dfw8                                                        1/1     Running             0               22m
calico-system            calico-typha-588d9ffd8c-b5mmg                                                        1/1     Running             0               13m
calico-system            calico-typha-588d9ffd8c-btgbm                                                        1/1     Running             0               23m
calico-system            csi-node-driver-8kxmh                                                                2/2     Running             0               22m
calico-system            csi-node-driver-8xmbj                                                                2/2     Running             0               3m14s
calico-system            csi-node-driver-9dmxv                                                                0/2     ContainerCreating   0               13m
calico-system            csi-node-driver-cztgj                                                                2/2     Running             0               3m15s
calico-system            csi-node-driver-ktngs                                                                2/2     Running             0               22m
calico-system            csi-node-driver-pg72j                                                                2/2     Running             0               22m
calico-system            csi-node-driver-rhx88                                                                2/2     Running             0               11m
calico-system            csi-node-driver-rp8l6                                                                2/2     Running             0               11m
calico-system            csi-node-driver-vs6nf                                                                2/2     Running             0               2m59s
calico-system            csi-node-driver-zchr5                                                                2/2     Running             0               23m
gpu-operator             gpu-operator-6c8649c88c-x4pxp                                                        0/1     Pending             0               22m
kube-system              coredns-787d4945fb-fvbzn                                                             0/1     Pending             0               24m
kube-system              coredns-787d4945fb-lcvkz                                                             0/1     Pending             0               24m
kube-system              etcd-somerville-jade-20240123-work-control-plane-aa1b69e0-s5ncb                      1/1     Running             0               24m
kube-system              kube-apiserver-somerville-jade-20240123-work-control-plane-aa1b69e0-s5ncb            1/1     Running             0               24m
kube-system              kube-controller-manager-somerville-jade-20240123-work-control-plane-aa1b69e0-s5ncb   1/1     Running             0               24m
kube-system              kube-proxy-2gv9n                                                                     1/1     Running             0               3m14s
kube-system              kube-proxy-8g2kg                                                                     1/1     Running             0               22m
kube-system              kube-proxy-dvcnv                                                                     1/1     Running             0               3m14s
kube-system              kube-proxy-gsklp                                                                     1/1     Running             0               11m
kube-system              kube-proxy-jmmf5                                                                     1/1     Running             0               22m
kube-system              kube-proxy-mzb58                                                                     1/1     Running             0               13m
kube-system              kube-proxy-pcnfv                                                                     1/1     Running             0               2m59s
kube-system              kube-proxy-sm294                                                                     1/1     Running             0               11m
kube-system              kube-proxy-wr85g                                                                     1/1     Running             0               24m
kube-system              kube-proxy-xvb9n                                                                     1/1     Running             0               22m
kube-system              kube-scheduler-somerville-jade-20240123-work-control-plane-aa1b69e0-s5ncb            1/1     Running             0               24m
kube-system              metrics-server-65cccfc7bb-lxcx6                                                      0/1     Pending             0               24m
kubernetes-dashboard     kubernetes-dashboard-85d67585b8-hfdk7                                                0/2     Pending             0               24m
network-operator         mellanox-network-operator-5f7b6b766c-4cxnj                                           0/1     Pending             0               22m
node-feature-discovery   node-feature-discovery-master-75c9d78d5f-f97c8                                       0/1     Pending             0               24m
node-feature-discovery   node-feature-discovery-worker-25wcz                                                  0/1     CrashLoopBackOff    7 (30s ago)     24m
node-feature-discovery   node-feature-discovery-worker-48nzw                                                  1/1     Running             1 (60s ago)     3m14s
node-feature-discovery   node-feature-discovery-worker-48zgq                                                  0/1     ContainerCreating   0               13m
node-feature-discovery   node-feature-discovery-worker-8qdt7                                                  1/1     Running             4 (15m ago)     22m
node-feature-discovery   node-feature-discovery-worker-cgww4                                                  1/1     Running             3 (15m ago)     22m
node-feature-discovery   node-feature-discovery-worker-gml9h                                                  1/1     Running             3 (15m ago)     22m
node-feature-discovery   node-feature-discovery-worker-nh9fd                                                  1/1     Running             4 (5m14s ago)   11m
node-feature-discovery   node-feature-discovery-worker-s6lbz                                                  1/1     Running             2 (31s ago)     3m14s
node-feature-discovery   node-feature-discovery-worker-svhbl                                                  1/1     Running             1 (25s ago)     2m59s
node-feature-discovery   node-feature-discovery-worker-wfd77                                                  0/1     Error               4 (5m53s ago)   11m
openstack-system         openstack-cinder-csi-controllerplugin-b5bb59498-6hnp8                                0/6     Pending             0               24m
openstack-system         openstack-cinder-csi-nodeplugin-7t5mk                                                0/3     ContainerCreating   0               22m
openstack-system         openstack-cinder-csi-nodeplugin-bj7c7                                                0/3     ContainerCreating   0               11m
openstack-system         openstack-cinder-csi-nodeplugin-cnlrv                                                0/3     ContainerCreating   0               3m14s
openstack-system         openstack-cinder-csi-nodeplugin-hwft8                                                0/3     ContainerCreating   0               3m14s
openstack-system         openstack-cinder-csi-nodeplugin-j6czn                                                0/3     ContainerCreating   0               2m59s
openstack-system         openstack-cinder-csi-nodeplugin-lpqd8                                                0/3     ContainerCreating   0               13m
openstack-system         openstack-cinder-csi-nodeplugin-n46lq                                                0/3     ContainerCreating   0               11m
openstack-system         openstack-cinder-csi-nodeplugin-rdqgl                                                0/3     ContainerCreating   0               22m
openstack-system         openstack-cinder-csi-nodeplugin-rj9k8                                                0/3     ContainerCreating   0               22m
openstack-system         openstack-cinder-csi-nodeplugin-tq5jc                                                0/3     ContainerCreating   0               24m
openstack-system         openstack-cloud-controller-manager-wmhqr                                             0/1     ContainerCreating   0               21m
tigera-operator          tigera-operator-7d4cfffc6-bv7gq                                                      1/1     Running             0               23m
Zarquan commented 9 months ago

This is exactly the same configuration that worked on Thursday 18th Jan.

Zarquan commented 9 months ago

3 control nodes and 6 workers seem to be working, but we don't know why: https://github.com/Zarquan/gaia-dmp/blob/master/notes/zrq/20240125-01-jade-debug.txt

I'd like to keep this open while we run some more tests.

GregBlow commented 9 months ago

At this point, with the inconsistency of the test results, I'm leaning towards thinking it is less likely to be a Somerville system issue and more likely something in the configuration of the project, the instances, or Cluster API. I've not used Cluster API to deploy production instances; however, I have had some difficulty in the past deploying Kubernetes that presented with CNI failures. In your comment here:

https://github.com/lsst-uk/somerville-operations/issues/144#issuecomment-1905901841

I would look first at the failed CNI pod, calico-node-mvtbk, with kubectl describe and see what it is complaining of.

I'm not sure if it's similar to my own experience, but I'd expect the CNI pods to be deployed first with few dependencies, so if they're not working you don't have to eliminate the rest of the deployment as a possible confounding factor.
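
For example, a minimal sketch using the kubeconfig variable from the earlier comments (the logs command and the --previous flag are an extra suggestion, not something already used in this thread):

# Show the status and events for the failing calico-node pod on the tenant cluster.
kubectl \
    --kubeconfig "${workclusterconf:?}" \
    describe pod \
        --namespace calico-system \
        calico-node-mvtbk

# The logs from the previous (crashed) container instance may also show why it is crash-looping.
kubectl \
    --kubeconfig "${workclusterconf:?}" \
    logs \
        --namespace calico-system \
        --previous \
        calico-node-mvtbk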

Zarquan commented 9 months ago

The same codebase works on Cambridge Arcus.

The deployment code is the same every time. The deployment procedure is the same every time. The only things we have been tweaking are the DNS server address, images, flavors, and node counts.

Test results show that an explicit DNS server address is not needed. Test results show that the new images and flavors are working. Test results show that multiple control nodes may make the deployment more prone to fail.

As of this morning we have the full set of 3 control nodes and 6 worker nodes, all using the new flavors and images, but we don't know why.

20240110 Working deployment on Somerville https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240110-01-somerville.txt

20240111 Working deployment on Cambridge Arcus https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240111-01-arcus-k8s.txt

20240112 Working deployment on Cambridge Arcus https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240112-03-arcus-dns.txt

20240112 Working deployment on Somerville https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240112-05-jade-dns.txt

20240115 Working deployment on Cambridge Arcus https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240115-03-arcus-dns.txt

20240116 Failed deployment on Somerville https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240116-01-jade-dns.txt

20240116 Failed deployment on Somerville https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240116-02-jade-dns.txt

20240116 Failed deployment on Somerville https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240116-03-jade-dns.txt

20240117 Failed deployment on Somerville - using known good configuration https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240117-01-jade-dns.txt

20240118 Failed deployment on Somerville - using new app credentials https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240118-01-credentials.txt

20240118 Working deployment on Somerville - as if by magic, no changes to the config https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240118-02-jade-magic.txt

20240118 Working deployment on Somerville https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240118-03-jade-dns.txt

20240119 Working deployment on Somerville https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240119-03-jade-dns.txt

20240123 Failed deployment on Somerville - same configuration as 20240119 https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240123-01-jade-debug.txt

20240124 Working deployment on Somerville - same configuration as 20240119 https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240124-01-jade-debug.txt

20240124 Working and failed deployment on Somerville - 1 control node works, 3 control nodes fails https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240124-02-jade-flavors.txt

20240125 Working deployment on Somerville - 1 control node works, 3 control nodes works https://github.com/wfau/gaia-dmp/blob/master/notes/zrq/20240125-01-jade-debug.txt

On 2 occasions we have explicitly re-tested a previous configuration and got a different result. https://github.com/wfau/gaia-dmp/blob/ed7e8b8475f3cc51eabb10efd0fc86ab5651f702/notes/zrq/20240117-01-jade-dns.txt#L33-L43 https://github.com/wfau/gaia-dmp/blob/ed7e8b8475f3cc51eabb10efd0fc86ab5651f702/notes/zrq/20240123-01-jade-debug.txt#L33-L41

Something somewhere is flakey. I'll check the status of the CNI pods next time it fails.

Zarquan commented 9 months ago

Personally, I suspect this might be due to issues with the load balancers in the Somerville system. However, it is a non-trivial task to try to narrow this down and identify the cause.

What we need are some debug tools, provided by StackHPC, that can be pointed at a kubectl endpoint to run some basic unit tests and health checks for obvious issues with the underlying platform.

I've created an issue in the StackHPC capi-helm-charts GitHub project. If you think these tools would be useful for you please add a comment to boost the priority of the issue.

https://github.com/stackhpc/capi-helm-charts/issues/232
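
In the meantime, a rough sketch of the kind of basic checks such a tool might run, expressed as plain kubectl commands against the tenant cluster (kubeconfig variable as used above):

# Quick health snapshot of the tenant cluster: node status, pods that are not
# Running or Completed, and any Warning events.
kubectl --kubeconfig "${workclusterconf:?}" get nodes

kubectl --kubeconfig "${workclusterconf:?}" get pods --all-namespaces \
    | grep --invert-match --extended-regexp 'Running|Completed'

kubectl --kubeconfig "${workclusterconf:?}" get events --all-namespaces \
    --field-selector type=Warning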

Zarquan commented 9 months ago

The StackHPC capi-helm-charts are supposed to make this kind of deployment quick and easy. If they work, they are fine. If the deployment doesn't work, the user is left wandering for days in a complex maze of things they didn't know they needed.

Zarquan commented 9 months ago

Deployment working on Monday 29th Jan. Collected some information about the load balancer, members and pools, for comparison if/when it fails.
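
For reference, a rough sketch of how that information can be captured with the OpenStack CLI, assuming the Octavia load-balancer commands are available (the pool identifier is a placeholder):

# Snapshot the load balancers, pools, and pool members for later comparison.
openstack loadbalancer list
openstack loadbalancer pool list
openstack loadbalancer member list <pool-name-or-id>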

Zarquan commented 8 months ago

It has stopped working again. Exactly the same configuration as the previous attempts.

20240217 FAIL - same configuration as above.
20240218 FAIL - same configuration as above.

millingw commented 8 months ago

Tried running the same deployment configuration this morning from an EIDF VM, seeing the same failures as reported by Dave.

Every 2.0s: clusterctl --kubeconfig /opt/aglais/somerville-jade-20240229-kind.yml describe cluster somerville-...  somerville-jade-20240229-bootstrap-node.novalocal: Thu Feb 29 11:16:48 2024

NAME                                                                              READY  SEVERITY  REASON                       SINCE  MESSAGE
Cluster/somerville-jade-20240229-work                                             False  Warning   ScalingUp                    19m    Scaling up control plane to 3 replicas (actual 1)
├─ClusterInfrastructure - OpenStackCluster/somerville-jade-20240229-work
├─ControlPlane - KubeadmControlPlane/somerville-jade-20240229-work-control-plane  False  Warning   ScalingUp                    19m    Scaling up control plane to 3 replicas (actual 1)
│ └─Machine/somerville-jade-20240229-work-control-plane-rfqqm                     False  Warning   NodeStartupTimeout           7m40s  Node failed to report startup in 10m0s
└─Workers
  └─MachineDeployment/somerville-jade-20240229-work-md-0                          False  Warning   WaitingForAvailableMachines  21m    Minimum availability requires 5 replicas, current 0 available
    └─6 Machines...                                                               True                                          7m11s  See somerville-jade-20240229-work-md-0-bbgwv-27lm2, somerville-jade-20240229-work-md-0-bbgwv-5g576, ...
astrodb commented 6 months ago

@millingw Can you run the deployment again and let us know if this fails?

Zarquan commented 6 months ago

This is based on my code for running the CAPI Helm charts, which could quite easily be wrong. If the CAPI/Magnum API interface is reliably creating Kubernetes clusters, then there is no need to chase this issue.