metallb / metallb-operator

MetalLB Operator for deploying metallb
Apache License 2.0

Web hook pod (frr-k8s-webhook-server) is restarting at least 3 times before healthy #494

Open karampok opened 2 months ago

karampok commented 2 months ago

When running the E2E test that checks frr-k8s: https://github.com/metallb/metallb-operator/blob/main/test/e2e/functional/tests/e2e.go#L282

The test is green, but the pod restarts several times before it becomes healthy/ready:

 kubectl -n metallb-system get pods -l component=frr-k8s-webhook-server -o wide -w
NAME                                      READY   STATUS             RESTARTS     AGE   IP           NODE          NOMINATED NODE   READINESS GATES
frr-k8s-webhook-server-6ffd7bc857-cwcsv   0/1     CrashLoopBackOff   2 (6s ago)   29s   10.244.2.7   kind-worker   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-cwcsv   0/1     Running            3 (22s ago)   45s   10.244.2.7   kind-worker   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-cwcsv   1/1     Running            3 (38s ago)   61s   10.244.2.7   kind-worker   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-cwcsv   1/1     Terminating        3 (69s ago)   92s   10.244.2.7   kind-worker   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-cwcsv   0/1     Terminating        3 (70s ago)   93s   <none>       kind-worker   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-cwcsv   0/1     Terminating        3 (70s ago)   93s   10.244.2.7   kind-worker   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-cwcsv   0/1     Terminating        3 (70s ago)   93s   10.244.2.7   kind-worker   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-cwcsv   0/1     Terminating        3 (70s ago)   93s   10.244.2.7   kind-worker   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   0/1     Pending            0             0s    <none>       <none>        <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   0/1     Pending            0             0s    <none>       kind-worker2   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   0/1     ContainerCreating   0             0s    <none>       kind-worker2   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   0/1     Running             0             1s    10.244.1.5   kind-worker2   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   0/1     Completed           0             2s    10.244.1.5   kind-worker2   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   0/1     Running             1 (2s ago)    3s    10.244.1.5   kind-worker2   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   0/1     Error               1 (4s ago)    5s    10.244.1.5   kind-worker2   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   0/1     CrashLoopBackOff    1 (6s ago)    10s   10.244.1.5   kind-worker2   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   0/1     Running             2 (20s ago)   24s   10.244.1.5   kind-worker2   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   0/1     Completed           2 (21s ago)   25s   10.244.1.5   kind-worker2   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   0/1     CrashLoopBackOff    2 (2s ago)    26s   10.244.1.5   kind-worker2   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   0/1     Running             3 (33s ago)   57s   10.244.1.5   kind-worker2   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   1/1     Running             3 (46s ago)   70s   10.244.1.5   kind-worker2   <none>           <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf   1/1     Terminating         3 (78s ago)   102s   10.244.1.5   kind-worker2   <none>           <none>
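To see why the container exits before turning ready, a first step is usually the previous container's logs and the recent events; a minimal sketch, reusing the component=frr-k8s-webhook-server label from the watch above:

 # logs of the last crashed container of the webhook pod
 kubectl -n metallb-system logs -l component=frr-k8s-webhook-server --previous
 # recent events (probe failures, exit codes, back-off reasons)
 kubectl -n metallb-system get events --sort-by=.lastTimestamp | grep frr-k8s-webhook-server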
DanielOsypenko commented 2 weeks ago

With the latest 4.16 MetalLB we get ImagePullBackOff on the `controller`, `speaker` and `frr-k8s` pods; the pods are only partially deployed, with 1/2 or 4/6 containers ready. It might be a related issue, but the outcome is worse.

oc get csv
NAME                                         DISPLAY                          VERSION               REPLACES                                     PHASE
ingress-node-firewall.v4.16.0-202409051837   Ingress Node Firewall Operator   4.16.0-202409051837   ingress-node-firewall.v4.16.0-202410011135   Succeeded
metallb-operator.v4.16.0-202410292005        MetalLB Operator                 4.16.0-202410292005   metallb-operator.v4.16.0-202410251707        Succeeded 
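To surface what the pull is actually failing on (image reference and registry response), describing the affected pods and grepping the events is usually enough. A minimal sketch; the label values below are assumptions based on the default MetalLB component labels:

 # image references and pull errors for the affected pods
 oc -n metallb-system describe pods -l 'component in (controller,speaker,frr-k8s)' | grep -E 'Image:|ImagePullBackOff|ErrImagePull'
 # cluster events around the failing pulls
 oc -n metallb-system get events --sort-by=.lastTimestamp | grep -i pull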

The webhook pod shows TLS handshake errors in its logs:

(*runnableGroup).reconcile.func1\n\t/metallb/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:223"}
2024/10/30 05:54:02 http: TLS handshake error from 10.130.0.41:48190: remote error: tls: bad certificate
2024/10/30 05:54:03 http: TLS handshake error from 10.130.0.41:48200: remote error: tls: bad certificate
2024/10/30 05:54:05 http: TLS handshake error from 10.130.0.41:48206: remote error: tls: bad certificate
2024/10/30 05:54:05 http: TLS handshake error from 10.130.0.41:48208: remote error: tls: bad certificate
2024/10/30 05:54:06 http: TLS handshake error from 10.130.0.41:48218: remote error: tls: bad certificate
2024/10/30 05:54:08 http: TLS handshake error from 10.130.0.41:58904: remote error: tls: bad certificate
2024/10/30 05:54:08 http: TLS handshake error from 10.130.0.41:58916: remote error: tls: bad certificate
2024/10/30 05:54:09 http: TLS handshake error from 10.130.0.41:58918: remote error: tls: bad certificate
2024/10/30 05:54:11 http: TLS handshake error from 10.130.0.41:58928: remote error: tls: bad certificate
2024/10/30 05:54:14 http: TLS handshake error from 10.130.0.41:58940: remote error: tls: bad certificate
2024/10/30 05:54:15 http: TLS handshake error from 10.130.0.41:58954: remote error: tls: bad certificate
2024/10/30 05:54:17 http: TLS handshake error from 10.130.0.41:58964: remote error: tls: bad certificate
2024/10/30 05:54:17 http: TLS handshake error from 10.130.0.41:58978: remote error: tls: bad certificate
2024/10/30 05:54:18 http: TLS handshake error from 10.130.0.41:44000: remote error: tls: bad certificate
2024/10/30 05:54:20 http: TLS handshake error from 10.130.0.41:44014: remote error: tls: bad certificate
2024/10/30 05:54:23 http: TLS handshake error from 10.130.0.41:44024: remote error: tls: bad certificate
2024/10/30 05:54:24 http: TLS handshake error from 10.130.0.41:44038: remote error: tls: bad certificate
2024/10/30 05:54:26 http: TLS handshake error from 10.130.0.41:44052: remote error: tls: bad certificate
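The remote error: tls: bad certificate alert means the caller (10.130.0.41, presumably the kube-apiserver) is rejecting the webhook's serving certificate, which typically points at a stale caBundle on the webhook configuration. A minimal way to compare the two; the object and secret names below are placeholders and must be taken from the cluster:

 # webhook configurations installed for frr-k8s / metallb
 oc get validatingwebhookconfigurations | grep -iE 'frr-k8s|metallb'
 # CA the apiserver uses to verify the webhook (<name> is a placeholder)
 oc get validatingwebhookconfiguration <name> -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d | openssl x509 -noout -subject -enddate
 # serving certificate the webhook pod actually presents (<secret> is a placeholder)
 oc -n metallb-system get secret <secret> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -issuer -enddate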

The hosting cluster lacks these services:

frr-k8s-monitor-service 
frr-k8s-webhook-service 
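A quick way to confirm whether the services were created at all, assuming the default metallb-system namespace:

 oc -n metallb-system get svc | grep frr-k8s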

Hosted KubeVirt clusters fail to pull images and deploy operators, showing a DeadlineExceeded error.


Another cluster, running the latest 4.17 version, has the same ImagePullBackOff errors on controller, speaker and frr-k8s, but it seems to be working as expected.

 oc get csv
NAME                                         DISPLAY                          VERSION               REPLACES                                     PHASE
ingress-node-firewall.v4.17.0-202410011205   Ingress Node Firewall Operator   4.17.0-202410011205   ingress-node-firewall.v4.17.0-202410211206   Succeeded
metallb-operator.v4.17.0-202410241236        MetalLB Operator                 4.17.0-202410241236   
fedepaol commented 2 weeks ago

@DanielOsypenko there's no 4.16 version. This is the community version of the operator. If this is happening on OpenShift, I suggest following up through Red Hat channels.