metallb / metallb-operator

MetalLB Operator for deploying metallb
Apache License 2.0

The resource daemonset.apps/speaker went down after upgrading the operator and metallb with the manifest v0.13.11 #379

Open elevesque-sfr opened 1 year ago

elevesque-sfr commented 1 year ago

MetalLB Version: operator v0.13.11, metallb v0.13.10

OS: Talos 1.3.7, Kubernetes: 1.24.9, CNI: Cilium 1.12.4

After upgrading from operator v0.13.4/metallb v0.13.5 to operator v0.13.10/metallb v0.13.11, the resource daemonset.apps/speaker went down and restarted after a few minutes.

[eric@macross ~]$ kubectl get all
NAME                                                      READY   STATUS             RESTARTS         AGE
pod/controller-db6f6ff7d-zjfcr                            1/1     Running            0                70s
pod/metallb-operator-controller-manager-6fd4d656f-tx2hj   1/1     Running            0                15m
pod/metallb-operator-webhook-server-588bbdf874-g2jsd      1/1     Running            0                2m53s
pod/speaker-2tvk6                                         0/1     CrashLoopBackOff   33 (3m3s ago)    3h36m
pod/speaker-5v2sp                                         0/1     CrashLoopBackOff   33 (2m18s ago)   3h36m
pod/speaker-p7spx                                         0/1     CrashLoopBackOff   33 (3m59s ago)   20h
pod/speaker-wrs8n                                         0/1     CrashLoopBackOff   33 (3m59s ago)   3h37m
pod/speaker-xfj7v                                         0/1     CrashLoopBackOff   33 (3m32s ago)   3h36m

Looking at the logs of one of the pods, errors on get and watch of configmaps appear, and the speaker pod goes down.

W0825 11:41:31.682290       1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:metallb-system:speaker" cannot list resource "configmaps" in API group "" in the namespace "metallb-system"
E0825 11:41:31.682339       1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:metallb-system:speaker" cannot list resource "configmaps" in API group "" in the namespace "metallb-system"
W0825 11:41:33.520445       1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:metallb-system:speaker" cannot list resource "configmaps" in API group "" in the namespace "metallb-system"
E0825 11:41:33.520473       1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:metallb-system:speaker" cannot list resource "configmaps" in API group "" in the namespace "metallb-system"
W0825 11:41:39.101431       1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:metallb-system:speaker" cannot list resource "configmaps" in API group "" in the namespace "metallb-system"
E0825 11:41:39.101463       1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:metallb-system:speaker" cannot list resource "configmaps" in API group "" in the namespace "metallb-system"
W0825 11:41:46.581417       1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:metallb-system:speaker" cannot list resource "configmaps" in API group "" in the namespace "metallb-system"
E0825 11:41:46.581469       1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:metallb-system:speaker" cannot list resource "configmaps" in API group "" in the namespace "metallb-system"
W0825 11:42:03.218915       1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:metallb-system:speaker" cannot list resource "configmaps" in API group "" in the namespace "metallb-system"
E0825 11:42:03.219009       1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:metallb-system:speaker" cannot list resource "configmaps" in API group "" in the namespace "metallb-system"
[...]
W0825 11:42:37.744778       1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:metallb-system:speaker" cannot list resource "configmaps" in API group "" in the namespace "metallb-system"
E0825 11:42:37.744806       1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.26.4/tools/cache/reflector.go:169: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:metallb-system:speaker" cannot list resource "configmaps" in API group "" in the namespace "metallb-system"
{"level":"error","ts":"2023-08-25T11:43:30Z","msg":"Could not wait for Cache to sync","controller":"node","controllerGroup":"","controllerKind":"Node","error":"failed to wait for node caches to sync: timed out waiting for cache to be synced","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:211\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:242\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/manager/runnable_group.go:219"}
{"level":"info","ts":"2023-08-25T11:43:30Z","msg":"Stopping and waiting for non leader election runnables"}
{"level":"error","ts":"2023-08-25T11:43:30Z","msg":"Could not wait for Cache to sync","controller":"service","controllerGroup":"","controllerKind":"Service","error":"failed to wait for service caches to sync: timed out waiting for cache to be synced","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:211\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:242\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/manager/runnable_group.go:219"}
{"level":"error","ts":"2023-08-25T11:43:30Z","msg":"Could not wait for Cache to sync","controller":"bgppeer","controllerGroup":"metallb.io","controllerKind":"BGPPeer","error":"failed to wait for bgppeer caches to sync: timed out waiting for cache to be synced","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:211\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:242\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/manager/runnable_group.go:219"}
{"level":"info","ts":"2023-08-25T11:43:30Z","msg":"Stopping and waiting for leader election runnables"}
{"level":"error","ts":"2023-08-25T11:43:30Z","msg":"error received after stop sequence was engaged","error":"failed to wait for service caches to sync: timed out waiting for cache to be synced","stacktrace":"sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/manager/internal.go:555"}
{"level":"error","ts":"2023-08-25T11:43:30Z","msg":"error received after stop sequence was engaged","error":"failed to wait for bgppeer caches to sync: timed out waiting for cache to be synced","stacktrace":"sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/manager/internal.go:555"}
{"level":"info","ts":"2023-08-25T11:43:30Z","msg":"Stopping and waiting for caches"}
{"level":"error","ts":"2023-08-25T11:43:30Z","logger":"controller-runtime.source","msg":"failed to get informer from cache","error":"Timeout: failed waiting for *v1.ConfigMap Informer to sync","stacktrace":"sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/source/source.go:148\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.26.0/pkg/util/wait/wait.go:235\nk8s.io/apimachinery/pkg/util/wait.poll\n\t/go/pkg/mod/k8s.io/apimachinery@v0.26.0/pkg/util/wait/wait.go:582\nk8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.26.0/pkg/util/wait/wait.go:547\nsigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/source/source.go:136"}
{"level":"info","ts":"2023-08-25T11:43:30Z","msg":"Stopping and waiting for webhooks"}
{"level":"info","ts":"2023-08-25T11:43:30Z","msg":"Wait completed, proceeding to shutdown the manager"}
{"caller":"main.go:201","error":"failed to wait for node caches to sync: timed out waiting for cache to be synced","level":"error","msg":"failed to run k8s client","op":"startup","ts":"2023-08-25T11:43:30Z"}

Initial installation and upgrade were both done using the manifest.
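
For reference, the upgrade was nothing more than re-applying the operator manifest (a sketch, using the file name that appears in the diff further down):

kubectl apply -f metallb-operator-0.13.10.yaml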

As a workaround, we added to the clusterrole metallb-system:speaker the authorization to get/list/watch the configmaps resource (a scripted equivalent is sketched after the role below).

[eric@macross ~]$ kubectl get clusterrole metallb-system:speaker -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRole","metadata":{"annotations":{},"labels":{"app":"metallb"},"name":"metallb-system:speaker"},"rules":[{"apiGroups":[""],"resources":["services","endpoints","nodes","namespaces"],"verbs":["get","list","watch"]},{"apiGroups":["discovery.k8s.io"],"resources":["endpointslices"],"verbs":["get","list","watch"]},{"apiGroups":[""],"resources":["events"],"verbs":["create","patch"]},{"apiGroups":["policy"],"resourceNames":["speaker"],"resources":["podsecuritypolicies"],"verbs":["use"]}]}
  creationTimestamp: "2022-09-13T07:16:45Z"
  labels:
    app: metallb
  name: metallb-system:speaker
  resourceVersion: "132426474"
  uid: 12d48a2c-8274-49f7-8e51-aed128a7b112
rules:
- apiGroups:
  - ""
  resources:
  - services
  - endpoints
  - nodes
  - namespaces
  - configmaps
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - patch
- apiGroups:
  - policy
  resourceNames:
  - speaker
  resources:
  - podsecuritypolicies
  verbs:
  - use
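
The same edit can be scripted instead of done interactively (a minimal sketch, assuming the core-API-group rule is the first entry in the rules list, as it is above), followed by a restart of the speakers; if the operator later reconciles the role back, the patch has to be re-applied:

kubectl patch clusterrole metallb-system:speaker --type=json \
  -p='[{"op":"add","path":"/rules/0/resources/-","value":"configmaps"}]'
kubectl -n metallb-system rollout restart daemonset/speaker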

After this modification and a full restart, everything is now working perfectly.

[eric@macross ~]$ kubectl get po -o wide -w
NAME                                                  READY   STATUS    RESTARTS   AGE   IP             NODE           NOMINATED NODE   READINESS GATES
controller-db6f6ff7d-zjfcr                            1/1     Running   0          24m   10.19.3.207    kw905-vso-pr   <none>           <none>
metallb-operator-controller-manager-6fd4d656f-tx2hj   1/1     Running   0          39m   10.19.3.131    kw905-vso-pr   <none>           <none>
metallb-operator-webhook-server-588bbdf874-g2jsd      1/1     Running   0          26m   10.19.3.208    kw905-vso-pr   <none>           <none>
speaker-5vqsf                                         1/1     Running   0          15m   10.4.205.104   kw902-vso-pr   <none>           <none>
speaker-8jjhv                                         1/1     Running   0          14m   10.4.205.103   kw901-vso-pr   <none>           <none>
speaker-jlz9b                                         1/1     Running   0          15m   10.4.205.107   kw905-vso-pr   <none>           <none>
speaker-jtcxx                                         1/1     Running   0          15m   10.4.205.106   kw904-vso-pr   <none>           <none>
speaker-nlwxq                                         1/1     Running   0          15m   10.4.205.105   kw903-vso-pr   <none>           <none>
[eric@macross ~]$ kubectl logs speaker-jtcxx
[...]
{"level":"info","ts":"2023-08-25T11:47:09Z","msg":"Starting workers","controller":"service","controllerGroup":"","controllerKind":"Service","worker count":1}
{"caller":"service_controller_reload.go:61","controller":"ServiceReconciler - reprocessAll","level":"info","start reconcile":"metallbreload/reload","ts":"2023-08-25T11:47:09Z"}
{"level":"info","ts":"2023-08-25T11:47:09Z","msg":"Starting workers","controller":"node","controllerGroup":"","controllerKind":"Node","worker count":1}
{"level":"info","ts":"2023-08-25T11:47:09Z","msg":"Starting workers","controller":"bgppeer","controllerGroup":"metallb.io","controllerKind":"BGPPeer","worker count":1}
{"caller":"node_controller.go:46","controller":"NodeReconciler","level":"info","start reconcile":"/km901-vso-pr","ts":"2023-08-25T11:47:09Z"}
{"caller":"config_controller.go:59","controller":"ConfigReconciler","level":"info","start reconcile":"/kw905-vso-pr","ts":"2023-08-25T11:47:09Z"}
{"caller":"node_controller.go:69","controller":"NodeReconciler","end reconcile":"/km901-vso-pr","level":"info","ts":"2023-08-25T11:47:09Z"}
[...]
{"caller":"config_controller.go:59","controller":"ConfigReconciler","level":"info","start reconcile":"/km902-vso-pr","ts":"2023-08-25T11:47:09Z"}
{"caller":"speakerlist.go:310","level":"info","msg":"node event - forcing sync","node addr":"10.4.205.105","node event":"NodeJoin","node name":"kw903-vso-pr","ts":"2023-08-25T11:47:09Z"}
{"caller":"main.go:374","event":"serviceAnnounced","ips":["10.4.207.211"],"level":"info","msg":"service has IP, announcing","pool":"vip-pool","protocol":"layer2","ts":"2023-08-25T11:47:09Z"}
{"caller":"service_controller_reload.go:104","controller":"ServiceReconciler - reprocessAll","end reconcile":"metallbreload/reload","level":"info","ts":"2023-08-25T11:47:09Z"}
[...]
{"caller":"speakerlist.go:310","level":"info","msg":"node event - forcing sync","node addr":"10.4.205.103","node event":"NodeJoin","node name":"kw901-vso-pr","ts":"2023-08-25T11:47:40Z"}
{"caller":"service_controller_reload.go:61","controller":"ServiceReconciler - reprocessAll","level":"info","start reconcile":"metallbreload/reload","ts":"2023-08-25T11:47:40Z"}
{"caller":"main.go:418","event":"serviceWithdrawn","ip":["10.4.207.209"],"ips":["10.4.207.209"],"level":"info","msg":"withdrawing service announcement","pool":"vip-pool","protocol":"layer2","reason":"notOwner","ts":"2023-08-25T11:47:40Z"}
{"caller":"main.go:374","event":"serviceAnnounced","ips":["10.4.207.211"],"level":"info","msg":"service has IP, announcing","pool":"vip-pool","protocol":"layer2","ts":"2023-08-25T11:47:40Z"}
{"caller":"service_controller_reload.go:104","controller":"ServiceReconciler - reprocessAll","end reconcile":"metallbreload/reload","level":"info","ts":"2023-08-25T11:47:40Z"}
[eric@macross ~]$ curl -Is http://argocd.tooling-nms-preprod.valentine.sfr.com/ | head -n 1
HTTP/1.1 200 OK

The diff between the original manifest and the one we used for the upgrade:

[eric@macross metallb]$ diff metallb-operator.yaml metallb-operator-0.13.10.yaml 
3587c3587
<           value: quay.io/metallb/speaker:v0.13.9
---
>           value: quay.io/metallb/speaker:v0.13.10
3589c3589
<           value: quay.io/metallb/controller:v0.13.9
---
>           value: quay.io/metallb/controller:v0.13.10
3664c3664
<         image: quay.io/metallb/controller:v0.13.9
---
>         image: quay.io/metallb/controller:v0.13.10
4212a4213
>   - configmaps
xeonkeeper commented 1 year ago

I got the same issue, but after installing the cluster with kubespray from scratch and specifying MetalLB version 0.13.11. Fixed with your solution (but also, I had an error with get nodes, so I added "nodes" too)! Thanks
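
For that variant, the patched rule would simply carry both resources (a sketch of the resulting rule, assuming the same metallb-system:speaker ClusterRole layout shown above):

- apiGroups: [""]
  resources: ["services", "endpoints", "nodes", "namespaces", "configmaps"]
  verbs: ["get", "list", "watch"]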