submariner-io / lighthouse

DNS service discovery across connected Kubernetes clusters.
https://submariner-io.github.io/architecture/service-discovery/
Apache License 2.0

Created EndpointSlices miss service name label and make metallb crash #881

Closed goncalopcarvalho closed 2 years ago

goncalopcarvalho commented 2 years ago

What happened: The EndpointSlice resources created by Submariner make the MetalLB speaker pods crash at runtime. This is because the EndpointSlices are missing a service name label; it can be fixed manually by adding the label to the EndpointSlice in question. All the verifications mentioned in the documentation succeeded.
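
For illustration only, here is a minimal client-go sketch of that manual workaround (a kubectl label one-liner achieves the same thing). The namespace and EndpointSlice name are taken from the panic messages further down; the service name value nginx-demo is an assumption.

package main

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Load the kubeconfig of the affected cluster (cluster-c in this report).
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // Merge-patch the missing label onto the EndpointSlice created by Lighthouse.
    // "nginx-demo" is an assumed service name; use the name of the exported service.
    patch := []byte(`{"metadata":{"labels":{"kubernetes.io/service-name":"nginx-demo"}}}`)
    _, err = client.DiscoveryV1().EndpointSlices("e2e-tests-discovery-2nf4z").Patch(
        context.TODO(), "nginx-demo-cluster-a", types.MergePatchType, patch, metav1.PatchOptions{})
    if err != nil {
        panic(err)
    }
}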

What you expected to happen: Running the subctl verify command, for example, makes the MetalLB speaker pods crash during the tests. I expect this not to happen; the crash is caused by the created EndpointSlice resources missing a kubernetes.io/service-name label. The following logs are from a MetalLB speaker pod after the crash.

E0810 16:44:07.664127       1 runtime.go:78] Observed a panic: &errors.errorString{s:"unable to calculate an index entry for key \"e2e-tests-discovery-2nf4z/nginx-demo-cluster-a\" on index \"ServiceName\": endpointSlice missing kubernetes.io/service-name label"} (unable to calculate an index entry for key "e2e-tests-discovery-2nf4z/nginx-demo-cluster-a" on index "ServiceName": endpointSlice missing kubernetes.io/service-name label)
goroutine 98 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1654c40, 0xc0006185a0)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/runtime/runtime.go:74 +0x95
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/runtime/runtime.go:48 +0x86
panic(0x1654c40, 0xc0006185a0)
        /usr/local/go/src/runtime/panic.go:965 +0x1b9
k8s.io/client-go/tools/cache.(*threadSafeMap).updateIndices(0xc0002f3740, 0x0, 0x0, 0x18277e0, 0xc0006d0580, 0xc0005aef60, 0x2e)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/thread_safe_store.go:264 +0x4bd
k8s.io/client-go/tools/cache.(*threadSafeMap).Add(0xc0002f3740, 0xc0005aef60, 0x2e, 0x18277e0, 0xc0006d0580)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/thread_safe_store.go:78 +0x145
k8s.io/client-go/tools/cache.(*cache).Add(0xc00009e060, 0x18277e0, 0xc0006d0580, 0x0, 0x0)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/store.go:150 +0x105
k8s.io/client-go/tools/cache.newInformer.func1(0x168fbc0, 0xc0006abf20, 0x1, 0xc0006abf20)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/controller.go:404 +0x15b
k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop(0xc000371220, 0xc0002f37d0, 0x0, 0x0, 0x0, 0x0)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/delta_fifo.go:507 +0x322
k8s.io/client-go/tools/cache.(*controller).processLoop(0xc0001d22d0)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/controller.go:183 +0x42
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000585f90)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000585f90, 0x1a1af20, 0xc000356000, 0xc0000ce001, 0xc00007e420)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000585f90, 0x3b9aca00, 0x0, 0xc00032a401, 0xc00007e420)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(...)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*controller).Run(0xc0001d22d0, 0xc00007e420)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/controller.go:154 +0x2e5
created by go.universe.tf/metallb/internal/k8s.(*Client).Run
        /home/circleci/project/internal/k8s/k8s.go:416 +0x395
E0810 16:44:07.664530       1 runtime.go:78] Observed a panic: &errors.errorString{s:"unable to calculate an index entry for key \"e2e-tests-discovery-2nf4z/nginx-demo-cluster-a\" on index \"ServiceName\": endpointSlice missing kubernetes.io/service-name label"} (unable to calculate an index entry for key "e2e-tests-discovery-2nf4z/nginx-demo-cluster-a" on index "ServiceName": endpointSlice missing kubernetes.io/service-name label)
goroutine 98 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1654c40, 0xc0006185a0)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/runtime/runtime.go:74 +0x95
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/runtime/runtime.go:48 +0x86
panic(0x1654c40, 0xc0006185a0)
        /usr/local/go/src/runtime/panic.go:965 +0x1b9
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/runtime/runtime.go:55 +0x109
panic(0x1654c40, 0xc0006185a0)
        /usr/local/go/src/runtime/panic.go:965 +0x1b9
k8s.io/client-go/tools/cache.(*threadSafeMap).updateIndices(0xc0002f3740, 0x0, 0x0, 0x18277e0, 0xc0006d0580, 0xc0005aef60, 0x2e)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/thread_safe_store.go:264 +0x4bd
k8s.io/client-go/tools/cache.(*threadSafeMap).Add(0xc0002f3740, 0xc0005aef60, 0x2e, 0x18277e0, 0xc0006d0580)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/thread_safe_store.go:78 +0x145
k8s.io/client-go/tools/cache.(*cache).Add(0xc00009e060, 0x18277e0, 0xc0006d0580, 0x0, 0x0)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/store.go:150 +0x105
k8s.io/client-go/tools/cache.newInformer.func1(0x168fbc0, 0xc0006abf20, 0x1, 0xc0006abf20)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/controller.go:404 +0x15b
k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop(0xc000371220, 0xc0002f37d0, 0x0, 0x0, 0x0, 0x0)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/delta_fifo.go:507 +0x322
k8s.io/client-go/tools/cache.(*controller).processLoop(0xc0001d22d0)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/controller.go:183 +0x42
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000585f90)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0006f5f90, 0x1a1af20, 0xc000356000, 0xc0000ce001, 0xc00007e420)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000585f90, 0x3b9aca00, 0x0, 0xc00032a401, 0xc00007e420)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(...)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*controller).Run(0xc0001d22d0, 0xc00007e420)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/controller.go:154 +0x2e5
created by go.universe.tf/metallb/internal/k8s.(*Client).Run
        /home/circleci/project/internal/k8s/k8s.go:416 +0x395
panic: unable to calculate an index entry for key "e2e-tests-discovery-2nf4z/nginx-demo-cluster-a" on index "ServiceName": endpointSlice missing kubernetes.io/service-name label [recovered]
        panic: unable to calculate an index entry for key "e2e-tests-discovery-2nf4z/nginx-demo-cluster-a" on index "ServiceName": endpointSlice missing kubernetes.io/service-name label [recovered]
        panic: unable to calculate an index entry for key "e2e-tests-discovery-2nf4z/nginx-demo-cluster-a" on index "ServiceName": endpointSlice missing kubernetes.io/service-name label

goroutine 98 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/runtime/runtime.go:55 +0x109
panic(0x1654c40, 0xc0006185a0)
        /usr/local/go/src/runtime/panic.go:965 +0x1b9
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/runtime/runtime.go:55 +0x109
panic(0x1654c40, 0xc0006185a0)
        /usr/local/go/src/runtime/panic.go:965 +0x1b9
k8s.io/client-go/tools/cache.(*threadSafeMap).updateIndices(0xc0002f3740, 0x0, 0x0, 0x18277e0, 0xc0006d0580, 0xc0005aef60, 0x2e)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/thread_safe_store.go:264 +0x4bd
k8s.io/client-go/tools/cache.(*threadSafeMap).Add(0xc0002f3740, 0xc0005aef60, 0x2e, 0x18277e0, 0xc0006d0580)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/thread_safe_store.go:78 +0x145
k8s.io/client-go/tools/cache.(*cache).Add(0xc00009e060, 0x18277e0, 0xc0006d0580, 0x0, 0x0)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/store.go:150 +0x105
k8s.io/client-go/tools/cache.newInformer.func1(0x168fbc0, 0xc0006abf20, 0x1, 0xc0006abf20)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/controller.go:404 +0x15b
k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop(0xc000371220, 0xc0002f37d0, 0x0, 0x0, 0x0, 0x0)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/delta_fifo.go:507 +0x322
k8s.io/client-go/tools/cache.(*controller).processLoop(0xc0001d22d0)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/controller.go:183 +0x42
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000585f90)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0006f5f90, 0x1a1af20, 0xc000356000, 0xc0000ce001, 0xc00007e420)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000585f90, 0x3b9aca00, 0x0, 0xc00032a401, 0xc00007e420)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(...)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*controller).Run(0xc0001d22d0, 0xc00007e420)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/controller.go:154 +0x2e5
created by go.universe.tf/metallb/internal/k8s.(*Client).Run
        /home/circleci/project/internal/k8s/k8s.go:416 +0x395

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?: This same issue has been mentioned in https://github.com/metallb/metallb/issues/1175 and in https://github.com/submariner-io/submariner/issues/1869.

Environment: We're using Submariner to connect three clusters: two K3s single-node clusters and one kubeadm single-node cluster. We refer to the K3s clusters as cluster-a and cluster-b, and to the kubeadm cluster as cluster-c. cluster-a is where the Submariner broker is deployed. All the problems mentioned in this post occur on cluster-c.

Skipping inter-cluster firewall check as it requires two kubeconfigs. Please run "subctl diagnose firewall inter-cluster" command manually.

nyechiel commented 2 years ago

@yboaron did you have a chance to look into this?

yboaron commented 2 years ago

@yboaron did you have a chance to look into this?

Not yet, it's on my to-do list, hope to get to it in the next 1-2 weeks.

yboaron commented 2 years ago

Hi @Prophetick, I tried to reproduce this issue in a Kind environment but was unable to reproduce it (no MetalLB speaker pod restarts).

To deploy Submariner (latest devel) + MetalLB (version 0.13.5) on Kind, I downloaded the latest https://github.com/submariner-io/submariner-operator and ran: make deploy using=load-balancer,lighthouse

I ran both the Submariner e2e tests (subctl verify) and a manual verification. Although the EndpointSlice resources generated by Submariner Lighthouse [1] do not include the kubernetes.io/service-name label, only the multicluster.kubernetes.io/service-name label, the MetalLB speaker pods didn't crash or restart.

Is the problem still relevant in the latest stable version of MetalLB? If so, could you please describe how to reproduce it?

[1]

$ kubectl --kubeconfig output/kubeconfigs/kind-config-cluster2 get endpointslice nginx-cluster2 -o yaml
addressType: IPv4
apiVersion: discovery.k8s.io/v1
endpoints:

goncalopcarvalho commented 2 years ago

Hi all, thanks for the reply.

When I opened this issue I was using Submariner 0.12.2 and MetalLB 0.12.1.

I can confirm that upgrading Submariner to version 0.13.1 and MetalLB to version 0.13.5 solved the problem for me; I can no longer reproduce this.

nyechiel commented 2 years ago

Closing as per the last comment.

Woytek-Polnik commented 1 year ago

This does not seem to have actually been solved by Submariner; rather, newer MetalLB silently ignores it. With Submariner 0.15.2 and OCP MetalLB 0.10 it does not seem to be solved. The latest OCP MetalLB is 0.11.

vthapar commented 1 year ago

@Woytek-Polnik EndpointSlices created by Lighthouse are required to have the multicluster.kubernetes.io/service-name label, as per KEP-1645. kubernetes.io/service-name is for EndpointSlices created by Kubernetes for local services. The different labels distinguish local EndpointSlices from exported services' EndpointSlices.

This looks like a bug in MetalLB: it needs to be multicluster aware. The assumption that all EndpointSlices must have kubernetes.io/service-name is incorrect. If MetalLB is only interested in local services and their EndpointSlices, it should ignore EndpointSlices with the multicluster label. If it is interested in multicluster EndpointSlices, it should honor the multicluster label and use that.
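
For illustration only (this is not MetalLB's actual code), one way a consumer could tolerate the Lighthouse-created slices is a ServiceName index function that skips EndpointSlices carrying the multicluster label, or lacking kubernetes.io/service-name, instead of returning an error; it is such an error that client-go's thread-safe store turns into the panic shown in the logs above.

package main

import (
    "fmt"

    discoveryv1 "k8s.io/api/discovery/v1"
    "k8s.io/client-go/tools/cache"
)

const mcsServiceNameLabel = "multicluster.kubernetes.io/service-name"

// serviceNameIndexFunc indexes EndpointSlices by namespace/service-name, but
// silently skips slices that belong to the multicluster (KEP-1645) world or
// that have no kubernetes.io/service-name label, instead of returning an error
// that the informer's store would turn into a panic.
func serviceNameIndexFunc(obj interface{}) ([]string, error) {
    slice, ok := obj.(*discoveryv1.EndpointSlice)
    if !ok {
        return nil, fmt.Errorf("expected *discoveryv1.EndpointSlice, got %T", obj)
    }

    if _, isMCS := slice.Labels[mcsServiceNameLabel]; isMCS {
        return nil, nil // exported/imported service slice: not ours to index
    }

    name, ok := slice.Labels[discoveryv1.LabelServiceName]
    if !ok {
        return nil, nil // unlabeled slice: ignore rather than error out
    }
    return []string{slice.Namespace + "/" + name}, nil
}

func main() {
    indexer := cache.NewIndexer(cache.MetaNamespaceKeyFunc, cache.Indexers{
        "ServiceName": serviceNameIndexFunc,
    })
    _ = indexer // in real code this backs the EndpointSlice informer's store
}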

Woytek-Polnik commented 1 year ago

@vthapar Thank you for the extra context. It totally makes sense to me to just ignore them. In that case, I'll have to push for the MetalLB operator in OCP to be upgraded ⬆️

fedepaol commented 1 year ago

I landed on this thread by chance following the comment to https://github.com/metallb/metallb/issues/1175

MetalLB is not multicluster aware, and crashing in that scenario is certainly not the solution. We can make it more robust (and, incidentally, that may already have been done). I am not sure what the behaviour should be, because I need to read the KEP and think through the related CNI implications (i.e. what happens if traffic directed to a LB lands on a node that belongs to a different cluster than the one the service is defined on, whether the service is mirrored to all the clusters, etc.). So I'd split the fix in two: avoid crashing (if we still do) and ignore those EndpointSlices as a bug fix, and handle the multicluster scenario as an enhancement (if it makes sense).
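
As a sketch of the "ignore those EndpointSlices" option (my assumption of a possible approach, not MetalLB's implementation), the EndpointSlice list/watch could be restricted to objects that carry the kubernetes.io/service-name label, so Lighthouse-created multicluster slices never reach the cache or its indexers:

package main

import (
    discoveryv1 "k8s.io/api/discovery/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // A bare label key in a selector matches objects that have the label with
    // any value, so only EndpointSlices created by Kubernetes for local
    // Services are listed and watched; Lighthouse slices (which carry only
    // multicluster.kubernetes.io/service-name) are filtered out server-side.
    factory := informers.NewSharedInformerFactoryWithOptions(client, 0,
        informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
            opts.LabelSelector = discoveryv1.LabelServiceName
        }))

    informer := factory.Discovery().V1().EndpointSlices().Informer()
    _ = informer // in real code, register event handlers and start the factory
}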

This is from the community MetalLB point of view. @Woytek-Polnik, if you are eligible for support on OpenShift, please reach out to RH through the proper channels.