submariner-io / lighthouse

DNS service discovery across connected Kubernetes clusters.
https://submariner-io.github.io/architecture/service-discovery/
Apache License 2.0

Created EndpointSlices miss service name label and make metallb crash #881

Closed goncalopcarvalho closed 2 years ago

goncalopcarvalho commented 2 years ago

What happened: The EndpointSlice resources created by Submariner make the MetalLB speaker pods crash at runtime. This is because the EndpointSlices are missing a service name label; it can be fixed manually by adding the label to the EndpointSlice in question. All the verifications mentioned in the documentation succeeded.
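
For illustration only, here is a minimal client-go sketch of that manual workaround (a kubectl label one-liner achieves the same thing). The namespace and EndpointSlice name are taken from the panic messages further down; the service name value nginx-demo is an assumption.

package main

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Load the kubeconfig of the affected cluster (cluster-c in this report).
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // Merge-patch the missing label onto the EndpointSlice created by Lighthouse.
    // "nginx-demo" is an assumed service name; use the name of the exported service.
    patch := []byte(`{"metadata":{"labels":{"kubernetes.io/service-name":"nginx-demo"}}}`)
    _, err = client.DiscoveryV1().EndpointSlices("e2e-tests-discovery-2nf4z").Patch(
        context.TODO(), "nginx-demo-cluster-a", types.MergePatchType, patch, metav1.PatchOptions{})
    if err != nil {
        panic(err)
    }
}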

What you expected to happen: Running the subctl verify command, for example, makes the MetalLB speaker pods crash during the tests. I expect this not to happen; the crash is caused by the created EndpointSlice resources missing a kubernetes.io/service-name label. The following logs are from a MetalLB speaker pod after the crash.

E0810 16:44:07.664127       1 runtime.go:78] Observed a panic: &errors.errorString{s:"unable to calculate an index entry for key \"e2e-tests-discovery-2nf4z/nginx-demo-cluster-a\" on index \"ServiceName\": endpointSlice missing kubernetes.io/service-name label"} (unable to calculate an index entry for key "e2e-tests-discovery-2nf4z/nginx-demo-cluster-a" on index "ServiceName": endpointSlice missing kubernetes.io/service-name label)
goroutine 98 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1654c40, 0xc0006185a0)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/runtime/runtime.go:74 +0x95
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/runtime/runtime.go:48 +0x86
panic(0x1654c40, 0xc0006185a0)
        /usr/local/go/src/runtime/panic.go:965 +0x1b9
k8s.io/client-go/tools/cache.(*threadSafeMap).updateIndices(0xc0002f3740, 0x0, 0x0, 0x18277e0, 0xc0006d0580, 0xc0005aef60, 0x2e)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/thread_safe_store.go:264 +0x4bd
k8s.io/client-go/tools/cache.(*threadSafeMap).Add(0xc0002f3740, 0xc0005aef60, 0x2e, 0x18277e0, 0xc0006d0580)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/thread_safe_store.go:78 +0x145
k8s.io/client-go/tools/cache.(*cache).Add(0xc00009e060, 0x18277e0, 0xc0006d0580, 0x0, 0x0)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/store.go:150 +0x105
k8s.io/client-go/tools/cache.newInformer.func1(0x168fbc0, 0xc0006abf20, 0x1, 0xc0006abf20)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/controller.go:404 +0x15b
k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop(0xc000371220, 0xc0002f37d0, 0x0, 0x0, 0x0, 0x0)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/delta_fifo.go:507 +0x322
k8s.io/client-go/tools/cache.(*controller).processLoop(0xc0001d22d0)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/controller.go:183 +0x42
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000585f90)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000585f90, 0x1a1af20, 0xc000356000, 0xc0000ce001, 0xc00007e420)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000585f90, 0x3b9aca00, 0x0, 0xc00032a401, 0xc00007e420)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(...)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*controller).Run(0xc0001d22d0, 0xc00007e420)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/controller.go:154 +0x2e5
created by go.universe.tf/metallb/internal/k8s.(*Client).Run
        /home/circleci/project/internal/k8s/k8s.go:416 +0x395
E0810 16:44:07.664530       1 runtime.go:78] Observed a panic: &errors.errorString{s:"unable to calculate an index entry for key \"e2e-tests-discovery-2nf4z/nginx-demo-cluster-a\" on index \"ServiceName\": endpointSlice missing kubernetes.io/service-name label"} (unable to calculate an index entry for key "e2e-tests-discovery-2nf4z/nginx-demo-cluster-a" on index "ServiceName": endpointSlice missing kubernetes.io/service-name label)
goroutine 98 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1654c40, 0xc0006185a0)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/runtime/runtime.go:74 +0x95
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/runtime/runtime.go:48 +0x86
panic(0x1654c40, 0xc0006185a0)
        /usr/local/go/src/runtime/panic.go:965 +0x1b9
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/runtime/runtime.go:55 +0x109
panic(0x1654c40, 0xc0006185a0)
        /usr/local/go/src/runtime/panic.go:965 +0x1b9
k8s.io/client-go/tools/cache.(*threadSafeMap).updateIndices(0xc0002f3740, 0x0, 0x0, 0x18277e0, 0xc0006d0580, 0xc0005aef60, 0x2e)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/thread_safe_store.go:264 +0x4bd
k8s.io/client-go/tools/cache.(*threadSafeMap).Add(0xc0002f3740, 0xc0005aef60, 0x2e, 0x18277e0, 0xc0006d0580)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/thread_safe_store.go:78 +0x145
k8s.io/client-go/tools/cache.(*cache).Add(0xc00009e060, 0x18277e0, 0xc0006d0580, 0x0, 0x0)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/store.go:150 +0x105
k8s.io/client-go/tools/cache.newInformer.func1(0x168fbc0, 0xc0006abf20, 0x1, 0xc0006abf20)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/controller.go:404 +0x15b
k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop(0xc000371220, 0xc0002f37d0, 0x0, 0x0, 0x0, 0x0)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/delta_fifo.go:507 +0x322
k8s.io/client-go/tools/cache.(*controller).processLoop(0xc0001d22d0)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/controller.go:183 +0x42
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000585f90)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0006f5f90, 0x1a1af20, 0xc000356000, 0xc0000ce001, 0xc00007e420)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000585f90, 0x3b9aca00, 0x0, 0xc00032a401, 0xc00007e420)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(...)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*controller).Run(0xc0001d22d0, 0xc00007e420)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/controller.go:154 +0x2e5
created by go.universe.tf/metallb/internal/k8s.(*Client).Run
        /home/circleci/project/internal/k8s/k8s.go:416 +0x395
panic: unable to calculate an index entry for key "e2e-tests-discovery-2nf4z/nginx-demo-cluster-a" on index "ServiceName": endpointSlice missing kubernetes.io/service-name label [recovered]
        panic: unable to calculate an index entry for key "e2e-tests-discovery-2nf4z/nginx-demo-cluster-a" on index "ServiceName": endpointSlice missing kubernetes.io/service-name label [recovered]
        panic: unable to calculate an index entry for key "e2e-tests-discovery-2nf4z/nginx-demo-cluster-a" on index "ServiceName": endpointSlice missing kubernetes.io/service-name label

goroutine 98 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/runtime/runtime.go:55 +0x109
panic(0x1654c40, 0xc0006185a0)
        /usr/local/go/src/runtime/panic.go:965 +0x1b9
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/runtime/runtime.go:55 +0x109
panic(0x1654c40, 0xc0006185a0)
        /usr/local/go/src/runtime/panic.go:965 +0x1b9
k8s.io/client-go/tools/cache.(*threadSafeMap).updateIndices(0xc0002f3740, 0x0, 0x0, 0x18277e0, 0xc0006d0580, 0xc0005aef60, 0x2e)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/thread_safe_store.go:264 +0x4bd
k8s.io/client-go/tools/cache.(*threadSafeMap).Add(0xc0002f3740, 0xc0005aef60, 0x2e, 0x18277e0, 0xc0006d0580)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/thread_safe_store.go:78 +0x145
k8s.io/client-go/tools/cache.(*cache).Add(0xc00009e060, 0x18277e0, 0xc0006d0580, 0x0, 0x0)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/store.go:150 +0x105
k8s.io/client-go/tools/cache.newInformer.func1(0x168fbc0, 0xc0006abf20, 0x1, 0xc0006abf20)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/controller.go:404 +0x15b
k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop(0xc000371220, 0xc0002f37d0, 0x0, 0x0, 0x0, 0x0)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/delta_fifo.go:507 +0x322
k8s.io/client-go/tools/cache.(*controller).processLoop(0xc0001d22d0)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/controller.go:183 +0x42
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000585f90)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0006f5f90, 0x1a1af20, 0xc000356000, 0xc0000ce001, 0xc00007e420)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000585f90, 0x3b9aca00, 0x0, 0xc00032a401, 0xc00007e420)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(...)
        /home/circleci/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*controller).Run(0xc0001d22d0, 0xc00007e420)
        /home/circleci/go/pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/controller.go:154 +0x2e5
created by go.universe.tf/metallb/internal/k8s.(*Client).Run
        /home/circleci/project/internal/k8s/k8s.go:416 +0x395

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?: This same issue has been mentioned in https://github.com/metallb/metallb/issues/1175 and in https://github.com/submariner-io/submariner/issues/1869.

Environment: We're using Submariner to connect three clusters: two K3s single-node clusters and one kubeadm single-node cluster. We refer to the K3s clusters as cluster-a and cluster-b, and to the kubeadm cluster as cluster-c. cluster-a is where the Submariner broker is deployed. All the problems mentioned in this post occur on cluster-c.

Skipping inter-cluster firewall check as it requires two kubeconfigs. Please run "subctl diagnose firewall inter-cluster" command manually.

nyechiel commented 2 years ago

@yboaron did you have a chance to look into this?

yboaron commented 2 years ago

@yboaron did you have a chance to look into this?

Not yet, it's on my to-do list, hope to get to it in the next 1-2 weeks.

yboaron commented 2 years ago

Hi @Prophetick, I tried to reproduce this issue in a Kind environment but was unable to reproduce it (no MetalLB speaker pod restarts).

To deploy Submariner (latest devel) + MetalLB (version 0.13.5) on Kind, I downloaded the latest https://github.com/submariner-io/submariner-operator and ran: make deploy using=load-balancer,lighthouse

I ran both the Submariner e2e tests (subctl verify) and a manual verification. Although the EndpointSlice resources generated by Submariner Lighthouse [1] do not include the kubernetes.io/service-name label, only the multicluster.kubernetes.io/service-name label, the MetalLB speaker pods didn't crash or restart.

Is the problem still relevant in the latest stable version of MetalLB? If so, could you please describe how to reproduce it?

[1]

$ kubectl --kubeconfig output/kubeconfigs/kind-config-cluster2 get endpointslice nginx-cluster2 -o yaml
addressType: IPv4
apiVersion: discovery.k8s.io/v1
endpoints:

goncalopcarvalho commented 2 years ago

Hi all, thanks for the reply.

When I opened this issue I was using Submariner 0.12.2 and MetalLB 0.12.1.

I can confirm that upgrading Submariner to version 0.13.1 and MetalLB to version 0.13.5 solved the problem for me; I can no longer reproduce this.

nyechiel commented 2 years ago

Closing as per the last comment.

Woytek-Polnik commented 1 year ago

This does not seem to have actually been solved by Submariner; rather, newer MetalLB silently ignores it. With Submariner 0.15.2 and OCP MetalLB 0.10 it does not seem to be solved. The latest OCP MetalLB is 0.11.

vthapar commented 1 year ago

@Woytek-Polnik EndpointSlices created by Lighthouse are required to have the multicluster.kubernetes.io/service-name label, as per KEP-1645. kubernetes.io/service-name is for EndpointSlices created by Kubernetes for local services. The different labels distinguish local EndpointSlices from exported services' EndpointSlices.

This looks like a bug in MetalLB: it needs to be multicluster aware. The assumption that all EndpointSlices must have kubernetes.io/service-name is incorrect. If MetalLB is only interested in local services and their EndpointSlices, it should ignore EndpointSlices with the multicluster label. If it is interested in multicluster EndpointSlices, it should honor the multicluster label and use that.
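
For illustration only (this is not MetalLB's actual code), one way a consumer could tolerate the Lighthouse-created slices is a ServiceName index function that skips EndpointSlices carrying the multicluster label, or lacking kubernetes.io/service-name, instead of returning an error; it is such an error that client-go's thread-safe store turns into the panic shown in the logs above.

package main

import (
    "fmt"

    discoveryv1 "k8s.io/api/discovery/v1"
    "k8s.io/client-go/tools/cache"
)

const mcsServiceNameLabel = "multicluster.kubernetes.io/service-name"

// serviceNameIndexFunc indexes EndpointSlices by namespace/service-name, but
// silently skips slices that belong to the multicluster (KEP-1645) world or
// that have no kubernetes.io/service-name label, instead of returning an error
// that the informer's store would turn into a panic.
func serviceNameIndexFunc(obj interface{}) ([]string, error) {
    slice, ok := obj.(*discoveryv1.EndpointSlice)
    if !ok {
        return nil, fmt.Errorf("expected *discoveryv1.EndpointSlice, got %T", obj)
    }

    if _, isMCS := slice.Labels[mcsServiceNameLabel]; isMCS {
        return nil, nil // exported/imported service slice: not ours to index
    }

    name, ok := slice.Labels[discoveryv1.LabelServiceName]
    if !ok {
        return nil, nil // unlabeled slice: ignore rather than error out
    }
    return []string{slice.Namespace + "/" + name}, nil
}

func main() {
    indexer := cache.NewIndexer(cache.MetaNamespaceKeyFunc, cache.Indexers{
        "ServiceName": serviceNameIndexFunc,
    })
    _ = indexer // in real code this backs the EndpointSlice informer's store
}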

Woytek-Polnik commented 1 year ago

@vthapar Thank you for the extra context. It totally makes sense to me to just ignore them. In that case, I'll have to push for the MetalLB operator in OCP to be upgraded ⬆️

fedepaol commented 1 year ago

I landed on this thread by chance following the comment to https://github.com/metallb/metallb/issues/1175

MetalLB is not multicluster aware, and crashing in that scenario is certainly not the solution. We can make it more robust (and, incidentally, that may already have been done). I am not sure what the behaviour should be, because I need to read the KEP and think through the related CNI implications (i.e. what happens if traffic directed to a LB lands on a node that belongs to a different cluster than the one the service is defined on, whether the service is mirrored to all the clusters, etc.). So I'd split the fix in two: avoid crashing (if we still do) and ignore those EndpointSlices as a bug fix, and handle the multicluster scenario as an enhancement (if it makes sense).
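
As a sketch of the "ignore those EndpointSlices" option (my assumption of a possible approach, not MetalLB's implementation), the EndpointSlice list/watch could be restricted to objects that carry the kubernetes.io/service-name label, so Lighthouse-created multicluster slices never reach the cache or its indexers:

package main

import (
    discoveryv1 "k8s.io/api/discovery/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // A bare label key in a selector matches objects that have the label with
    // any value, so only EndpointSlices created by Kubernetes for local
    // Services are listed and watched; Lighthouse slices (which carry only
    // multicluster.kubernetes.io/service-name) are filtered out server-side.
    factory := informers.NewSharedInformerFactoryWithOptions(client, 0,
        informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
            opts.LabelSelector = discoveryv1.LabelServiceName
        }))

    informer := factory.Discovery().V1().EndpointSlices().Informer()
    _ = informer // in real code, register event handlers and start the factory
}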

This is from the community MetalLB point of view. @Woytek-Polnik, if you are eligible for support on OpenShift, please reach out to RH through the proper channels.