solo-io / gloo

The Feature-rich, Kubernetes-native, Next-Generation API Gateway Built on Envoy
https://docs.solo.io/
Apache License 2.0

Gloo Fed - gloo-fed pod crashes when not able to connect to the remote context #7072

Open bcollard opened 2 years ago

bcollard commented 2 years ago

Gloo Edge Version

1.12.x (latest stable)

Kubernetes Version

1.23.x

Describe the bug

Given a gloo-fed deployment running on cluster 1,
and a Gloo Edge deployment running on cluster 2,
when I register cluster 2 with glooctl cluster register ...,
if the api-server address that is set in the secret associated with the KubernetesCluster CR is not reachable,
then the gloo-fed pod crashes with the following error message:

127.0.0.1:52569: connect: connection refused","stacktrace":"sigs.k8s.io/controller-runtime/pkg/manager.New\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.7/pkg/manager/manager.go:312\ngithub.com/solo-io/skv2/pkg/multicluster/watch.(*clusterWatcher).startManager.func1.1\n\t/go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/watch/watcher.go:99\ngithub.com/avast/retry-go.Do\n\t/go/pkg/mod/github.com/avast/retry-go@v2.4.3+incompatible/retry.go:103\ngithub.com/solo-io/skv2/pkg/multicluster/watch.(*clusterWatcher).startManager.func1\n\t/go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/watch/watcher.go:97"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1b15c00]

goroutine 478 [running]:
github.com/solo-io/skv2/pkg/reconcile.(*runner).RunReconciler(0xc000f88000, {0x38be160, 0xc00078a080}, {0x38883e0, 0xc000d38000}, {0x0, 0x0, 0x0})
    /go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/reconcile/runner.go:96 +0x80
github.com/solo-io/skv2/pkg/multicluster/reconcile.(*reconcilerList).runAll(0xc0005c8660, {0xc00104a6e0, 0xd}, {0x3888420, 0xc000f88000})
    /go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/reconcile/reconcile.go:159 +0x2a6
github.com/solo-io/skv2/pkg/multicluster/reconcile.(*clusterLoopSet).ensureReconcilers(0xc0005dc400, 0xc0005c8660)
    /go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/reconcile/reconcile.go:109 +0x17a
github.com/solo-io/skv2/pkg/multicluster/reconcile.(*clusterLoopRunner).AddCluster(0xc0008516d0, {0x38be160, 0xc000e3d2c0}, {0xc00104a6e0, 0xd}, {0x0, 0x0})
    /go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/reconcile/reconcile.go:48 +0x1d5
github.com/solo-io/skv2/pkg/multicluster/watch.(*handlerList).AddCluster(0xc0005c8420, {0x38be160, 0xc000e3d2c0}, {0xc00104a6e0, 0xd}, {0x0, 0x0})
    /go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/watch/watcher.go:230 +0x1c4
github.com/solo-io/skv2/pkg/multicluster/watch.(*clusterWatcher).startManager.func1.1()
    /go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/watch/watcher.go:103 +0x3f6
github.com/avast/retry-go.Do(0xc00107bfa8, {0xc0007f8790, 0x3, 0x3})
    /go/pkg/mod/github.com/avast/retry-go@v2.4.3+incompatible/retry.go:103 +0x2a5
github.com/solo-io/skv2/pkg/multicluster/watch.(*clusterWatcher).startManager.func1()
    /go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/watch/watcher.go:97 +0x188
created by github.com/solo-io/skv2/pkg/multicluster/watch.(*clusterWatcher).startManager
    /go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/watch/watcher.go:90 +0xdb
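
For what it's worth, the trace suggests a pattern roughly like the following: manager.New fails for the unreachable cluster, the error is only logged, and a nil manager is still handed down to the reconciler chain, which then dereferences it. This is a simplified Go sketch of that pattern, not the actual skv2 code; all names are illustrative.

package sketch

import (
    "log"

    "k8s.io/client-go/rest"
    "sigs.k8s.io/controller-runtime/pkg/manager"
)

// addCluster stands in for the handler/reconciler chain that receives the manager.
func addCluster(cluster string, mgr manager.Manager) {
    // Calling any method on a nil manager panics with a nil pointer
    // dereference, which matches the SIGSEGV above.
    _ = mgr.GetClient()
}

func startManager(cfg *rest.Config, cluster string) {
    mgr, err := manager.New(cfg, manager.Options{})
    if err != nil {
        // The error from the unreachable api-server is logged
        // ("Failed to get API Group-Resources" / "connection refused")...
        log.Printf("failed to create manager for cluster %q: %v", cluster, err)
        // ...but without an early return here, mgr (nil) is still passed on.
    }
    addCluster(cluster, mgr) // panics when mgr is nil, crashing the whole pod
}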

On the gloo-fed cluster, the KubernetesCluster CR references a Secret containing a kubeconfig similar to the following (retrieved with kubectl -n gloo-system get secret my-remote-cluster -o jsonpath="{.data.kubeconfig}" | base64 -d):

apiVersion: v1
clusters:
- cluster:
    insecure-skip-tls-verify: true
    server: https://<unresolvable address>:<some port>
  name: kind-2-remote
...
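
One way to catch this earlier would be to validate the kubeconfig stored in that secret before handing it to the multicluster watcher, since the failing call is essentially API group discovery against the unreachable server. A minimal sketch using client-go (the validateRemoteKubeconfig helper is hypothetical, not an existing gloo-fed function):

package sketch

import (
    "fmt"

    "k8s.io/client-go/discovery"
    "k8s.io/client-go/restmapper"
    "k8s.io/client-go/tools/clientcmd"
)

// validateRemoteKubeconfig returns an error instead of letting an unreachable
// api-server address propagate into a panic further down the stack.
func validateRemoteKubeconfig(kubeconfig []byte) error {
    cfg, err := clientcmd.RESTConfigFromKubeConfig(kubeconfig)
    if err != nil {
        return fmt.Errorf("invalid kubeconfig: %w", err)
    }
    dc, err := discovery.NewDiscoveryClientForConfig(cfg)
    if err != nil {
        return fmt.Errorf("building discovery client: %w", err)
    }
    // Roughly the same discovery call that fails with
    // "Failed to get API Group-Resources" when the server address is unreachable.
    if _, err := restmapper.GetAPIGroupResources(dc); err != nil {
        return fmt.Errorf("remote api-server not reachable: %w", err)
    }
    return nil
}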

Steps to reproduce the bug

See above

Expected Behavior

The pod should not crash at all; updating the KubernetesCluster resource with a 'FAILED' status would be better.

The current workaround is to force another api-server address with the command-line parameter --local-cluster-domain-override, but without that override the gloo-fed pod keeps crashing.

Additional Context

No response

bcollard commented 2 years ago

Another error reported by a customer: after running glooctl cluster register .., the KubernetesCluster CR was created but the GlooInstance CR was not:

{"level":"error","ts":1661851560.0879245,"logger":"gloo-fed","msg":"Failed to get API Group-Resources","version":"1.12.9","error":"the server has asked for the client to provide credentials","stacktrace":"sigs.k8s.io/controller-runtime/pkg/manager.New\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.7/pkg/manager/manager.go:312\ngithub.com/solo-io/skv2/pkg/multicluster/watch.(*clusterWatcher).startManager.func1.1\n\t/go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/watch/watcher.go:99\ngithub.com/avast/retry-go.Do\n\t/go/pkg/mod/github.com/avast/retry-go@v2.4.3+incompatible/retry.go:103\ngithub.com/solo-io/skv2/pkg/multicluster/watch.(*clusterWatcher).startManager.func1\n\t/go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/watch/watcher.go:97"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1b15c00]
goroutine 581 [running]:
github.com/solo-io/skv2/pkg/reconcile.(*runner).RunReconciler(0xc003b6d320, {0x38be160, 0xc0005d46c0}, {0x38883e0, 0xc000698a40}, {0x0, 0x0, 0x0})
    /go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/reconcile/runner.go:96 +0x80
github.com/solo-io/skv2/pkg/multicluster/reconcile.(*reconcilerList).runAll(0xc000eca270, {0xc004eb9a88, 0x8}, {0x3888420, 0xc003b6d320})
    /go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/reconcile/reconcile.go:159 +0x2a6
github.com/solo-io/skv2/pkg/multicluster/reconcile.(*clusterLoopSet).ensureReconcilers(0xc000eb9080, 0xc000eca270)
    /go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/reconcile/reconcile.go:109 +0x17a
github.com/solo-io/skv2/pkg/multicluster/reconcile.(*clusterLoopRunner).AddCluster(0xc000ec45a0, {0x38be160, 0xc0003ec580}, {0xc004eb9a88, 0x8}, {0x0, 0x0})
    /go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/reconcile/reconcile.go:48 +0x1d5
github.com/solo-io/skv2/pkg/multicluster/watch.(*handlerList).AddCluster(0xc000eca030, {0x38be160, 0xc0003ec580}, {0xc004eb9a88, 0x8}, {0x0, 0x0})
    /go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/watch/watcher.go:230 +0x1c4
github.com/solo-io/skv2/pkg/multicluster/watch.(*clusterWatcher).startManager.func1.1()
    /go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/watch/watcher.go:103 +0x3f6
github.com/avast/retry-go.Do(0xc003b75fa8, {0xc000093f90, 0x3, 0x3})
    /go/pkg/mod/github.com/avast/retry-go@v2.4.3+incompatible/retry.go:103 +0x2a5
github.com/solo-io/skv2/pkg/multicluster/watch.(*clusterWatcher).startManager.func1()
    /go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/watch/watcher.go:97 +0x188
created by github.com/solo-io/skv2/pkg/multicluster/watch.(*clusterWatcher).startManager
    /go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/watch/watcher.go:90 +0xdb
Stream closed EOF for gloo-fed/gloo-fed-67566bcd69-qq448 (gloo-fed)
bcollard commented 2 years ago

One more error, this time with a local (kind) cluster:

{"level":"error","ts":1662126113.1518774,"logger":"gloo-fed","msg":"Failed to get API Group-Resources","version":"1.12.9","error":"specifying a root certificates file with the insecure flag is not allowed","stacktrace":"sigs.k8s.io/controller-runtime/pkg/manager.New\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.7/pkg/manager/manager.go:312\ngithub.com/solo-io/skv2/pkg/multicluster/watch.(*clusterWatcher).startManager.func1.1\n\t/go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/watch/watcher.go:99\ngithub.com/avast/retry-go.Do\n\t/go/pkg/mod/github.com/avast/retry-go@v2.4.3+incompatible/retry.go:103\ngithub.com/solo-io/skv2/pkg/multicluster/watch.(*clusterWatcher).startManager.func1\n\t/go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/watch/watcher.go:97"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1b15c00]

goroutine 233 [running]:
github.com/solo-io/skv2/pkg/reconcile.(*runner).RunReconciler(0xc000adb020, {0x38be160, 0xc000081880}, {0x38883e0, 0xc0000c3520}, {0x0, 0x0, 0x0})
    /go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/reconcile/runner.go:96 +0x80
github.com/solo-io/skv2/pkg/multicluster/reconcile.(*reconcilerList).runAll(0xc0009a37a0, {0xc0004870d0, 0xd}, {0x3888420, 0xc000adb020})
    /go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/reconcile/reconcile.go:159 +0x2a6
github.com/solo-io/skv2/pkg/multicluster/reconcile.(*clusterLoopSet).ensureReconcilers(0xc000a2a4c0, 0xc0009a37a0)
    /go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/reconcile/reconcile.go:109 +0x17a
github.com/solo-io/skv2/pkg/multicluster/reconcile.(*clusterLoopRunner).AddCluster(0xc000b7b4a0, {0x38be160, 0xc0006bc400}, {0xc0004870d0, 0xd}, {0x0, 0x0})
    /go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/reconcile/reconcile.go:48 +0x1d5
github.com/solo-io/skv2/pkg/multicluster/watch.(*handlerList).AddCluster(0xc0009a3560, {0x38be160, 0xc0006bc400}, {0xc0004870d0, 0xd}, {0x0, 0x0})
    /go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/watch/watcher.go:230 +0x1c4
github.com/solo-io/skv2/pkg/multicluster/watch.(*clusterWatcher).startManager.func1.1()
    /go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/watch/watcher.go:103 +0x3f6
github.com/avast/retry-go.Do(0xc00003bfa8, {0xc00085ff90, 0x3, 0x3})
    /go/pkg/mod/github.com/avast/retry-go@v2.4.3+incompatible/retry.go:103 +0x2a5
github.com/solo-io/skv2/pkg/multicluster/watch.(*clusterWatcher).startManager.func1()
    /go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/watch/watcher.go:97 +0x188
created by github.com/solo-io/skv2/pkg/multicluster/watch.(*clusterWatcher).startManager
    /go/pkg/mod/github.com/solo-io/skv2@v0.21.7/pkg/multicluster/watch/watcher.go:90 +0xdb

where the workaround is to use insecure connection settings like the following:

- cluster:
    certificate-authority-data: ""
    server: https://host.docker.internal:<api-server port>
    insecure-skip-tls-verify: true

We need better error handling.
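
Concretely, something like this would already help: if manager.New fails for a registered cluster, record the failure on the KubernetesCluster resource and skip that cluster instead of panicking. A minimal sketch, assuming a hypothetical setClusterStatus helper (the real KubernetesCluster status schema may differ):

package sketch

import (
    "log"

    "k8s.io/client-go/rest"
    "sigs.k8s.io/controller-runtime/pkg/manager"
)

// setClusterStatus is a hypothetical helper that would patch the
// KubernetesCluster CR with a FAILED state and the underlying error.
func setClusterStatus(cluster, state string, err error) {
    log.Printf("cluster %q -> %s: %v", cluster, state, err)
}

func startManagerSafely(cfg *rest.Config, cluster string, addCluster func(string, manager.Manager)) {
    mgr, err := manager.New(cfg, manager.Options{})
    if err != nil {
        // Surface the problem on the CR and skip this cluster instead of
        // letting a nil manager take down the whole gloo-fed pod.
        setClusterStatus(cluster, "FAILED", err)
        return
    }
    addCluster(cluster, mgr)
}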

github-actions[bot] commented 4 months ago

This issue has been marked as stale because of no activity in the last 180 days. It will be closed in the next 180 days unless it is tagged "no stalebot" or other activity occurs.