linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

Linkerd CNI pods not aware of the OIDC signing key auto-rotation by AKS #12573

Open Peeyush1989 opened 2 months ago

Peeyush1989 commented 2 months ago

What is the issue?

We are using a private AKS cluster, version 1.26.x, with Linkerd stable version 2.14.2 and linkerd-cni enabled.

The AKS cluster has OIDC enabled, which is designed to auto-rotate the signing keys periodically.

After the OIDC keys were auto-rotated, all new pods got stuck with the following error:

“FailedCreatePodSandBox (x556 over ) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "3756782430d4016076288c700b871e4325ca2d5d6bdd7a422697c7d3b54d23e6": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized”
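One plausible reading of the `Unauthorized` response is that the plugin is presenting a service account token signed with a key the API server no longer trusts after rotation. This has not been confirmed in this report. A minimal sketch for checking that hypothesis is to read the `kid` (key ID) from the token's JWT header and compare it with the key IDs the cluster currently publishes. The helper names `jwt_header` and `jwt_kid` are made up for illustration, and the sketch assumes GNU `base64`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for inspecting a service account token (a JWT).

# Print the decoded JOSE header of a JWT (the segment before the first dot).
jwt_header() {
  local token="$1"
  local header="${token%%.*}"            # keep only the first segment
  # base64url -> base64: swap the URL-safe characters back
  header="$(printf '%s' "$header" | tr '_-' '/+')"
  # restore the '=' padding that JWTs strip
  while [ $(( ${#header} % 4 )) -ne 0 ]; do header="${header}="; done
  printf '%s' "$header" | base64 -d
}

# Extract the "kid" value from the header without requiring jq.
jwt_kid() {
  jwt_header "$1" | sed -n 's/.*"kid"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}
```

The resulting `kid` can then be compared with the keys returned by `kubectl get --raw /openid/v1/jwks`. Where the CNI plugin keeps its token on the node depends on the installation and is not stated in this report, so the token has to be located manually before calling `jwt_kid "$TOKEN"`.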

After restarting the linkerd-cni DaemonSet, we were able to deploy new pods, but the existing pods in the Linkerd-meshed namespace started reporting invalid certificate errors, and communication between pods was impacted.
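For reference, the restart described above can be sketched as follows. The DaemonSet name and namespace (`linkerd-cni` for both) are assumptions based on common defaults and are not stated in this report. The function name `restart_linkerd_cni` is made up for illustration.

```shell
#!/usr/bin/env bash
# Assumptions (not stated in the issue): the DaemonSet is named "linkerd-cni"
# and lives in the "linkerd-cni" namespace. Override via the variables below.
KUBECTL="${KUBECTL:-kubectl}"
CNI_NAMESPACE="${CNI_NAMESPACE:-linkerd-cni}"
CNI_DAEMONSET="${CNI_DAEMONSET:-linkerd-cni}"

restart_linkerd_cni() {
  # Trigger a rolling restart so every CNI pod is recreated
  "$KUBECTL" rollout restart "daemonset/$CNI_DAEMONSET" -n "$CNI_NAMESPACE" &&
  # Wait until the recreated pods are ready, or give up after the timeout
  "$KUBECTL" rollout status "daemonset/$CNI_DAEMONSET" -n "$CNI_NAMESPACE" --timeout=120s
}
```

Calling `restart_linkerd_cni` only recreates the CNI pods. It does not touch workloads that are already meshed, which matches the observation that existing pods still needed separate attention.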

We checked the issuer certificate and it was valid. We had to redeploy Linkerd to resolve the issue.

We need your help troubleshooting Linkerd issues with OIDC.

How can it be reproduced?

To reproduce this issue, we need to manually rotate the OIDC signing keys in a new environment.
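A manual rotation could be triggered roughly as follows, using the Azure CLI's `az aks oidc-issuer rotate-signing-keys` command. The wrapper name `rotate_oidc_keys` and the resource group and cluster names in the usage line are placeholders, not values from this report.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the Azure CLI key-rotation command.
AZ="${AZ:-az}"

rotate_oidc_keys() {
  # Both arguments are required; abort with a message if either is missing
  local resource_group="${1:?resource group required}"
  local cluster="${2:?cluster name required}"
  "$AZ" aks oidc-issuer rotate-signing-keys \
    --resource-group "$resource_group" --name "$cluster"
}
```

Usage: `rotate_oidc_keys my-resource-group my-aks-cluster`. As far as the AKS documentation describes it, a single rotation keeps the previous key valid for a while, so the rotation may need to be run twice before old tokens are rejected. That behavior should be verified against the current Azure documentation before relying on it for a reproduction.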

Logs, error output, etc

Linkerd control plane

[ 0.105506s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 0.306969s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 0.710647s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 1.211775s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 1.713047s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 2.215585s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 2.716391s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 3.217705s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]

output of linkerd check -o short

N/A

Environment

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

yes