linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

Destination controller goes into failfast mode #11279

Closed bjoernw closed 11 months ago

bjoernw commented 1 year ago

What is the issue?

We use canaries with ArgoRollout via SMI, meaning there is a lot of EndpointSlice churn due to frequent release trains and their 10-15 minute canary rollouts (~300-400 meshed pods).

Occasionally we see the destination service go into failfast mode, which is problematic during a canary deployment because the old IPs of pods that were already torn down by ArgoRollout are still receiving traffic even though they should not. To remedy this we need to restart the destination deployment.
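(For reference, the restart here is just a rollout restart of the destination deployment, e.g. something along these lines, assuming the default linkerd namespace:)

    kubectl -n linkerd rollout restart deploy/linkerd-destination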

How can it be reproduced?

This happens intermittently under load. We run 5 instances of each control plane component.

Logs, error output, etc

The following are all logs from destination pods:

[ 18093.017333s]  WARN ThreadId(01) linkerd_reconnect: Service failed error=channel closed
[951475.788267s]  WARN ThreadId(01) linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[950833.887787s]  WARN ThreadId(01) inbound:server{port=8086}:controller{addr=localhost:8086}: linkerd_stack::failfast: Service entering failfast after 1s
[950842.896953s]  WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}: linkerd_stack::failfast: Service entering failfast after 10s
[ 18093.017325s]  WARN ThreadId(01) inbound:server{port=8090}:rescue{client.addr=10.30.243.33:34358}: linkerd_app_inbound::http::server: Unexpected error error=client 10.30.243.33:34358: server: 10.30.205.39:8090: server 10.30.205.39:8090: service linkerd-policy.linkerd.svc.cluster.local:8090: operation was canceled: connection closed error.sources=[server 10.30.205.39:8090: service linkerd-policy.linkerd.svc.cluster.local:8090: operation was canceled: connection closed, operation was canceled: connection closed, connection closed]
[     3.751136s]  WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 14118.108714s]  WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=failed to obtain peer address: Transport endpoint is not connected (os error 107) error.sources=[Transport endpoint is not connected (os error 107)]
[ 18093.017344s]  INFO ThreadId(01) inbound:server{port=8090}:rescue{client.addr=10.30.156.57:57528}: linkerd_app_core::errors::respond: gRPC request failed error=client 10.30.156.57:57528: server: 10.30.205.39:8090: server 10.30.205.39:8090: service linkerd-policy.linkerd.svc.cluster.local:8090: operation was canceled: connection closed error.sources=[server 10.30.205.39:8090: service linkerd-policy.linkerd.svc.cluster.local:8090: operation was canceled: connection closed, operation was canceled: connection closed, connection closed]

output of linkerd check -o short

linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on --------------
    see https://linkerd.io/2.13/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints

linkerd-version
---------------
‼ cli is up-to-date
    is running version 2.13.4 but the latest stable version is 2.13.6
    see https://linkerd.io/2.13/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane and cli versions match
    control plane running stable-2.13.6 but cli running stable-2.13.4
    see https://linkerd.io/2.13/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies and cli versions match
    linkerd-destination-789df68585-288sg running stable-2.13.6 but cli running stable-2.13.4
    see https://linkerd.io/2.13/checks/#l5d-cp-proxy-cli-version for hints

linkerd-viz
-----------
‼ viz extension proxies and cli versions match
    linkerd-destination-789df68585-288sg running stable-2.13.6 but cli running stable-2.13.4
    see https://linkerd.io/2.13/checks/#l5d-viz-proxy-cli-version for hints

Status check results are √

Environment

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None

bjoernw commented 1 year ago

Just hit this issue again with Connection refused (os error 111) on the destination controller. It caused a 2-minute outage that ultimately resolved itself only because I have a diagnostic script that monitors the destination controller in particular. Any update on triaging this issue, @adleong?

bjoernw commented 1 year ago

Another data point: I have a diagnostics script that runs linkerd diagnostics against every destination pod every minute and diffs the IPs each pod knows about against the IPs that the Kubernetes EndpointSlice controller knows about for 5 different services. It looks like the diagnostics command timed out in this instance (which causes the script to restart the offending destination pod).
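For illustration, a minimal sketch of that kind of staleness check (the service name, namespace, and port are placeholders; this simplified version queries the destination API as a whole rather than each destination pod individually, and the column index may need adjusting for your CLI version's output):

    #!/usr/bin/env bash
    # Compare the endpoint IPs Linkerd's destination API returns for a service
    # with the IPs Kubernetes reports in that service's EndpointSlices.
    SVC=my-svc    # placeholder
    NS=my-ns      # placeholder
    PORT=8080     # placeholder

    # IPs according to the Linkerd destination API (IP is the second column of
    # the table output; adjust if your CLI version prints a different layout).
    linkerd diagnostics endpoints "${SVC}.${NS}.svc.cluster.local:${PORT}" \
      | awk 'NR>1 {print $2}' | sort -u > /tmp/linkerd-ips.txt

    # IPs according to the Kubernetes EndpointSlices for the same service.
    kubectl -n "$NS" get endpointslices -l "kubernetes.io/service-name=${SVC}" \
      -o jsonpath='{.items[*].endpoints[*].addresses[0]}' \
      | tr ' ' '\n' | sort -u > /tmp/k8s-ips.txt

    # A non-empty diff suggests the destination controller's view has gone stale.
    diff /tmp/linkerd-ips.txt /tmp/k8s-ips.txt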

Maybe there is some race condition that locks up the entire destination controller process? Only 5 of the 10 pods had this issue and were restarted by the script.

adleong commented 1 year ago

Hi @bjoernw, sorry for the delay in responding to this. There are a couple of things that would be super helpful for digging deeper into what's going on here.

I see that you've included some of the proxy logs from the destination controller pod. It would also be great to see logs from the destination container and the policy container in that pod. Wherever possible, getting full logs rather than snippets will help establish context. It would also be great to see the controller metrics, which can be retrieved by running linkerd diagnostics controller-metrics.
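For example, something along these lines (the pod name is a placeholder; destination and policy are the container names in a standard linkerd-destination pod):

    # Full logs from the destination and policy containers of a destination pod.
    kubectl -n linkerd logs linkerd-destination-<hash> -c destination > destination.log
    kubectl -n linkerd logs linkerd-destination-<hash> -c policy > policy.log

    # Snapshot of the controller metrics.
    linkerd diagnostics controller-metrics > controller-metrics.txt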

omidraha commented 1 year ago

I think I have a similar issue. I have also added the output of the linkerd diagnostics controller-metrics command.

adleong commented 1 year ago

Thanks for all your patience with this issue, @bjoernw. I've looked at the logs that you sent me, and one thing that I noticed is that the failfast errors in the destination controller only seem to start happening after that pod has received a SIGTERM signal.

I believe that what is likely happening is that when the pod is sent a SIGTERM, the destination controller and policy controller containers shut down before the proxy does. When the proxy gets inbound requests for these controllers, which are no longer available, it returns a failfast error.

Pod shutdown ordering is a complex topic and there has been some debate about how to change the way we handle graceful shutdown here: https://github.com/linkerd/linkerd2/issues/10379

With all of that said, this should not affect your application traffic. If a data-plane proxy gets a 503 error from the control plane (because that control plane pod is in the process of shutting down), it should retry it and get load balanced to a different destination controller pod (which is hopefully not in the process of shutting down).

I want to revisit some assumptions from the original description here. You mention that there is an endpoint staleness condition where traffic gets routed to pods which no longer exist. You also mention failfast errors in the destination controller proxy logs and restarting the destination controller. Are you sure that you have the causality in the right direction? Is it possible that when you encounter endpoint staleness, you restart the destination controller, and during the course of shutting down, the destination controller proxy prints a bunch of failfast error logs?

bjoernw commented 1 year ago

I had to write this diagnostic script because this issue keeps happening, so the script is not the cause. I actually don't see my script actively terminating these pods; it must be the health check timing out that's causing k8s to terminate the pod.

olix0r commented 1 year ago

@bjoernw I think your theory about health checks failing is worth looking into in more detail, or more generally, it's worth determining why the kubelet is choosing to kill these pods...

Does kubectl describe indicate that health checks have failed? For which ports? Are there other events that might indicate why Kubernetes is terminating these processes?
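For example, something like the following should surface failed probes and other termination-related events (the pod name is a placeholder):

    # Events recorded against the pod, including any failed liveness/readiness probes.
    kubectl -n linkerd describe pod linkerd-destination-<hash> | sed -n '/^Events:/,$p'

    # The same events queried directly, sorted by time.
    kubectl -n linkerd get events \
      --field-selector involvedObject.name=linkerd-destination-<hash> \
      --sort-by=.lastTimestamp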

thjwhite commented 1 year ago

Wanted to chime in with my team's experience. Not completely sure if this is the same issue or not, but it looks very similar. We occasionally observe application pods going into a loop where we see a large number of 5xx errors. Our cluster runs on EKS, Linkerd 2.14.1. Log lines from the application pod's linkerd-proxy:

Starts at 2023-10-07 01:46:32

[ 1577.793693s] WARN ThreadId(01) outbound:proxy{addr=172.20.250.111:50051}:service{ns=yy name=xx port=50051}:endpoint{addr=10.43.52.17:50051}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)

[ 5.540692s] INFO ThreadId(01) outbound:proxy{addr=172.20.250.111:50051}:rescue{client.addr=10.43.26.109:57910}: linkerd_app_core::errors::respond: gRPC request failed error=logical service xx.yy.svc.cluster.local:50051: service in fail-fast error.sources=[service in fail-fast]

[ 5.540438s] WARN ThreadId(01) outbound:proxy{addr=172.20.250.111:50051}: linkerd_stack::failfast: Service entering failfast after 3s

The cluster in question has about 2000 to 3000 pods in the mesh at load. The applications all autoscale, so pods are coming in and out constantly throughout the day. Normally, everything is working great, but it occasionally gets stuck and starts reporting the above errors for every request. Restarting the destination linkerd pods does fix the issue. Interestingly, restarting the individual pods having connection issues also temporarily fixes the issue, although for a much shorter period of time. For example, restarting linkerd-destination fixes the issue for a few days or a week. Restarting individual failing pods fixes the issue for a few minutes or hours.

Logs from the destination controller. The log timestamps are not really correlated and come after the application pod errors from above. Starts at 2023-10-07 05:55:47:

[ 4.771258s] WARN ThreadId(01) watch{port=4191}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]

[ 4.269445s] WARN ThreadId(01) watch{port=4191}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]

There is one interesting log that does come at around the same time as the errors: Timestamp 2023-10-07 01:27:04

time="2023-10-07T08:27:04Z" level=error msg="failed to send protocol update: rpc error: code = Canceled desc = context canceled" addr=":8086" component=server ns=zz pod=cc

olix0r commented 1 year ago

@thjwhite

Wanted to chime in with my team's experience. Not completely sure if this is the same issue or not, but it looks very similar. We occasionally observe application pods going into a loop where we see a large number of 5xx errors. Our cluster runs on EKS, Linkerd 2.14.1.

Thank you for the additional data!

Log lines from the application pod's linkerd-proxy:

Starts at 2023-10-07 01:46:32

[ 1577.793693s] WARN ThreadId(01) outbound:proxy{addr=172.20.250.111:50051}:service{ns=yy name=xx port=50051}:endpoint{addr=10.43.52.17:50051}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)

[ 5.540692s] INFO ThreadId(01) outbound:proxy{addr=172.20.250.111:50051}:rescue{client.addr=10.43.26.109:57910}: linkerd_app_core::errors::respond: gRPC request failed error=logical service xx.yy.svc.cluster.local:50051: service in fail-fast error.sources=[service in fail-fast]

[ 5.540438s] WARN ThreadId(01) outbound:proxy{addr=172.20.250.111:50051}: linkerd_stack::failfast: Service entering failfast after 3s

Are these logs from multiple pods or a single pod? Specifically, the first quoted log occurred ~26mins after the proxy started and the second quoted log snippet occurred ~5s after the proxy started.

In any case, seeing a failfast timeout 5s after pod startup strongly indicates that the control plane did not return any actively available endpoints to the data plane.

The cluster in question has about 2000 to 3000 pods in the mesh at load. The applications all autoscale, so pods are coming in and out constantly throughout the day. Normally, everything is working great, but it occasionally gets stuck and starts reporting the above errors for every request. Restarting the destination linkerd pods does fix the issue. Interestingly, restarting the individual pods having connection issues also temporarily fixes the issue, although for a much shorter period of time. For example, restarting linkerd-destination fixes the issue for a few days or a week.

This is helpful context.

Restarting individual failing pods fixes the issue for a few minutes or hours.

This is surprising given that the above log indicated a proxy in this state only 5s after startup. That is, the initial lookup the proxy performed would appear not to include any available endpoints.

It sounds like you're running multiple destination controller replicas? Perhaps one replica is getting into a bad state while the others remain healthy; so as you restart data plane workloads, resolutions randomly hit controller pods and the problem doesn't manifest.

Logs from the destination controller. The log timestamps are not really correlated and come after the application pod errors from above. Starts at 2023-10-07 05:55:47:

[ 4.771258s] WARN ThreadId(01) watch{port=4191}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]

[ 4.269445s] WARN ThreadId(01) watch{port=4191}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]

These logs are benign and occur during startup (we see this happened 4s after the proxy started), as the proxy attempts to connect to the local policy controller, which may take a few seconds to begin serving its gRPC server. We probably ought to find a way to avoid logging warnings during startup...

There is one interesting log that does come at around the same time as the errors: Timestamp 2023-10-07 01:27:04

time="2023-10-07T08:27:04Z" level=error msg="failed to send protocol update: rpc error: code = Canceled desc = context canceled" addr=":8086" component=server ns=zz pod=cc

This typically indicates that a client canceled a resolution (i.e. due to a restart, etc). These are fairly typical and probably shouldn't be logged at error level...


The information you've provided lines up closely with similar reports we have received. We have been actively trying to reproduce this problem with a variety of testing strategies and have so far been unsuccessful.

Also, we are going to put together a plan to get higher fidelity diagnostics about the destination controller's state.

We'll provide updates on this issue as this work progresses.

thjwhite commented 1 year ago

Glad to help. If there are any additional diagnostics we could run on our cluster, I'd be happy to provide them.

To answer your questions:

Are these logs from multiple pods or a single pod? Specifically, the first quoted log occurred ~26mins after the proxy started and the second quoted log snippet occurred ~5s after the proxy started. In any case, seeing a failfast timeout 5s after pod startup strongly indicates that the control plane did not return any actively available endpoints to the data plane.

I should have included more info on the pods. Yes, they are from separate pods. I had pulled the logs from our central logging system, filtering on our application and the linkerd-proxy container, so I think they were interleaved. One thing we had brainstormed when talking about the issue internally is that it might be an issue with the K8s/EKS control plane itself. For example, if the k8s control plane had an increased error rate, latency, or connection errors, maybe the destination controller also fails in an unexpected way? We have a ticket open with our AWS contacts to see if we can get any more info on what happened on that side of things during the outage. We have had issues in the past where unexpectedly high load on the EKS/k8s control plane causes API errors at the k8s level, which can affect things like operators, CI/CD, and anything using the k8s APIs.

It sounds like you're running multiple destination controller replicas? Perhaps one replica is getting into a bad state while the others remain healthy; so as you restart data plane workloads, resolutions randomly hit controller pods and the problem doesn't manifest.

Yes, we run 3 replicas of the destination controller. One other thing to note is that the linkerd controller pods were all "healthy" and in the ready state at the time of the outage. If one is getting into a bad state, the health checks are not flagging it in our cluster.

This typically indicates that a client canceled a resolution (i.e. due to a restart, etc). These are fairly typical and probably shouldn't be logged at error level...

Good to know that! I figured I'd include it since it stood out due to the timing of it.

olix0r commented 1 year ago

@thjwhite

The applications all autoscale, so pods are coming in and out constantly throughout the day.

Are you scaling nodes in this situation too, or is this a fixed node pool?

We've kicked off a few scale/burn-in tests to exercise aggressive service churn and I'm curious if node churn is a part of the equation for you as well.

thjwhite commented 1 year ago

Yes, we are scaling up/down nodes as well.

olix0r commented 1 year ago

We have a theory that we're chasing down and we've identified a prometheus query that can help highlight this problem:

sum (endpoints_updates{component='destination'}) by(instance, service, port)

If you don't have a Prometheus setup, you can approximate this by snapshotting the control plane's state a few times to fake a time series:

date && linkerd diagnostics controller-metrics && sleep 60 && date && linkerd diagnostics controller-metrics
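As a rough sketch, the snapshots can be diffed directly on the endpoints_updates counters (assuming those counters appear under that name in the diagnostics output):

    # Capture the destination controller's update counters twice, a minute apart.
    linkerd diagnostics controller-metrics | grep 'endpoints_updates' > snapshot-1.txt
    sleep 60
    linkerd diagnostics controller-metrics | grep 'endpoints_updates' > snapshot-2.txt

    # If the counters for a service stop changing while pods keep churning,
    # that controller instance's updates have flat-lined.
    diff snapshot-1.txt snapshot-2.txt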

We expect that all service updates should flat-line for a single destination controller instance. Something like:

[graph omitted: endpoints_updates flat-lining for a single destination controller instance]

We also expect that other destination controller instances should not experience any directly correlated failures of this nature.

We're still working to confirm our hypothesis and to ultimately reproduce this issue, but in the meantime capturing data from incidents of this failure will most effectively help to confirm or disprove our theory.

hawkw commented 11 months ago

The latest edge release, edge-23.10.3, contains changes to the proxy and Destination controller which should resolve these issues. We'd love your help validating this fix by trying out edge-23.10.3 in your environments and letting us know if you encounter any more use of stale endpoints. Thanks!

DavidMcLaughlin commented 11 months ago

We are confident that the destination controller staleness issues were due to a deadlock condition, and the fix is now available in stable-2.14.2. Please feel free to reopen if you encounter this issue after upgrading.

hadican commented 11 months ago

@DavidMcLaughlin Could you please apply it to 2.13 if applicable?

bjoernw commented 11 months ago

Thanks, Team!