Open · zack-littke-smith-ai opened 4 months ago
Hi @zack-littke-smith! I'd recommend looking at the full client proxy logs, beyond those two log lines. The Linkerd proxy logs when addresses are added to its load balancers, so the first thing I'd check is whether the correct addresses for service-name.namespace.svc.cluster.local:10079 have been added to the client proxy's load balancer.
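As an illustration (this is not Linkerd tooling; the log format is inferred from the lines pasted in this thread), a small script can sift a saved proxy log for lines that mention the target port, which makes it easier to spot balancer and failfast events:

```python
import re

# Assumed: proxy log lines in the bracketed-timestamp format shown in this thread,
# e.g. "[   6.059523s] WARN ThreadId(01) ...".
LINE_RE = re.compile(r"\[\s*(?P<ts>[\d.]+)s\]\s+(?P<level>\w+)\s+(?P<rest>.*)")

def filter_proxy_log(lines, needle):
    """Return (timestamp, level, rest) tuples for log lines mentioning `needle`."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and needle in m.group("rest"):
            hits.append((float(m.group("ts")), m.group("level"), m.group("rest")))
    return hits

# Sample lines taken from the logs in this thread.
sample = [
    "[   6.059523s] WARN ThreadId(01) outbound:proxy{addr=10.100.32.3:10079}: linkerd_stack::failfast: Service entering failfast after 3s",
    "[   0.003389s] INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191",
]
print(filter_proxy_log(sample, ":10079"))
```

Run against the full log (e.g. saved from kubectl logs), this surfaces every line touching the problem port in timestamp order.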
Before we see errors, we have the following client logs:
[ 0.001866s] INFO ThreadId(01) linkerd2_proxy: release 2.210.0 (85db2fc) by linkerd on 2023-09-21T21:24:58Z
[ 0.002681s] INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[ 0.003389s] INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[ 0.003426s] INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[ 0.003430s] INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[ 0.003432s] INFO ThreadId(01) linkerd2_proxy: Tap interface on 0.0.0.0:4190
[ 0.003434s] INFO ThreadId(01) linkerd2_proxy: Local identity is default.namespace.serviceaccount.identity.linkerd.cluster.local
[ 0.003436s] INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.003438s] INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.015661s] INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity id=default.namespace.serviceaccount.identity.linkerd.cluster.local
[ 6.059523s] WARN ThreadId(01) outbound:proxy{addr=10.100.32.3:10079}: linkerd_stack::failfast: Service entering failfast after 3s
// First error here:
[ 6.059608s] INFO ThreadId(01) outbound:proxy{addr=10.100.32.3:10079}:rescue{client.addr=172.28.157.182:50400}: linkerd_app_core::errors::respond: gRPC request failed error=logical service simian-config.namespace.svc.cluster.local:10079: service in fail-fast error.sources=[service in fail-fast]
We also see the following additional failures, which I didn't notice before and didn't include above:
[ 89.602769s] WARN ThreadId(01) linkerd_reconnect: Service failed error=channel closed
Ah, the proxy logging that I referred to was added after stable-2.14.1. If you upgrade to a recent edge release, you'll have more informative proxy logging about the state of the load balancer and why the service is entering fail-fast.
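In the meantime, you can also raise proxy log verbosity on just the affected workload with the config.linkerd.io/proxy-log-level annotation (a documented Linkerd annotation; the exact level string below is one common choice, adjust as needed). For example, on the client's pod template:

```yaml
# Pod template annotation on the client workload (workload names are placeholders).
metadata:
  annotations:
    config.linkerd.io/proxy-log-level: "warn,linkerd=debug"
```

Note that debug-level proxy logging is verbose, so it's best scoped to the workloads you're investigating.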
I'll look into getting us more up-to-date and come back around. Is there anything else about our setup that stands out to you as problematic, or anything else I should turn on in the meantime? This issue is quite rare for us, so cycling back here might be slow.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
What is the issue?
I am running into a really difficult-to-reproduce issue where our k8s pod will somehow decide not to serve certain clients, giving the following logs in the client proxy:
And:
However, during this time the service does successfully connect to other clients and serve their requests; the failures affect only specific clients. Restarting the clients has no effect, and restarting the service can sometimes help, resulting in reconnection to some clients but continued failure for others.
The only 'solution' we've had success with is restarting every single Linkerd container and every proxy-injected service, which is not ideal, to say the least.
While I have no solid repro, I'm hoping to at least take away some debugging tips for the next time this happens to us.
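One concrete debugging tip: the proxy's admin interface (0.0.0.0:4191 in the logs above) serves Prometheus text-format metrics at /metrics, which can be port-forwarded and scraped while the issue is live to compare outbound traffic per destination. The sketch below parses that text format; the metric name and labels in the sample are assumptions shaped like linkerd-proxy output and should be verified against your proxy's actual /metrics response:

```python
def parse_prom(text, metric):
    """Collect (series, value) pairs for `metric` from Prometheus text-format output."""
    out = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        name_labels, _, value = line.rpartition(" ")
        if name_labels.split("{", 1)[0] == metric:
            out.append((name_labels, float(value)))
    return out

# Hypothetical sample shaped like linkerd-proxy metrics (labels abbreviated).
sample = """\
# TYPE request_total counter
request_total{direction="outbound",authority="simian-config.namespace.svc.cluster.local:10079"} 42
response_total{direction="outbound",classification="failure"} 7
"""
print(parse_prom(sample, "request_total"))
```

Scraping this on both a failing and a working client while the issue is active would show whether the failing client's proxy is sending requests at all, or failing before dispatch.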
How can it be reproduced?
Unfortunately, I have not been able to reliably reproduce this in our own environments.
Logs, error output, etc
Proxy logs from the service:
Logs from the client proxy are included above.
Output of linkerd check -o short:
Environment
linkerd_controller: stable-2.14.1
linkerd_debug: stable-2.14.1
linkerd_grafana: stable-2.11.1
linkerd_metrics_api: stable-2.14.1
linkerd_policy_controller: stable-2.14.1
linkerd_proxy: stable-2.14.1
linkerd_proxy_init: v2.2.3
linkerd_tap: stable-2.14.1
linkerd_web: stable-2.14.1
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None