linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

Outbound HTTP call failing intermittently #11055

Closed · vkperumal closed this issue 1 year ago

vkperumal commented 1 year ago

What is the issue?

Outbound HTTP calls to an application fail intermittently with a connection closed error. The source application whose HTTP calls fail is a scheduler that makes regular calls to other applications on a schedule. After a failure, the next scheduled run works fine. Please see the linkerd-proxy logs below.

How can it be reproduced?

Linkerd Version: 2.13.4

Logs, error output, etc

[129664.685446s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39860
[129664.691331s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39872
[129664.691557s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39878
[129664.692554s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39894
[129664.695277s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39900
[129784.342535s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:42724
[129784.342604s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:42738
[129784.342642s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:42746
[129784.342682s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:42760
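
If it helps to capture more detail than the INFO-level lines above, the proxy log level can be raised on the affected workload with the standard config.linkerd.io/proxy-log-level annotation. A minimal sketch, assuming the scheduler runs as a Deployment named scheduler in my-namespace (both names are placeholders); patching the pod template rolls the pods, so the new level applies to the restarted proxies:

kubectl -n my-namespace patch deploy/scheduler -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-log-level":"warn,linkerd=debug"}}}}}'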

output of linkerd check -o short

linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2023-06-25T04:48:41Z
    see https://linkerd.io/2.13/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints

linkerd-webhooks-and-apisvc-tls
-------------------------------
‼ proxy-injector cert is valid for at least 60 days
    certificate will expire on 2023-06-24T04:48:45Z
    see https://linkerd.io/2.13/checks/#l5d-proxy-injector-webhook-cert-not-expiring-soon for hints
‼ sp-validator cert is valid for at least 60 days
    certificate will expire on 2023-06-24T04:48:46Z
    see https://linkerd.io/2.13/checks/#l5d-sp-validator-webhook-cert-not-expiring-soon for hints
‼ policy-validator cert is valid for at least 60 days
    certificate will expire on 2023-06-24T04:48:45Z
    see https://linkerd.io/2.13/checks/#l5d-policy-validator-webhook-cert-not-expiring-soon for hints

linkerd-version
---------------
‼ cli is up-to-date
    is running version 2.13.4 but the latest stable version is 2.13.5
    see https://linkerd.io/2.13/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.13.4 but the latest stable version is 2.13.5
    see https://linkerd.io/2.13/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
    * linkerd-destination-5b59946658-pm2wk (stable-2.13.4)
    * linkerd-identity-54dfb7648b-hhp2r (stable-2.13.4)
    * linkerd-proxy-injector-5fc4795598-4h68n (stable-2.13.4)
    see https://linkerd.io/2.13/checks/#l5d-cp-proxy-version for hints

linkerd-viz
-----------
‼ tap API server cert is valid for at least 60 days
    certificate will expire on 2023-06-24T04:22:23Z
    see https://linkerd.io/2.13/checks/#l5d-tap-cert-not-expiring-soon for hints
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
    * metrics-api-d6665b56-7hs59 (stable-2.13.4)
    * tap-6d45c9cd5c-wj2ft (stable-2.13.4)
    * tap-injector-6c95fb4f97-q945w (stable-2.13.4)
    * web-7c9756b6b-4bcvw (stable-2.13.4)
    see https://linkerd.io/2.13/checks/#l5d-viz-proxy-cp-version for hints

Status check results are √

Environment

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None

someone-stole-my-name commented 1 year ago

Since our upgrade to 2.13.4 we've been experiencing the same errors in our cluster, which result in transient 503s being returned to clients. It looks like the same or a very similar problem to the one reported in https://linkerd.buoyant.io/t/503-service-unavailable-from-proxy-with-large-number-of-connections/126. It doesn't affect any service specifically; the whole mesh is affected randomly during peak load. There's no throttling or resource-constraint issue, and we're not using service profiles, Linkerd's circuit breaking, or dynamic routing.

[screenshot]

The log of the affected clients is filled with information, but I don't see an answer as to why the proxy is marking services as unavailable:

2023-06-28T10:52:48Z    [ 94126.531685s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:36056}:proxy{addr=10.100.109.178:80}:http: linkerd_stack::failfast: Service has become unavailable
2023-06-28T10:52:48Z    [ 94126.531675s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}:orig-proto-upgrade: linkerd_proxy_http::orig_proto: Upgrading request version=HTTP/1.1 absolute_form=false
2023-06-28T10:52:48Z    [ 94126.531668s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}:orig-proto-upgrade: linkerd_proxy_http::client: headers={"host": "service-b.service-b", "user-agent": "Go-http-client/1.1", "content-length": "3081", "content-type": "application/json", "traceparent": "00-083c196cf720aab5452406dfa155fda3-19bc1b8cdb26cebd-00", "service": "service-a", "version": "master-20230627T083604Z-5f42ab", "accept-encoding": "gzip"}
2023-06-28T10:52:48Z    [ 94126.531665s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}:orig-proto-upgrade: linkerd_proxy_http::client: method=POST uri=http://service-b.service-b/v1/models/paccept:predict version=HTTP/1.1
2023-06-28T10:52:48Z    [ 94126.531657s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}: linkerd_proxy_http::classify::channel: state=Some(State { classify: Http(StatusRanges([500..=599])), tx: Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x7faf3e17df00, tail_position: 0 }, semaphore: Semaphore { semaphore: Semaphore { permits: 100 }, bound: 100 }, rx_waker: AtomicWaker, tx_count: 43, rx_fields: "..." } } } })
2023-06-28T10:52:48Z    [ 94126.531641s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:36094}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.64.39:8501}:orig-proto-upgrade: linkerd_proxy_http::orig_proto: Upgrading request version=HTTP/1.1 absolute_form=false
2023-06-28T10:52:48Z    [ 94126.531634s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:36094}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.64.39:8501}:orig-proto-upgrade: linkerd_proxy_http::client: headers={"host": "service-b.service-b", "user-agent": "Go-http-client/1.1", "content-length": "22912", "content-type": "application/json", "traceparent": "00-23856363099ba7654f627b8ab7e2905a-9328a2c69cd37c13-00", "service": "service-a", "version": "master-20230627T083604Z-5f42ab", "accept-encoding": "gzip"}
2023-06-28T10:52:48Z    [ 94126.531630s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:36094}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.64.39:8501}:orig-proto-upgrade: linkerd_proxy_http::client: method=POST uri=http://service-b.service-b/v1/models/paccept:predict version=HTTP/1.1
2023-06-28T10:52:48Z    [ 94126.531620s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:36094}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.64.39:8501}: linkerd_proxy_http::classify::channel: state=Some(State { classify: Http(StatusRanges([500..=599])), tx: Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x7faf3e17e400, tail_position: 0 }, semaphore: Semaphore { semaphore: Semaphore { permits: 100 }, bound: 100 }, rx_waker: AtomicWaker, tx_count: 23, rx_fields: "..." } } } })
2023-06-28T10:52:48Z    [ 94126.531598s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52608}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.60.248:8501}:orig-proto-upgrade: linkerd_proxy_http::orig_proto: Upgrading request version=HTTP/1.1 absolute_form=false
2023-06-28T10:52:48Z    [ 94126.531564s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52608}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.60.248:8501}:orig-proto-upgrade: linkerd_proxy_http::client: headers={"host": "service-b.service-b", "user-agent": "Go-http-client/1.1", "content-length": "94320", "content-type": "application/json", "traceparent": "00-a638ca610eaefb86a771ebaaa411f072-9b465174c9c12082-00", "service": "service-a", "version": "master-20230627T083604Z-5f42ab", "accept-encoding": "gzip"}
2023-06-28T10:52:48Z    [ 94126.531559s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52608}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.60.248:8501}:orig-proto-upgrade: linkerd_proxy_http::client: method=POST uri=http://service-b.service-b/v1/models/paccept:predict version=HTTP/1.1
2023-06-28T10:52:48Z    [ 94126.531541s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52608}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.60.248:8501}: linkerd_proxy_http::classify::channel: state=Some(State { classify: Http(StatusRanges([500..=599])), tx: Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x7faf3e17da00, tail_position: 0 }, semaphore: Semaphore { semaphore: Semaphore { permits: 100 }, bound: 100 }, rx_waker: AtomicWaker, tx_count: 24, rx_fields: "..." } } } })
2023-06-28T10:52:48Z    [ 94126.531466s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52548}: linkerd_app_core::serve: Connection closed
2023-06-28T10:52:48Z    [ 94126.531444s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52548}:proxy{addr=10.100.109.178:80}:http: linkerd_proxy_http::server: The client is shutting down the connection res=Ok(())
2023-06-28T10:52:48Z    [ 94126.531357s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:41748}: linkerd_app_core::serve: Connection closed
2023-06-28T10:52:48Z    [ 94126.531314s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:41748}:proxy{addr=10.100.109.178:80}:http: linkerd_proxy_http::server: The stack is tearing down the connection
2023-06-28T10:52:48Z    [ 94126.531265s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:35784}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}: linkerd_app_outbound::http::handle_proxy_error_headers: Remote proxy error
2023-06-28T10:52:48Z    [ 94126.531237s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:35776}:proxy{addr=10.100.109.178:80}:http: linkerd_proxy_http::classify::channel: sending EOS to classify
2023-06-28T10:52:48Z    [ 94126.531232s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:35776}:proxy{addr=10.100.109.178:80}:http: linkerd_proxy_http::classify::channel: dropping ResponseBody
2023-06-28T10:52:48Z    [ 94126.531214s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}: linkerd_app_outbound::http::handle_proxy_error_headers: Remote proxy error
2023-06-28T10:52:48Z    [ 94126.531173s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:35720}:proxy{addr=10.100.109.178:80}:http: linkerd_proxy_http::classify::channel: sending EOS to classify
2023-06-28T10:52:48Z    [ 94126.531165s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:35720}:proxy{addr=10.100.109.178:80}:http: linkerd_proxy_http::classify::channel: dropping ResponseBody
2023-06-28T10:52:48Z    [ 94126.531148s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:35766}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}: linkerd_app_outbound::http::handle_proxy_error_headers: Remote proxy error
2023-06-28T10:52:48Z    [ 94126.531121s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}: linkerd_stack::failfast: Service has become unavailable
2023-06-28T10:52:48Z    [ 94126.531109s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}:orig-proto-upgrade: linkerd_proxy_http::orig_proto: Upgrading request version=HTTP/1.1 absolute_form=false
2023-06-28T10:52:48Z    [ 94126.531105s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}:orig-proto-upgrade: linkerd_proxy_http::client: headers={"host": "service-b.service-b", "user-agent": "Go-http-client/1.1", "content-length": "52431", "content-type": "application/json", "traceparent": "00-8c14a06804f62ea8bb6f352ee3f4f0e4-4acb6b9ebed60214-00", "service": "service-a", "version": "master-20230627T083604Z-5f42ab", "accept-encoding": "gzip"}
2023-06-28T10:52:48Z    [ 94126.531102s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}:orig-proto-upgrade: linkerd_proxy_http::client: method=POST uri=http://service-b.service-b/v1/models/paccept:predict version=HTTP/1.1
2023-06-28T10:52:48Z    [ 94126.530991s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:35766}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.64.39:8501}: linkerd_proxy_http::classify::channel: state=Some(State { classify: Http(StatusRanges([500..=599])), tx: Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x7faf3e17e400, tail_position: 0 }, semaphore: Semaphore { semaphore: Semaphore { permits: 100 }, bound: 100 }, rx_waker: AtomicWaker, tx_count: 22, rx_fields: "..." } } } })
2023-06-28T10:52:48Z    [ 94126.530792s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52548}:proxy{addr=10.100.109.178:80}:http: linkerd_app_core::errors::respond: Handling error on HTTP connection status=503 Service Unavailable version=HTTP/1.1 close=true
2023-06-28T10:52:48Z    [ 94126.530791s] DEBUG ThreadId(02) outbound: linkerd_app_core::serve: Connection closed
2023-06-28T10:52:48Z    [ 94126.530785s]  INFO ThreadId(03) outbound:accept{client.addr=172.16.85.145:52548}:proxy{addr=10.100.109.178:80}:http:rescue{client.addr=172.16.85.145:52548}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.100.109.178:80: service unavailable error.sources=[service unavailable]
2023-06-28T10:52:48Z    [ 94126.530775s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52548}:proxy{addr=10.100.109.178:80}:http: linkerd_stack::loadshed: Service shedding load
2023-06-28T10:52:48Z    [ 94126.530767s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52548}:proxy{addr=10.100.109.178:80}:http: linkerd_stack::loadshed: Service has become unavailable

Metrics are not helpful either: according to them everything is working fine, but it isn't. The proxy is not counting those 503s as failed requests, so they're not visible in viz or Prometheus. The only real clue comes from the transport-level metrics, where we see the same loadshed errors that are visible in the debug logs; here's an example from the same client pod as the logs above:

[screenshot]

Sadly I don't have reproduction instructions, but something must have changed between 2.12 and 2.13 that causes this odd behaviour.
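
One way to spot-check this on an affected pod without going through Prometheus is to pull the raw proxy metrics from the proxy's admin port and grep for the loadshed/failfast counters mentioned above; a rough sketch, with namespace and pod names as placeholders, the default admin port assumed, and the caveat that exact metric and label names depend on the proxy version:

kubectl -n my-namespace port-forward pod/my-client-pod 4191:4191 &
curl -s http://localhost:4191/metrics | grep -E 'loadshed|failfast'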

MarkSRobinson commented 1 year ago

@someone-stole-my-name What version were you running before?

vkperumal commented 1 year ago

@MarkSRobinson Issues started for me after upgrading from 2.11.1 to 2.13.4. Issue still exists on 2.13.5 as well

someone-stole-my-name commented 1 year ago

@someone-stole-my-name What version were you running before?

2.12.4

We've reverted just the data plane to the previous version and things are stable again, so far. We replaced some core components first, and in the rollback process noticed something that may give some clues on where to look. Reverting just the outbound proxies doesn't fix the issue: you still get slightly different errors from the v2.12 proxy, but the same outcome, with the service marked as unavailable and a 503 returned to the client:

2023-06-28T15:32:28Z    [   512.908510s]  WARN ThreadId(02) outbound:proxy{addr=10.100.109.178:80}: linkerd_reconnect: Service failed error=channel closed
2023-06-28T15:32:28Z    [   512.908502s]  WARN ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:rescue{client.addr=172.16.61.121:38392}: linkerd_app_outbound::http::server: Unexpected error error=operation was canceled: connection closed error.sources=[connection closed]
2023-06-28T15:32:28Z    [   512.908498s]  INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:rescue{client.addr=172.16.61.121:38392}: linkerd_app_core::errors::respond: Request failed error=operation was canceled: connection closed error.sources=[connection closed]
2023-06-28T15:32:28Z    [   512.908460s]  INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:rescue{client.addr=172.16.61.121:38392}: linkerd_app_core::errors::respond: Request failed error=operation was canceled: connection closed error.sources=[connection closed]
2023-06-28T15:32:28Z    [   512.854087s]  INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z    [   512.854013s]  INFO ThreadId(02) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z    [   512.854006s]  INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z    [   512.853958s]  INFO ThreadId(02) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z    [   512.853927s]  INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z    [   512.853877s]  INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z    [   512.853843s]  INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z    [   512.853806s]  INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z    [   512.853762s]  INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z    [   512.853718s]  INFO ThreadId(02) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z    [   512.853714s]  INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z    [   512.853647s]  INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z    [   512.853594s]  INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z    [   512.853582s]  INFO ThreadId(02) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z    [   512.853507s]  INFO ThreadId(02) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z    [   512.853455s]  INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:25:42Z    [   106.568334s]  WARN ThreadId(02) outbound:proxy{addr=10.100.109.178:80}: linkerd_reconnect: Service failed error=channel closed
2023-06-28T15:25:42Z    [   106.568300s]  WARN ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:rescue{client.addr=172.16.61.121:39856}: linkerd_app_outbound::http::server: Unexpected error error=operation was canceled: connection closed error.sources=[connection closed]
2023-06-28T15:25:42Z    [   106.568295s]  INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:rescue{client.addr=172.16.61.121:39856}: linkerd_app_core::errors::respond: Request failed error=operation was canceled: connection closed error.sources=[connection closed]
2023-06-28T15:25:42Z    [   106.568265s]  INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:rescue{client.addr=172.16.61.121:39856}: linkerd_app_core::errors::respond: Request failed error=operation was canceled: connection closed error.sources=[connection closed]
2023-06-28T15:25:42Z    [   106.525745s]  INFO ThreadId(02) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:38646: server: 172.16.64.39:8501: server 172.16.64.39:8501: service service-b.service-b.svc.cluster.local:80: service unavailable

Reverting the inbound proxy to 2.12 apparently fixes this issue. Here are some logs from the inbound side while still using 2.13.4. Notice how the pattern changes around 2023-06-28T15:32:28Z [195845.387669s], then all connections are dropped and established again around 2023-06-28T15:32:28Z [195845.401767s].
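
For reference, one way to pin just a workload's data-plane proxy to an older version (as in the partial rollback described above, though not necessarily how it was done here) is the per-workload proxy-version annotation; a sketch with placeholder names:

kubectl -n my-namespace patch deploy/my-workload -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-version":"stable-2.12.5"}}}}}'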

alpeb commented 1 year ago

@vkperumal Can you tell if the target services are under high load when those errors occur, like reported by @someone-stole-my-name ?

vkperumal commented 1 year ago

@alpeb My application is a scheduler which makes outbound http calls to other apps. There is no load on the application

juhomakitalo commented 1 year ago

Having the same errors after upgrading to 2.13.5, and the only load was communication between the services in our application. The errors were fixed after downgrading to 2.12.5. (For some reason our external Prometheus metrics also stopped working after upgrading linkerd-viz to 2.13.5, but started working again after downgrading to 2.12.5.)

tomhaddad commented 1 year ago

Also experiencing the same issue - seems it was introduced in 2.13.0, as I have switched between 2.12.5 / 2.13.0 today and seen no issues in the former.

It is also very specifically occurring when making over 100 concurrent outgoing requests. Any fewer and everything works as normal; any more (even 101) causes the same errors described in the original issue and crashes the respective client pod.

ali-accesso commented 1 year ago

We are also experiencing unpredictable connection issues between our services since upgrading from 2.12.5 to 2.13.5. There is no heavy load on our apps at the moment, and we still can't figure out what could cause this issue.

WARN ThreadId(01) outbound:proxy{addr=10.0.187.210:80}:rescue{client.addr=10.1.17.12:41126}: linkerd_app_outbound::http::server: Unexpected error error=logical service 10.0.187.210:80: default service.: endpoint 10.1.17.12:80: operation was canceled: connection was not ready error.sources=[default service.: endpoint 10.1.17.12:80: operation was canceled: connection was not ready, endpoint 10.1.17.12:80: operation was canceled: connection was not ready, operation was canceled: connection was not ready, connection was not ready]
[  7750.378237s]  INFO ThreadId(01) outbound:proxy{addr=10.0.187.210:80}:rescue{client.addr=10.1.17.12:41126}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 10.0.187.210:80: default service.: endpoint 10.1.17.12:80: operation was canceled: connection was not ready error.sources=[default service.: endpoint 10.1.17.12:80: operation was canceled: connection was not ready, endpoint 10.1.17.12:80: operation was canceled: connection was not ready, operation was canceled: connection was not ready, connection was not ready]
INFO ThreadId(01) outbound:proxy{addr=10.0.187.210:80}:service{ns= name=service port=0}:endpoint{addr=10.1.17.12:80}:rescue{client.addr=10.1.17.12:41126}: linkerd_app_core::errors::respond: gRPC request failed error=endpoint 10.1.17.12:80: operation was canceled: connection was not ready error.sources=[operation was canceled: connection was not ready, connection was not ready]

bjoernw commented 1 year ago

I'm having this same issue on a cluster with several hundred meshed pods. I'm on 2.13.4. I worry that it's an out-of-date endpoints cache issue that is causing the whole service to be deemed unavailable. The only way to recover from this issue is to restart all the proxies that are logging this issue.
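
A minimal sketch of that recovery step, rolling the workloads whose proxies are logging the errors (namespace and deployment names are placeholders):

kubectl -n my-namespace rollout restart deploy/my-affected-workload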

jandersen-plaid commented 1 year ago

I worry that it's an out-of-date endpoints cache issue

Similar to other posters, we have run into this issue and reported it elsewhere (see https://linkerd.buoyant.io/t/linkerd-destination-has-stale-data-in-endpoint-cache/171). @bjoernw, we also think it is a stale endpoints cache, but we found that we could recover from incidents of it by restarting the destination controllers (presumably this sends a RESET to the affected proxies, and that clears their caches?).
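
A sketch of that recovery step, assuming the default control-plane namespace and the stock linkerd-destination deployment name:

kubectl -n linkerd rollout restart deploy/linkerd-destination
kubectl -n linkerd rollout status deploy/linkerd-destination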

We could not reproduce it after ~2 weeks of trying, but our suspicion is that what happens is:

  1. An endpoint or discovery cache (distribute's BackendCache maybe?) in the proxy fills up with balancer endpoints that are pending (whether through a cache without aggressive enough eviction or some other reason).
  2. Some failure causes the proxy to find all endpoints in a pending state, and the service moves into a loadshed state after going into failfast.
  3. When the proxy goes into a loadshed it starts to consider alternate backends & endpoints that are in the affected cache.
  4. Those balancer endpoints that are in pending are then each considered to try and find one available.
  5. Somehow the appropriate endpoints // the ready endpoints are never hit upon or evicted from the cache????
  6. This causes the affected proxies to never be able to route requests to ready endpoints, so the loadshedding // failfast never recovers
  7. Removing all affected proxies forces the endpoints to be refreshed, or restarting the destination control plane forces a RESET of the endpoint caches in the proxies, which moves the state out of loadshed // failfast

Of course, this is all conjecture: without a reproduction it is really hard to prove any of this is happening or to get the necessary log info to track down which cache is failing in what way. Since we don't log at debug level by default, all we have are our metrics, which indicate that outbound_http_balancer_endpoints had a large number of pending endpoints during the incident, some correlation (from our CNI) that stale IPs were being accessed around the time of the incident and that this resolved directly after we restarted the destination controller, and a huge rise in loadshed errors from outbound_http_errors_total.
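
A rough way to capture the same signals from a single affected pod, outside Prometheus, is to dump those two metric families with the diagnostics command; a sketch with placeholder names (the exact label names vary by proxy version):

linkerd diagnostics proxy-metrics -n my-namespace pod/my-affected-pod \
  | grep -E 'outbound_http_balancer_endpoints|outbound_http_errors_total'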

Correlation between loadshedding behavior and stale IPs (note that stale IPs were being accessed before the major errors start, which indicates that some bad endpoints in the cache were being accessed on a relatively frequent basis, but not considered for routing HTTP requests until something -- a rise in latency or a few 504s -- triggered failfast // loadshedding behavior):

[screenshot]

Relationship between loadshedding errors and endpoints:

[screenshot]

(it is worth noting that the divot in the graph is where we attempted to restart the affected inbound proxies // pods, but were unable to recover from the situation as we did not restart the nginx pods which were sending the outbound traffic to the affected deployment. At the end, the ready endpoints spike up as we considered that it was an availability issue and provisioned more pods (to no avail), and then finally the errors resolve when we restart the destination controller).

Hopefully someone can come up with a reproduction, but we are going to downgrade to 2.12.5 for the time being.

bjoernw commented 1 year ago

@jandersen-plaid Great write-up, makes sense to me. There have been a couple of issues related to endpoints being out-of-date recently https://github.com/linkerd/linkerd2/issues/10003 https://github.com/linkerd/linkerd2/issues/8956

I'm going to set up a job that continuously monitors the diff between the k8s Endpoints API and Linkerd's diagnostics command. I just ran this during one of our canary deploys and did see some IPs in the Linkerd list that weren't in the endpoints list for a few seconds. That's definitely expected since this is eventually consistent, but I'm going to be looking for the scenario where the list doesn't reconcile over time for a specific destination pod.

I see you're also having this issue primarily on your ingress controller. That's usually where it starts for us but then it spreads randomly around to other service-to-service calls as well. I guess the one thing they all have in common is the cache in the destination controller...

If we run this once a minute we should be able to see when a certain pod's cache doesn't seem to recover.

#!/bin/bash

NAMESPACE=my-namespace
ENDPOINT=my-service
PORT=3000

# IPs currently listed in the service's EndpointSlices (the Kubernetes view)
ENDPOINTS=$(kubectl get endpointslice -n $NAMESPACE -l kubernetes.io/service-name=$ENDPOINT -o json | jq -r '.items[].endpoints[].addresses[]' | sort)

LINKERD_DESTINATION_PODS=$(kubectl get pods -n linkerd -l linkerd.io/control-plane-component=destination -o jsonpath='{.items[*].metadata.name}')

# Compare each destination controller's view of the endpoints against the EndpointSlices
for POD in $LINKERD_DESTINATION_PODS
do
  LINKERD_DIAGNOSTICS=$(linkerd diagnostics endpoints --destination-pod $POD $ENDPOINT.$NAMESPACE.svc.cluster.local:$PORT | awk 'NR>1 {print $2}' | sort)

  # IPs present in the EndpointSlices but missing from this destination controller
  ONLY_IN_ENDPOINTS=$(comm -23 <(echo "$ENDPOINTS") <(echo "$LINKERD_DIAGNOSTICS"))

  # IPs this destination controller still serves that are no longer in the EndpointSlices
  ONLY_IN_LINKERD=$(comm -13 <(echo "$ENDPOINTS") <(echo "$LINKERD_DIAGNOSTICS"))

  if [[ -z $ONLY_IN_ENDPOINTS && -z $ONLY_IN_LINKERD ]]
  then
    echo "Both IP sets are identical for Pod $POD."
  else
    if [[ -n $ONLY_IN_ENDPOINTS ]]
    then
      echo "IPs only in EndpointSlices for Pod $POD:"
      echo "$ONLY_IN_ENDPOINTS"
    fi

    if [[ -n $ONLY_IN_LINKERD ]]
    then
      echo "IPs only in Linkerd Diagnostics for Pod $POD:"
      echo "$ONLY_IN_LINKERD"
    fi
  fi
done
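
A trivial way to run the check once a minute, as suggested above (assuming the script is saved as diff-endpoints.sh, a name used here only for illustration):

while true; do
  date
  ./diff-endpoints.sh
  sleep 60
done
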
bjoernw commented 1 year ago

@jandersen-plaid Are you seeing any endpoint update failures in your k8s events around the time of the issue?

I'm seeing an elevated rate of these on the problematic service:

1461 FailedToUpdateEndpoint: Failed to update endpoint my-namespace/my-service: Operation cannot be fulfilled on endpoints "my-service": the object has been modified; please apply your changes to the latest version and try again
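
A hedged sketch of how to pull those events for the affected service (namespace and service names are placeholders):

kubectl get events -n my-namespace \
  --field-selector reason=FailedToUpdateEndpoint,involvedObject.name=my-service
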
jandersen-plaid commented 1 year ago

@jandersen-plaid Are you seeing any endpoint update failures in your k8s events around the time of the issue?

I'm seeing an elevated rate of these on the problematic service:

1461 FailedToUpdateEndpoint: Failed to update endpoint my-namespace/my-service: Operation cannot be fulfilled on endpoints "my-service": the object has been modified; please apply your changes to the latest version and try again

Hmmm is this coming from the endpoint controller or linkerd? On the endpoint controller side, we do see this error (a 409 in api server logs because of the resource conflict) around the incident, but it also happens rather frequently in the cluster outside of the incident window as well. We may have seen it on the linkerd side during the incident, but some of our logs from the most recent occurrence have reduced granularity at this point, so I can't confirm that.

bjoernw commented 1 year ago

@jandersen-plaid that's from the endpoint controller. In the past I had suspected that throttling at the API server level was causing the Endpoints API itself to be out of sync with the current set of Pods, hence leading the Linkerd cache to be out of sync, but it seems like the problem lies between the endpoints and Linkerd. I'm running a version of the script above in the cluster now with automatic restarts of the destination controller and extensive logging. I will report back if I find anything interesting.

nikskiz commented 1 year ago

We have experienced the same issue, and rolling back the proxy to linkerd2 stable-2.12.x worked. When observing the exposed Linkerd metrics we could see the "loadshed" error as outlined above by @jandersen-plaid.

This is easily reproduced by following the steps below

Prerequisites

Steps to replicate proxy errors "service unavailable"

  1. Create the k6 test

    cat <<EOF >> k6s-test.js
    // import necessary modules
    import { check } from "k6";
    import http from "k6/http";

    // define configuration
    export const options = {
      // define thresholds
      thresholds: {
        http_req_failed: ['rate<0.01'],    // http errors should be less than 1%
        http_req_duration: ['p(99)<1000'], // 99% of requests should be below 1s
      },
      // define scenarios
      scenarios: {
        // arbitrary name of scenario
        average_load: {
          executor: "ramping-vus",
          stages: [
            { duration: "10s", target: 500 },
          ],
        },
      }
    };

    export default function () {
      // define URL and request body
      const url = "http://localhost:8080/api/vote?choice=:rabbit:";
      const payload = JSON.stringify({ choice: ":rabbit:" });
      const params = {
        headers: {
          "Content-Type": "application/json",
        },
      };

      // send a post request and save response as a variable
      const res = http.post(url, payload, params);

      // check that response is 200
      check(res, {
        "response code was 200": (res) => res.status == 200,
      });
    }
    EOF


2. Install k8s cluster using kind

kind create cluster


3. Install Linkerd2 CRDS

linkerd install --crds | kubectl apply -f -


4. Install Linkerd2

linkerd install | kubectl apply -f -


5. Install viz

linkerd viz install | kubectl apply -f -


6. Validate installation

linkerd check


7. Install emojivoto

linkerd inject https://run.linkerd.io/emojivoto.yml | kubectl apply -f -


8. Check emojivoto status

kubectl -n emojivoto rollout status deploy/web


9. In a separate terminal port forward to emojivoto web app

kubectl port-forward deployment/web 8080 -n emojivoto


10. In a third terminal, follow the proxy logs on the emojivoto `web` deployment

kubectl logs -f -n emojivoto deployment/web -c linkerd-proxy


11. Run the k6 test and observe the results

k6 run k6s-test.js
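
While the test runs, it can also help to watch per-deployment success rates via the viz extension installed in step 5; a simple spot-check:

linkerd viz stat deploy -n emojivoto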

jandersen-plaid commented 1 year ago

Thanks for the reproduction @nikskiz! The results of running the test on 2.13.4 vs 2.12.5 are staggering:

Note: the difference in requests sent can likely be explained by the faster failure in 2.13.4: less time on requests leads to more iterations, but they shouldn't fail in the first place!

Linkerd 2.13.4

k6 Summary

     ✗ response code was 200
      ↳  32% — ✓ 24137 / ✗ 49549

     checks.........................: 32.75% ✓ 24137       ✗ 49549
     data_received..................: 13 MB  1.3 MB/s
     data_sent......................: 15 MB  1.5 MB/s
     http_req_blocked...............: avg=62.37µs min=710ns    med=80.02µs max=36.37ms  p(90)=115.5µs  p(95)=127.52µs
     http_req_connecting............: avg=41.34µs min=0s       med=53.11µs max=36.32ms  p(90)=78.68µs  p(95)=87.03µs
   ✓ http_req_duration..............: avg=33.98ms min=322.75µs med=29.03ms max=126.73ms p(90)=70.71ms  p(95)=96.58ms
       { expected_response:true }...: avg=51.81ms min=2.22ms   med=47.98ms max=126.73ms p(90)=103.06ms p(95)=108.87ms
   ✗ http_req_failed................: 67.24% ✓ 49549       ✗ 24137
     http_req_receiving.............: avg=37µs    min=7.56µs   med=26.99µs max=11.63ms  p(90)=49.68µs  p(95)=56.16µs
     http_req_sending...............: avg=27.87µs min=5.24µs   med=23.12µs max=20.87ms  p(90)=37.21µs  p(95)=42.27µs
     http_req_tls_handshaking.......: avg=0s      min=0s       med=0s      max=0s       p(90)=0s       p(95)=0s
     http_req_waiting...............: avg=33.91ms min=278.29µs med=28.94ms max=126.68ms p(90)=70.67ms  p(95)=96.52ms
     http_reqs......................: 73686  7315.819808/s
     iteration_duration.............: avg=34.12ms min=400.5µs  med=29.19ms max=126.87ms p(90)=70.78ms  p(95)=96.76ms
     iterations.....................: 73686  7315.819808/s
     vus............................: 497    min=48        max=497
     vus_max........................: 500    min=500       max=500
k6 logs

      execution: local
         script: /opt/k6/script.js
         output: -

      scenarios: (100.00%) 1 scenario, 500 max VUs, 40s max duration (incl. graceful stop):
               * average_load: Up to 500 looping VUs for 10s over 1 stages (gracefulRampDown: 30s, gracefulStop: 30s)

    running (01.0s), 048/500 VUs, 3368 complete and 0 interrupted iterations
    average_load   [ 10% ] 048/500 VUs  01.0s/10.0s
    running (02.0s), 098/500 VUs, 7347 complete and 0 interrupted iterations
    average_load   [ 20% ] 098/500 VUs  02.0s/10.0s
    running (03.0s), 148/500 VUs, 11229 complete and 0 interrupted iterations
    average_load   [ 30% ] 148/500 VUs  03.0s/10.0s
    running (04.0s), 198/500 VUs, 17293 complete and 0 interrupted iterations
    average_load   [ 40% ] 198/500 VUs  04.0s/10.0s
    running (05.0s), 248/500 VUs, 25458 complete and 0 interrupted iterations
    average_load   [ 50% ] 248/500 VUs  05.0s/10.0s
    running (06.0s), 298/500 VUs, 35032 complete and 0 interrupted iterations
    average_load   [ 60% ] 298/500 VUs  06.0s/10.0s
    running (07.0s), 348/500 VUs, 44495 complete and 0 interrupted iterations
    average_load   [ 70% ] 348/500 VUs  07.0s/10.0s
    running (08.0s), 398/500 VUs, 53949 complete and 0 interrupted iterations
    average_load   [ 80% ] 398/500 VUs  08.0s/10.0s
    running (09.0s), 447/500 VUs, 63643 complete and 0 interrupted iterations
    average_load   [ 90% ] 447/500 VUs  09.0s/10.0s
    running (10.0s), 497/500 VUs, 72804 complete and 0 interrupted iterations
    average_load   [ 100% ] 497/500 VUs  10.0s/10.0s
    running (10.1s), 000/500 VUs, 73686 complete and 0 interrupted iterations
    average_load ✓ [ 100% ] 000/500 VUs  10s

    time="2023-07-26T15:29:41Z" level=error msg="some thresholds have failed"
`linkerd-proxy` container logs on `web` pod

    # Snippet of linkerd-proxy container logs on web pod during k6 test
    ...
    [ 900.650890s] INFO ThreadId(01) outbound:proxy{addr=100.68.109.136:8080}:rescue{client.addr=100.96.51.98:56736}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 100.68.109.136:8080: service unavailable error.sources=[service unavailable]
    [ 900.650896s] INFO ThreadId(01) outbound:proxy{addr=100.68.109.136:8080}:rescue{client.addr=100.96.51.98:56736}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 100.68.109.136:8080: service unavailable error.sources=[service unavailable]
    [ 900.650904s] INFO ThreadId(01) outbound:proxy{addr=100.68.109.136:8080}:rescue{client.addr=100.96.51.98:56736}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 100.68.109.136:8080: service unavailable error.sources=[service unavailable]
    [ 900.650910s] INFO ThreadId(01) outbound:proxy{addr=100.68.109.136:8080}:rescue{client.addr=100.96.51.98:56736}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 100.68.109.136:8080: service unavailable error.sources=[service unavailable]
    [ 900.650916s] INFO ThreadId(01) outbound:proxy{addr=100.68.109.136:8080}:rescue{client.addr=100.96.51.98:56736}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 100.68.109.136:8080: service unavailable error.sources=[service unavailable]
    ...
Logs from `web-svc` container on the `web` pod

    # Snippet of web-svc container logs on the web pod during the k6 test
    ...
    2023/07/26 15:31:36 Error serving request [&{POST /api/vote?choice=:rabbit: HTTP/1.1 1 1 map[Content-Length:[21] Content-Type:[application/json] L5d-Client-Id:[default.emojivoto.serviceaccount.identity.linkerd.cluster.local] User-Agent:[k6/0.38.3 (https://k6.io/)]] 0xc000df0d00 21 [] false web-svc.emojivoto.svc.cluster.local:8080 map[choice:[:rabbit:]] map[] map[] 100.96.51.98:59656 /api/vote?choice=:rabbit: 0xc0010d7a70}]: rpc error: code = Unavailable desc = logical service 100.68.109.136:8080: service unavailable
    2023/07/26 15:31:36 Error serving request [&{POST /api/vote?choice=:rabbit: HTTP/1.1 1 1 map[Content-Length:[21] Content-Type:[application/json] L5d-Client-Id:[default.emojivoto.serviceaccount.identity.linkerd.cluster.local] User-Agent:[k6/0.38.3 (https://k6.io/)]] 0xc000b55880 21 [] false web-svc.emojivoto.svc.cluster.local:8080 map[choice:[:rabbit:]] map[] map[] 100.96.51.98:59246 /api/vote?choice=:rabbit: 0xc00074b620}]: rpc error: code = Unavailable desc = logical service 100.68.109.136:8080: service unavailable
    2023/07/26 15:31:36 Error serving request [&{POST /api/vote?choice=:rabbit: HTTP/1.1 1 1 map[Content-Length:[21] Content-Type:[application/json] L5d-Client-Id:[default.emojivoto.serviceaccount.identity.linkerd.cluster.local] User-Agent:[k6/0.38.3 (https://k6.io/)]] 0xc00124ea00 21 [] false web-svc.emojivoto.svc.cluster.local:8080 map[choice:[:rabbit:]] map[] map[] 100.96.51.98:54418 /api/vote?choice=:rabbit: 0xc000290f30}]: rpc error: code = Unavailable desc = logical service 100.68.109.136:8080: service unavailable
    ...

Linkerd 2.12.5

k6 Summary

     ✓ response code was 200

     checks.........................: 100.00% ✓ 34525       ✗ 0
     data_received..................: 2.6 MB  257 kB/s
     data_sent......................: 7.0 MB  698 kB/s
     http_req_blocked...............: avg=8.84µs  min=770ns  med=1.74µs  max=41.05ms  p(90)=2.25µs   p(95)=2.63µs
     http_req_connecting............: avg=5.83µs  min=0s     med=0s      max=40.99ms  p(90)=0s       p(95)=0s
   ✓ http_req_duration..............: avg=73.06ms min=1.29ms med=73.14ms max=150.3ms  p(90)=131.12ms p(95)=139.24ms
       { expected_response:true }...: avg=73.06ms min=1.29ms med=73.14ms max=150.3ms  p(90)=131.12ms p(95)=139.24ms
   ✓ http_req_failed................: 0.00%   ✓ 0           ✗ 34525
     http_req_receiving.............: avg=23.62µs min=8.43µs med=21.61µs max=4.63ms   p(90)=30.34µs  p(95)=35.57µs
     http_req_sending...............: avg=13.65µs min=5.69µs med=12.05µs max=730.65µs p(90)=19.75µs  p(95)=22.18µs
     http_req_tls_handshaking.......: avg=0s      min=0s     med=0s      max=0s       p(90)=0s       p(95)=0s
     http_req_waiting...............: avg=73.02ms min=1.26ms med=73.09ms max=150.26ms p(90)=131.09ms p(95)=139.2ms
     http_reqs......................: 34525   3421.759353/s
     iteration_duration.............: avg=73.15ms min=1.37ms med=73.24ms max=150.39ms p(90)=131.21ms p(95)=139.32ms
     iterations.....................: 34525   3421.759353/s
     vus............................: 498     min=49        max=498
     vus_max........................: 500     min=500       max=500
k6 logs

      execution: local
         script: /opt/k6/script.js
         output: -

      scenarios: (100.00%) 1 scenario, 500 max VUs, 40s max duration (incl. graceful stop):
               * average_load: Up to 500 looping VUs for 10s over 1 stages (gracefulRampDown: 30s, gracefulStop: 30s)

    running (01.0s), 049/500 VUs, 3176 complete and 0 interrupted iterations
    average_load   [ 10% ] 049/500 VUs  01.0s/10.0s
    running (02.0s), 099/500 VUs, 6627 complete and 0 interrupted iterations
    average_load   [ 20% ] 099/500 VUs  02.0s/10.0s
    running (03.0s), 149/500 VUs, 10086 complete and 0 interrupted iterations
    average_load   [ 30% ] 149/500 VUs  03.0s/10.0s
    running (04.0s), 198/500 VUs, 13493 complete and 0 interrupted iterations
    average_load   [ 40% ] 198/500 VUs  04.0s/10.0s
    running (05.0s), 248/500 VUs, 16913 complete and 0 interrupted iterations
    average_load   [ 50% ] 248/500 VUs  05.0s/10.0s
    running (06.0s), 298/500 VUs, 20348 complete and 0 interrupted iterations
    average_load   [ 60% ] 298/500 VUs  06.0s/10.0s
    running (07.0s), 348/500 VUs, 23759 complete and 0 interrupted iterations
    average_load   [ 70% ] 348/500 VUs  07.0s/10.0s
    running (08.0s), 398/500 VUs, 27167 complete and 0 interrupted iterations
    average_load   [ 80% ] 398/500 VUs  08.0s/10.0s
    running (09.0s), 448/500 VUs, 30527 complete and 0 interrupted iterations
    average_load   [ 90% ] 448/500 VUs  09.0s/10.0s
    running (10.0s), 498/500 VUs, 33915 complete and 0 interrupted iterations
    average_load   [ 100% ] 498/500 VUs  10.0s/10.0s
    running (10.1s), 000/500 VUs, 34525 complete and 0 interrupted iterations
    average_load ✓ [ 100% ] 000/500 VUs  10s
`linkerd-proxy` container logs on `web` pod

    # linkerd-proxy container logs on web pod during k6 test
    Defaulted container "linkerd-proxy" out of: linkerd-proxy, web-svc, linkerd-init (init)
    [ 0.001443s] INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
    [ 0.001781s] INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
    [ 0.001792s] INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
    [ 0.001794s] INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
    [ 0.001795s] INFO ThreadId(01) linkerd2_proxy: Tap DISABLED
    [ 0.001797s] INFO ThreadId(01) linkerd2_proxy: Local identity is web.emojivoto.serviceaccount.identity.linkerd.cluster.local
    [ 0.001798s] INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
    [ 0.001800s] INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
    [ 1.003837s] WARN ThreadId(01) policy:watch{port=8080}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=100.101.17.33:8090}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
    [ 1.003882s] WARN ThreadId(01) policy:watch{port=8080}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=100.97.119.64:8090}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
    [ 1.003891s] WARN ThreadId(01) policy:watch{port=8080}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=100.100.230.80:8090}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
    [ 1.005052s] WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:endpoint{addr=100.101.17.175:8080}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
    [ 1.005091s] WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:endpoint{addr=100.100.230.162:8080}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
    [ 1.005100s] WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:endpoint{addr=100.96.168.60:8080}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
    [ 1.122383s] INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity id=web.emojivoto.serviceaccount.identity.linkerd.cluster.local
Logs from `web-svc` container on the `web` pod

```
2023/07/26 15:48:35 Connecting to [voting-svc.emojivoto:8080]
2023/07/26 15:48:35 Connecting to [emoji-svc.emojivoto:8080]
2023/07/26 15:48:35 Starting web server on WEB_PORT=[8080]
```

Clarifying a bit on the particular problem we are seeing in 2.13 (happy to file an additional issue if this is too far off topic):

@nikskiz your test perfectly captures the overly eager request failures (our second problem) we are seeing within Linkerd.

nikskiz commented 1 year ago

@jandersen-plaid Thanks for the details on your first issue. We may still be able to reproduce it by modifying the load test parameters: we can use k6 stages to ramp the test down and see whether the error persists under "normal" load. The stages would look as follows (a fuller script sketch is included after the snippet):

        stages: [
          // Produce load on a steady environment to get the loadshed error
          { duration: "10s", target: 500 },
          // Ramp down the load to what the proxy can handle and restart the downstream "voting" app to get new endpoints
          { duration: "5s", target: 20 },
          // Hold for 60s and observe
          { duration: "60s", target: 20 },
        ],
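
For anyone who wants to run this end to end, here is a minimal k6 script sketch built around those stages. It is only a sketch: the scenario name mirrors the `average_load` scenario in the output above, and the target URL (`TARGET_URL`, defaulting to the emojivoto `web-svc`) is an assumption you should replace with whatever meshed service you are load testing.

```
// loadshed-repro.js -- minimal k6 sketch; TARGET_URL is an assumption, adjust for your environment.
import http from "k6/http";
import { check } from "k6";

export const options = {
  scenarios: {
    average_load: {
      executor: "ramping-vus",
      stages: [
        // Produce load on a steady environment to get the loadshed error
        { duration: "10s", target: 500 },
        // Ramp down to a load the proxy can handle; restart the downstream app during this window
        { duration: "5s", target: 20 },
        // Hold for 60s and observe
        { duration: "60s", target: 20 },
      ],
      gracefulRampDown: "30s",
    },
  },
};

export default function () {
  const res = http.get(__ENV.TARGET_URL || "http://web-svc.emojivoto:80");
  check(res, { "response code was 200": (r) => r.status === 200 });
}
```

Run it with something like `k6 run -e TARGET_URL=http://web-svc.emojivoto:80 loadshed-repro.js` from a meshed pod so that the traffic actually goes through the outbound proxy.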

In the first 10 seconds we should be able to get the failures. Then, over the next 5 seconds, we ramp down to 20 VUs and hold there for another 60s. You should restart the downstream app in the first 10 seconds so that it gets new endpoints; assuming the proxy is still sending requests to the old endpoints, we should still see errors during the ramp-down (after 15 seconds). One way to do the restart is sketched below.
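
A sketch of that restart, assuming the emojivoto demo where the downstream app is the `voting` deployment (adjust names for your setup):

```
# Recreate the voting pods so the web proxy's balancer has to pick up a fresh set of endpoints.
kubectl -n emojivoto rollout restart deploy/voting

# Optionally watch the endpoints change while the k6 test is running.
kubectl -n emojivoto get endpoints voting-svc -w
```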

I did run this locally and wasn't able to reproduce any errors. I could see the errors disappear when the ramp-down started, even though the downstream app received a new endpoint.

The only way I was able to see the "service unavailable" error again was when I forcefully deleted the voting app's endpoint, so that it had no endpoints at all (one way to reproduce that state is sketched below).
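
A quick way to reproduce that no-endpoints state without editing the Endpoints object directly (again a sketch, assuming the emojivoto `voting` deployment) is to scale the downstream app to zero and back:

```
# With zero replicas the service has no ready endpoints, so the proxy reports "service unavailable".
kubectl -n emojivoto scale deploy/voting --replicas=0

# Restore the app afterwards.
kubectl -n emojivoto scale deploy/voting --replicas=1
```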

bjoernw commented 1 year ago

This is a huge problem. The destination controller is broken on 2.13.4. Occasionally under load (I guess?) it will stop responding to diagnostics commands and also won't serve up new IPs to proxies, leading to a cluster-wide failure until the controller is restarted. We're running 300+ meshed pods with 5 pods per control plane component, so maybe this only shows up at this scale?

I have a job that runs every minute and diffs the IPs for a handful of endpoints, per destination controller pod and per EndpointSlice. During the outage this job was the canary in the coal mine: it stopped logging, most likely because the diagnostics command hung. The destination controllers stopped logging anything at all (they usually complain about `Failed to get node topology zone for node` on a regular basis).
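
For anyone who wants to set up a similar canary, a rough sketch of that kind of check is below. The service name, namespace, and port are placeholders, and the exact column layout of the CLI output may differ between versions, so treat this as a starting point rather than a drop-in script:

```
# What the destination controller would hand out for this authority...
linkerd diagnostics endpoints my-svc.my-namespace.svc.cluster.local:8080 > /tmp/linkerd-endpoints.txt

# ...versus what Kubernetes itself lists in the EndpointSlices for the same service.
kubectl -n my-namespace get endpointslices -l kubernetes.io/service-name=my-svc \
  -o jsonpath='{range .items[*].endpoints[*]}{.addresses[0]}{"\n"}{end}' > /tmp/k8s-endpoints.txt

# A persistent diff between the two suggests the destination controller's view has gone stale.
diff <(awk 'NR>1 {print $2}' /tmp/linkerd-endpoints.txt | sort) <(sort /tmp/k8s-endpoints.txt)
```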

I'm going to downgrade to 2.12 as soon as possible.

risingspiral commented 1 year ago

Hey folks - thanks for the reports here! Linkerd maintainers are actively investigating multiple things raised in this thread. Thanks to @nikskiz for the repro steps. We're using these to investigate the loadshed behavior in the proxy and determine its root cause.

For the issues around destination controller stability - this is independent of the loadshed behavior. @alpeb will be following up to gather more information to determine what may be at play. More soon!

alpeb commented 1 year ago

I'm happy to report that the aggressive load shedding behavior has just been fixed in edge-23.8.1, released yesterday, and will be back-ported into stable-2.13.6. Now let's focus on the destination cache staleness issue. While investigating this I found it's possible that the `outbound_http_balancer_endpoints` metric has a bug, so let's disregard that one for now. @bjoernw's script above should help catch any discrepancies. Other things to track are the `endpoints_cache_size` and `endpointslices_cache_size` gauges on the Destination containers, which should match exactly the total number of endpoints in the cluster. I'm still trying to find a scenario that replicates this discrepancy. If you have debug logs from Destination under those circumstances, those would be helpful as well.
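
A quick way to eyeball those gauges (a sketch; metric and label names may shift between versions, so grep loosely) is to pull the control plane admin metrics with the CLI and compare them against the Endpoints/EndpointSlice objects in the cluster:

```
# Dump control-plane metrics and pull out the destination controller's cache gauges.
linkerd diagnostics controller-metrics | grep -E '(endpoints|endpointslices)_cache_size'

# Rough comparison points: how many Endpoints and EndpointSlice objects exist cluster-wide.
kubectl get endpoints -A --no-headers | wc -l
kubectl get endpointslices -A --no-headers | wc -l
```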

hadican commented 1 year ago

Thank you so much! @alpeb

hawkw commented 1 year ago

Hi folks, I just wanted to chime in about edge-23.8.1, which includes the fix for a regression that made proxies much more likely to shed load under moderate load (linkerd/linkerd2-proxy#2449). We believe that change should resolve the issues described in this thread — it certainly resolves the issue reproduced by @nikskiz in https://github.com/linkerd/linkerd2/issues/11055#issuecomment-1650957357. Therefore, if anyone has the opportunity to test the edge release in a staging/testing environment, we'd love confirmation that there are no additional issues going on here.

In any case, the fix will make it into stable-2.13.6 later this week.

vkperumal commented 1 year ago

@risingspiral We are still seeing the issue after upgrading to 2.13.6:

[  1120.099428s]  INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.25.53:51902
[  1120.099830s]  INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.25.53:51900
[  1120.100077s]  INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.25.53:51898
[  1120.100493s]  INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.25.53:51892
[  1120.100737s]  INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.25.53:51882
[  1120.100947s]  INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.25.53:51794
[  1120.101318s]  INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.25.53:51796
[  1120.101548s]  INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.25.53:51782
[  1120.101791s]  INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.25.53:51778
[  1120.102012s]  INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.25.53:51804
[  1240.260261s]  INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.25.53:42012
tgolly commented 1 year ago

I upgraded to 2.13.6 to try to resolve the endpoint failures, and I'm still seeing tons of these `Failed to get node topology zone for node` errors on the destination controller, like @bjoernw was seeing. This is under load.

2023-08-23T16:15:28+10:00 time="2023-08-23T06:15:28Z" level=error msg="Failed to get node topology zone for node : nodes \"\" not found" addr=":8086" component=endpoint-translator remote="127.0.0.1:49300" service="service-xxx.ns.svc.cluster.local:8080"
2023-08-23T16:16:00+10:00 time="2023-08-23T06:16:00Z" level=error msg="Failed to get node topology zone for node : nodes \"\" not found" addr=":8086" component=endpoint-translator remote="127.0.0.1:60264" service="service-xxx.ns.svc.cluster.local:8080"
2023-08-23T16:16:33+10:00 time="2023-08-23T06:16:33Z" level=error msg="Failed to get node topology zone for node : nodes \"\" not found" addr=":8086" component=endpoint-translator remote="127.0.0.1:40158" service="service-xxx.ns.svc.cluster.local:8080"
2023-08-23T16:17:05+10:00 time="2023-08-23T06:17:05Z" level=error msg="Failed to get node topology zone for node : nodes \"\" not found" addr=":8086" component=endpoint-translator remote="127.0.0.1:53556" service="service-xxx.ns.svc.cluster.local:8080"
2023-08-23T16:17:38+10:00 time="2023-08-23T06:17:38Z" level=error msg="Failed to get node topology zone for node : nodes \"\" not found" addr=":8086" component=endpoint-translator remote="127.0.0.1:36490" service="service-xxx.ns.svc.cluster.local:8080"
2023-08-23T16:18:10+10:00 time="2023-08-23T06:18:10Z" level=error msg="Failed to get node topology zone for node : nodes \"\" not found" addr=":8086" component=endpoint-translator remote="127.0.0.1:60976" service="service-xxx.ns.svc.cluster.local:8080"
2023-08-23T16:18:43+10:00 time="2023-08-23T06:18:43Z" level=error msg="Failed to get node topology zone for node : nodes \"\" not found" addr=":8086" component=endpoint-translator remote="127.0.0.1:34732" service="service-xxx.ns.svc.cluster.local:8080"
2023-08-23T16:19:16+10:00 time="2023-08-23T06:19:16Z" level=error msg="Failed to get node topology zone for node : nodes \"\" not found" addr=":8086" component=endpoint-translator remote="127.0.0.1:38158" service="service-xxx.ns.svc.cluster.local:8080"
wmorgan commented 1 year ago

Thanks for the update @tgolly @vkperumal. This is still on our radar.

tgolly commented 1 year ago

Those "Failed to get node topology zone" errors calmed down after a couple of days. Not seeing them anymore.

So is this issue considered resolved in 2.13.6?

risingspiral commented 1 year ago

Hey folks - thanks for all the help on this issue. For now we're marking this as resolved, since the regression between 2.12 and 2.13 that led to premature loadshed due to a small queue size has been addressed. @vkperumal if you're still seeing errors, it would be good to understand the total request volume the service is seeing when errors are encountered and what the `outbound_http_error_total` metric is reporting. We can re-open this issue or start with a fresh one.
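
If it helps with that follow-up, one way to capture both numbers from an affected workload is sketched below; the namespace and pod name are placeholders, and exact metric names can vary between proxy versions, so grep broadly:

```
# Scrape the proxy's Prometheus metrics from the affected pod and look at outbound
# request volume and outbound HTTP error counters around the time of the failures.
linkerd diagnostics proxy-metrics -n my-namespace po/my-scheduler-pod \
  | grep -E 'request_total\{direction="outbound"|outbound_http'
```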

@bjoernw for destination controller reliability and consistency under high endpoint churn we'll track investigation on https://github.com/linkerd/linkerd2/issues/11279