Since our upgrade to 2.13.4 we've been experiencing the same errors across our cluster, resulting in transient 503s returned to clients. It looks like the same or a very similar problem to the one reported in https://linkerd.buoyant.io/t/503-service-unavailable-from-proxy-with-large-number-of-connections/126. It doesn't affect any service in particular; the whole mesh is affected randomly during peak load. There's no throttling or resource constraint issue, and we're not using service profiles, Linkerd's circuit breaking, or dynamic routing.
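For context, the debug output below was gathered by turning up the proxy log level on the affected client workload. A minimal sketch of how to do that, assuming the standard Linkerd proxy-log-level annotation and placeholder namespace/deployment names:

# Set the proxy log level on the pod template (this triggers a rollout),
# then follow the proxy logs on the client pod.
kubectl -n my-namespace patch deploy/my-client --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-log-level":"warn,linkerd=debug"}}}}}'
kubectl -n my-namespace logs -f deploy/my-client -c linkerd-proxy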
The logs of the affected clients are filled with information, but I don't see an answer as to why the proxy is marking services as unavailable:
2023-06-28T10:52:48Z [ 94126.531685s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:36056}:proxy{addr=10.100.109.178:80}:http: linkerd_stack::failfast: Service has become unavailable
2023-06-28T10:52:48Z [ 94126.531675s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}:orig-proto-upgrade: linkerd_proxy_http::orig_proto: Upgrading request version=HTTP/1.1 absolute_form=false
2023-06-28T10:52:48Z [ 94126.531668s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}:orig-proto-upgrade: linkerd_proxy_http::client: headers={"host": "service-b.service-b", "user-agent": "Go-http-client/1.1", "content-length": "3081", "content-type": "application/json", "traceparent": "00-083c196cf720aab5452406dfa155fda3-19bc1b8cdb26cebd-00", "service": "service-a", "version": "master-20230627T083604Z-5f42ab", "accept-encoding": "gzip"}
2023-06-28T10:52:48Z [ 94126.531665s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}:orig-proto-upgrade: linkerd_proxy_http::client: method=POST uri=http://service-b.service-b/v1/models/paccept:predict version=HTTP/1.1
2023-06-28T10:52:48Z [ 94126.531657s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}: linkerd_proxy_http::classify::channel: state=Some(State { classify: Http(StatusRanges([500..=599])), tx: Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x7faf3e17df00, tail_position: 0 }, semaphore: Semaphore { semaphore: Semaphore { permits: 100 }, bound: 100 }, rx_waker: AtomicWaker, tx_count: 43, rx_fields: "..." } } } })
2023-06-28T10:52:48Z [ 94126.531641s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:36094}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.64.39:8501}:orig-proto-upgrade: linkerd_proxy_http::orig_proto: Upgrading request version=HTTP/1.1 absolute_form=false
2023-06-28T10:52:48Z [ 94126.531634s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:36094}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.64.39:8501}:orig-proto-upgrade: linkerd_proxy_http::client: headers={"host": "service-b.service-b", "user-agent": "Go-http-client/1.1", "content-length": "22912", "content-type": "application/json", "traceparent": "00-23856363099ba7654f627b8ab7e2905a-9328a2c69cd37c13-00", "service": "service-a", "version": "master-20230627T083604Z-5f42ab", "accept-encoding": "gzip"}
2023-06-28T10:52:48Z [ 94126.531630s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:36094}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.64.39:8501}:orig-proto-upgrade: linkerd_proxy_http::client: method=POST uri=http://service-b.service-b/v1/models/paccept:predict version=HTTP/1.1
2023-06-28T10:52:48Z [ 94126.531620s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:36094}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.64.39:8501}: linkerd_proxy_http::classify::channel: state=Some(State { classify: Http(StatusRanges([500..=599])), tx: Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x7faf3e17e400, tail_position: 0 }, semaphore: Semaphore { semaphore: Semaphore { permits: 100 }, bound: 100 }, rx_waker: AtomicWaker, tx_count: 23, rx_fields: "..." } } } })
2023-06-28T10:52:48Z [ 94126.531598s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52608}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.60.248:8501}:orig-proto-upgrade: linkerd_proxy_http::orig_proto: Upgrading request version=HTTP/1.1 absolute_form=false
2023-06-28T10:52:48Z [ 94126.531564s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52608}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.60.248:8501}:orig-proto-upgrade: linkerd_proxy_http::client: headers={"host": "service-b.service-b", "user-agent": "Go-http-client/1.1", "content-length": "94320", "content-type": "application/json", "traceparent": "00-a638ca610eaefb86a771ebaaa411f072-9b465174c9c12082-00", "service": "service-a", "version": "master-20230627T083604Z-5f42ab", "accept-encoding": "gzip"}
2023-06-28T10:52:48Z [ 94126.531559s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52608}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.60.248:8501}:orig-proto-upgrade: linkerd_proxy_http::client: method=POST uri=http://service-b.service-b/v1/models/paccept:predict version=HTTP/1.1
2023-06-28T10:52:48Z [ 94126.531541s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52608}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.60.248:8501}: linkerd_proxy_http::classify::channel: state=Some(State { classify: Http(StatusRanges([500..=599])), tx: Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x7faf3e17da00, tail_position: 0 }, semaphore: Semaphore { semaphore: Semaphore { permits: 100 }, bound: 100 }, rx_waker: AtomicWaker, tx_count: 24, rx_fields: "..." } } } })
2023-06-28T10:52:48Z [ 94126.531466s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52548}: linkerd_app_core::serve: Connection closed
2023-06-28T10:52:48Z [ 94126.531444s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52548}:proxy{addr=10.100.109.178:80}:http: linkerd_proxy_http::server: The client is shutting down the connection res=Ok(())
2023-06-28T10:52:48Z [ 94126.531357s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:41748}: linkerd_app_core::serve: Connection closed
2023-06-28T10:52:48Z [ 94126.531314s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:41748}:proxy{addr=10.100.109.178:80}:http: linkerd_proxy_http::server: The stack is tearing down the connection
2023-06-28T10:52:48Z [ 94126.531265s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:35784}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}: linkerd_app_outbound::http::handle_proxy_error_headers: Remote proxy error
2023-06-28T10:52:48Z [ 94126.531237s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:35776}:proxy{addr=10.100.109.178:80}:http: linkerd_proxy_http::classify::channel: sending EOS to classify
2023-06-28T10:52:48Z [ 94126.531232s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:35776}:proxy{addr=10.100.109.178:80}:http: linkerd_proxy_http::classify::channel: dropping ResponseBody
2023-06-28T10:52:48Z [ 94126.531214s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}: linkerd_app_outbound::http::handle_proxy_error_headers: Remote proxy error
2023-06-28T10:52:48Z [ 94126.531173s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:35720}:proxy{addr=10.100.109.178:80}:http: linkerd_proxy_http::classify::channel: sending EOS to classify
2023-06-28T10:52:48Z [ 94126.531165s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:35720}:proxy{addr=10.100.109.178:80}:http: linkerd_proxy_http::classify::channel: dropping ResponseBody
2023-06-28T10:52:48Z [ 94126.531148s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:35766}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}: linkerd_app_outbound::http::handle_proxy_error_headers: Remote proxy error
2023-06-28T10:52:48Z [ 94126.531121s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}: linkerd_stack::failfast: Service has become unavailable
2023-06-28T10:52:48Z [ 94126.531109s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}:orig-proto-upgrade: linkerd_proxy_http::orig_proto: Upgrading request version=HTTP/1.1 absolute_form=false
2023-06-28T10:52:48Z [ 94126.531105s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}:orig-proto-upgrade: linkerd_proxy_http::client: headers={"host": "service-b.service-b", "user-agent": "Go-http-client/1.1", "content-length": "52431", "content-type": "application/json", "traceparent": "00-8c14a06804f62ea8bb6f352ee3f4f0e4-4acb6b9ebed60214-00", "service": "service-a", "version": "master-20230627T083604Z-5f42ab", "accept-encoding": "gzip"}
2023-06-28T10:52:48Z [ 94126.531102s] DEBUG ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:service{ns= name=service port=0}:endpoint{addr=172.16.81.177:8501}:orig-proto-upgrade: linkerd_proxy_http::client: method=POST uri=http://service-b.service-b/v1/models/paccept:predict version=HTTP/1.1
2023-06-28T10:52:48Z [ 94126.530991s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:35766}:proxy{addr=10.100.109.178:80}:http:service{ns= name=service port=0}:endpoint{addr=172.16.64.39:8501}: linkerd_proxy_http::classify::channel: state=Some(State { classify: Http(StatusRanges([500..=599])), tx: Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x7faf3e17e400, tail_position: 0 }, semaphore: Semaphore { semaphore: Semaphore { permits: 100 }, bound: 100 }, rx_waker: AtomicWaker, tx_count: 22, rx_fields: "..." } } } })
2023-06-28T10:52:48Z [ 94126.530792s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52548}:proxy{addr=10.100.109.178:80}:http: linkerd_app_core::errors::respond: Handling error on HTTP connection status=503 Service Unavailable version=HTTP/1.1 close=true
2023-06-28T10:52:48Z [ 94126.530791s] DEBUG ThreadId(02) outbound: linkerd_app_core::serve: Connection closed
2023-06-28T10:52:48Z [ 94126.530785s] INFO ThreadId(03) outbound:accept{client.addr=172.16.85.145:52548}:proxy{addr=10.100.109.178:80}:http:rescue{client.addr=172.16.85.145:52548}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.100.109.178:80: service unavailable error.sources=[service unavailable]
2023-06-28T10:52:48Z [ 94126.530775s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52548}:proxy{addr=10.100.109.178:80}:http: linkerd_stack::loadshed: Service shedding load
2023-06-28T10:52:48Z [ 94126.530767s] DEBUG ThreadId(03) outbound:accept{client.addr=172.16.85.145:52548}:proxy{addr=10.100.109.178:80}:http: linkerd_stack::loadshed: Service has become unavailable
Metrics are not helpful either: according to them, everything is working fine, but it isn't. The proxy does not count those 503s as failed requests, so they're not visible in viz or Prometheus. The only real clue comes from transport-level metrics, where we see the same loadshed errors that are visible in the debug logs, e.g. from the same client pod as the logs above.
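For anyone who wants to pull the same counters directly from an affected proxy, a rough sketch, assuming the linkerd CLI is installed and using placeholder pod/namespace names:

# Dump the proxy's Prometheus metrics and look for loadshed/failfast errors.
linkerd diagnostics proxy-metrics -n my-namespace po/my-client-pod \
  | grep -Ei 'loadshed|failfast'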
Sadly I don't have reproduction instructions, but something must have changed between 2.12 and 2.13 that causes this odd behaviour.
@someone-stole-my-name What version were you running before?
@MarkSRobinson Issues started for me after upgrading from 2.11.1 to 2.13.4. Issue still exists on 2.13.5 as well
@someone-stole-my-name What version were you running before?
2.12.4
We've reverted just the data plane to the previous version and things are stable again, so far. We replaced some core components first, and during the rollback we noticed something that may give a clue about where to look. Reverting only the outbound proxies doesn't fix the issue: you still get slightly different errors from the v2.12 proxy, but the same outcome, with the service marked as unavailable and a 503 returned to the client:
2023-06-28T15:32:28Z [ 512.908510s] WARN ThreadId(02) outbound:proxy{addr=10.100.109.178:80}: linkerd_reconnect: Service failed error=channel closed
2023-06-28T15:32:28Z [ 512.908502s] WARN ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:rescue{client.addr=172.16.61.121:38392}: linkerd_app_outbound::http::server: Unexpected error error=operation was canceled: connection closed error.sources=[connection closed]
2023-06-28T15:32:28Z [ 512.908498s] INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:rescue{client.addr=172.16.61.121:38392}: linkerd_app_core::errors::respond: Request failed error=operation was canceled: connection closed error.sources=[connection closed]
2023-06-28T15:32:28Z [ 512.908460s] INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:rescue{client.addr=172.16.61.121:38392}: linkerd_app_core::errors::respond: Request failed error=operation was canceled: connection closed error.sources=[connection closed]
2023-06-28T15:32:28Z [ 512.854087s] INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z [ 512.854013s] INFO ThreadId(02) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z [ 512.854006s] INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z [ 512.853958s] INFO ThreadId(02) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z [ 512.853927s] INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z [ 512.853877s] INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z [ 512.853843s] INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z [ 512.853806s] INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z [ 512.853762s] INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z [ 512.853718s] INFO ThreadId(02) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z [ 512.853714s] INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z [ 512.853647s] INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z [ 512.853594s] INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z [ 512.853582s] INFO ThreadId(02) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z [ 512.853507s] INFO ThreadId(02) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:32:28Z [ 512.853455s] INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:56038: server: 172.16.81.177:8501: server 172.16.81.177:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
2023-06-28T15:25:42Z [ 106.568334s] WARN ThreadId(02) outbound:proxy{addr=10.100.109.178:80}: linkerd_reconnect: Service failed error=channel closed
2023-06-28T15:25:42Z [ 106.568300s] WARN ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:rescue{client.addr=172.16.61.121:39856}: linkerd_app_outbound::http::server: Unexpected error error=operation was canceled: connection closed error.sources=[connection closed]
2023-06-28T15:25:42Z [ 106.568295s] INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:rescue{client.addr=172.16.61.121:39856}: linkerd_app_core::errors::respond: Request failed error=operation was canceled: connection closed error.sources=[connection closed]
2023-06-28T15:25:42Z [ 106.568265s] INFO ThreadId(03) outbound:proxy{addr=10.100.109.178:80}:rescue{client.addr=172.16.61.121:39856}: linkerd_app_core::errors::respond: Request failed error=operation was canceled: connection closed error.sources=[connection closed]
2023-06-28T15:25:42Z [ 106.525745s] INFO ThreadId(02) outbound:proxy{addr=10.100.109.178:80}: linkerd_app_outbound::http::proxy_connection_close: Closing application connection for remote proxy error=client 172.16.61.121:38646: server: 172.16.64.39:8501: server 172.16.64.39:8501: service service-b.service-b.svc.cluster.local:80: service unavailable
Reverting the inbound proxy to 2.12 apparently fixes this issue. Here are some logs from the inbound side while still using 2.13.4. Notice how the pattern changes around 2023-06-28T15:32:28Z [195845.387669s], then all connections are dropped and established again around 2023-06-28T15:32:28Z [195845.401767s].
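For anyone wanting to try the same kind of partial, per-workload rollback, this is roughly how a single workload's sidecar can be pinned to an older proxy while the rest of the mesh stays on 2.13; a sketch, with placeholder names, and the tag value is whatever 2.12 proxy version you are rolling back to:

# Pin this workload's proxy image tag via a pod-template annotation;
# the patch triggers a rollout so the new pods pick up the old proxy.
kubectl -n my-namespace patch deploy/my-workload --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-version":"stable-2.12.5"}}}}}'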
@vkperumal Can you tell if the target services are under high load when those errors occur, as reported by @someone-stole-my-name?
@alpeb My application is a scheduler which makes outbound HTTP calls to other apps. There is no load on the application.
We're having the same errors after upgrading to 2.13.5, and the only load was communication between the services in our application. The errors were fixed after downgrading to 2.12.5. (For some reason our external Prometheus metrics also stopped working after upgrading linkerd-viz to 2.13.5, but started working again after downgrading to 2.12.5.)
Also experiencing the same issue - it seems to have been introduced in 2.13.0, as I switched between 2.12.5 and 2.13.0 today and saw no issues in the former.
It also very specifically occurs when making over 100 concurrent outgoing requests. Any fewer and everything works as normal; any more (even 101) causes the same errors described in the original issue and crashes the respective client pod.
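The 100-request threshold is easy to poke at with a loop like the one below, run from inside the meshed client pod; a sketch against a placeholder in-mesh URL, not the exact test we ran:

# Fire N concurrent requests through the outbound proxy and tally status codes.
# With N=100 everything returns 200 for us; with N=101 the 503s start.
N=101
seq "$N" | xargs -P "$N" -n1 -I{} \
  curl -s -o /dev/null -w '%{http_code}\n' http://my-service.my-namespace/healthz \
  | sort | uniq -c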
We are also experiencing unpredictable connection issues between our services since upgrading from 2.12.5 to 2.13.5 - there is no heavy load on our apps at the moment, and we still can't figure out what could be causing this.
WARN ThreadId(01) outbound:proxy{addr=10.0.187.210:80}:rescue{client.addr=10.1.17.12:41126}: linkerd_app_outbound::http::server: Unexpected error error=logical service 10.0.187.210:80: default service.: endpoint 10.1.17.12:80: operation was canceled: connection was not ready error.sources=[default service.: endpoint 10.1.17.12:80: operation was canceled: connection was not ready, endpoint 10.1.17.12:80: operation was canceled: connection was not ready, operation was canceled: connection was not ready, connection was not ready]
[ 7750.378237s] INFO ThreadId(01) outbound:proxy{addr=10.0.187.210:80}:rescue{client.addr=10.1.17.12:41126}: linkerd_app_core::errors::respond: gRPC request failed error=logical service 10.0.187.210:80: default service.: endpoint 10.1.17.12:80: operation was canceled: connection was not ready error.sources=[default service.: endpoint 10.1.17.12:80: operation was canceled: connection was not ready, endpoint 10.1.17.12:80: operation was canceled: connection was not ready, operation was canceled: connection was not ready, connection was not ready]
INFO ThreadId(01) outbound:proxy{addr=10.0.187.210:80}:service{ns= name=service port=0}:endpoint{addr=10.1.17.12:80}:rescue{client.addr=10.1.17.12:41126}: linkerd_app_core::errors::respond: gRPC request failed error=endpoint 10.1.17.12:80: operation was canceled: connection was not ready error.sources=[operation was canceled: connection was not ready, connection was not ready]
I'm having this same issue on a cluster with several hundred meshed pods. I'm on 2.13.4. I worry that it's an out-of-date endpoints cache issue that is causing the whole service to be deemed unavailable. The only way to recover from this issue is to restart all the proxies that are logging this issue.
I worry that it's an out-of-date endpoints cache issue
Similar to other posters, we have run into this issue and reported it elsewhere (see https://linkerd.buoyant.io/t/linkerd-destination-has-stale-data-in-endpoint-cache/171). @bjoernw, we also think it is a stale endpoints cache, but we found that we could recover from incidents of it by restarting the destination controllers (presumably this sends a RESET to the affected proxies, and that clears their caches?).
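For reference, the recovery step we use when this happens is just a rollout restart of the destination controllers, assuming the default linkerd namespace and deployment name:

# Restart the destination controllers; affected proxies re-resolve and
# their balancer state clears shortly afterwards.
kubectl -n linkerd rollout restart deploy/linkerd-destination
kubectl -n linkerd rollout status deploy/linkerd-destination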
We could not reproduce it after ~2 weeks of trying, but our suspicion is that what happens is:
1. A cache (distribute's BackendCache, maybe?) in the proxy fills up with balancer endpoints that are pending (whether through a cache without aggressive enough eviction or some other reason).
2. Endpoints move into a pending state, and the service moves into a loadshed state after going into failfast.
3. When the proxy is in loadshed it starts to consider alternate backends & endpoints that are in the affected cache.
4. The endpoints in the cache which are pending are then each considered to try and find one available.
5. The ready endpoints are never hit upon or evicted from the cache????
6. The proxy never considers the ready endpoints, so the loadshedding // failfast never recovers.
7. Restarting the destination controller causes a RESET of the endpoint caches in the proxies, which moves the state out of loadshed // failfast.
Of course, this is all conjecture: without a reproduction it is really hard to prove any of this is happening, or to get the necessary log info to track down which cache is failing in what way. Since we don't log at debug level by default, all we have are our metrics, which indicate that outbound_http_balancer_endpoints had a large number of pending endpoints during the incident, some correlation that stale IPs were being accessed around the time of the incident (from our CNI) which resolved directly after we restarted the destination controller, and a huge rise in loadshed errors from outbound_http_errors_total.
Correlation between loadshedding behavior and stale IPs (note that stale IPs were being accessed before the major errors start, which indicates that some bad endpoints in the cache were being accessed on a relatively frequent basis, but not considered for routing HTTP requests until something -- a rise in latency or a few 504s -- triggered failfast // loadshedding behavior):
Relationship between loadshedding errors and endpoints:
(It is worth noting that the divot in the graph is where we attempted to restart the affected inbound proxies // pods, but were unable to recover from the situation as we did not restart the nginx pods which were sending the outbound traffic to the affected deployment. At the end, the ready endpoints spike up as we considered that it was an availability issue and provisioned more pods (to no avail), and then finally the errors resolve when we restart the destination controller.)
Hopefully someone can come up with a reproduction, but we are going to downgrade to 2.12.5 for the time being.
@jandersen-plaid Great write-up, makes sense to me. There have been a couple of issues related to endpoints being out of date recently: https://github.com/linkerd/linkerd2/issues/10003 and https://github.com/linkerd/linkerd2/issues/8956
I'm going to set up a job that continuously monitors the diff between the k8s Endpoints API and Linkerd's diagnostics command. I just ran this during one of our canary deploys and I did see some IPs in the Linkerd list that weren't in the endpoints list for a few seconds. That's expected since this is eventually consistent, but I'm going to be looking for the scenario where this list doesn't reconcile over time for a specific destination pod.
I see you're also having this issue primarily on your ingress controller. That's usually where it starts for us but then it spreads randomly around to other service-to-service calls as well. I guess the one thing they all have in common is the cache in the destination controller...
If we run this once a minute we should be able to see when a certain pod's cache doesn't seem to recover.
#!/bin/bash
# Compare the EndpointSlice view from the Kubernetes API with what each
# Linkerd destination controller reports via `linkerd diagnostics endpoints`.
NAMESPACE=my-namespace
ENDPOINT=my-service
PORT=3000

# IPs according to the Kubernetes EndpointSlices for the service.
ENDPOINTS=$(kubectl get endpointslice -n "$NAMESPACE" -l kubernetes.io/service-name="$ENDPOINT" -o json | jq -r '.items[].endpoints[].addresses[]' | sort)

# All destination controller pods.
LINKERD_DESTINATION_PODS=$(kubectl get pods -n linkerd -l linkerd.io/control-plane-component=destination -o jsonpath='{.items[*].metadata.name}')

for POD in $LINKERD_DESTINATION_PODS
do
  # IPs according to this destination controller's view of the service.
  LINKERD_DIAGNOSTICS=$(linkerd diagnostics endpoints --destination-pod "$POD" "$ENDPOINT.$NAMESPACE.svc.cluster.local:$PORT" | awk 'NR>1 {print $2}' | sort)
  ONLY_IN_ENDPOINTS=$(comm -23 <(echo "$ENDPOINTS") <(echo "$LINKERD_DIAGNOSTICS"))
  ONLY_IN_LINKERD=$(comm -13 <(echo "$ENDPOINTS") <(echo "$LINKERD_DIAGNOSTICS"))

  if [[ -z $ONLY_IN_ENDPOINTS && -z $ONLY_IN_LINKERD ]]
  then
    echo "Both IP sets are identical for Pod $POD."
  else
    if [[ -n $ONLY_IN_ENDPOINTS ]]
    then
      echo "IPs only in EndpointSlices for Pod $POD:"
      echo "$ONLY_IN_ENDPOINTS"
    fi
    if [[ -n $ONLY_IN_LINKERD ]]
    then
      echo "IPs only in Linkerd Diagnostics for Pod $POD:"
      echo "$ONLY_IN_LINKERD"
    fi
  fi
done
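In case it helps anyone else, a trivial way to run that check on a one-minute cadence without setting up a CronJob; a sketch, assuming the script above is saved as endpoint-diff.sh:

# Re-run the diff every 60s and timestamp the output so gaps or hangs
# (e.g. the diagnostics command not returning) are easy to spot.
while true; do
  date -u +"%Y-%m-%dT%H:%M:%SZ"
  ./endpoint-diff.sh
  sleep 60
done | tee -a endpoint-diff.log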
@jandersen-plaid Are you seeing any endpoint update failures in your k8s events around the time of the issue?
I'm seeing an elevated rate of these on the problematic service:
1461 FailedToUpdateEndpoint: Failed to update endpoint my-namespace/my-service: Operation cannot be fulfilled on endpoints "my-service": the object has been modified; please apply your changes to the latest version and try again
@jandersen-plaid Are you seeing any endpoint update failures in your k8s events around the time of the issue?
I'm seeing an elevated rate of these on the problematic service:
1461 FailedToUpdateEndpoint: Failed to update endpoint my-namespace/my-service: Operation cannot be fulfilled on endpoints "my-service": the object has been modified; please apply your changes to the latest version and try again
Hmmm is this coming from the endpoint controller or linkerd? On the endpoint controller side, we do see this error (a 409 in api server logs because of the resource conflict) around the incident, but it also happens rather frequently in the cluster outside of the incident window as well. We may have seen it on the linkerd side during the incident, but some of our logs from the most recent occurrence have reduced granularity at this point, so I can't confirm that.
@jandersen-plaid that's from the endpoint controller. In the past I had suspected that throttling at the API server level was causing the endpoints API itself to be out of sync with the current set of Pods hence leading the linkerd cache to be out of sync but it seems like the problem lies between endpoints and linkerd. I'm running a version of that script above in the cluster now with automatic restarts of the destination controller and extensive logging. Will report back if I find anything interesting.
We have experienced the same issue, and rolling back the proxy to linkerd2 stable-2.12.x worked. When observing the exposed Linkerd metrics we could see the "loadshed" error, as outlined above by @jandersen-plaid.
This is easily reproduced by following the steps below:
1. Create the k6 test script
cat <<EOF > k6s-test.js
// import necessary modules
import { check } from "k6";
import http from "k6/http";

// define configuration
export const options = {
  // define thresholds
  thresholds: {
    http_req_failed: ['rate<0.01'], // http errors should be less than 1%
    http_req_duration: ['p(99)<1000'], // 99% of requests should be below 1s
  },
  // define scenarios
  scenarios: {
    // arbitrary name of scenario
    average_load: {
      executor: "ramping-vus",
      stages: [
        { duration: "10s", target: 500 },
      ],
    },
  }
};

export default function () {
  // define URL and request body
  const url = "http://localhost:8080/api/vote?choice=:rabbit:";
  const payload = JSON.stringify({ choice: ":rabbit:" });
  const params = {
    headers: {
      "Content-Type": "application/json",
    },
  };

  // send a post request and save response as a variable
  const res = http.post(url, payload, params);

  // check that response is 200
  check(res, {
    "response code was 200": (res) => res.status == 200,
  });
}
EOF
2. Install k8s cluster using kind
kind create cluster
3. Install Linkerd2 CRDS
linkerd install --crds | kubectl apply -f -
4. Install Linkerd2
linkerd install | kubectl apply -f -
5. Install viz
linkerd viz install | kubectl apply -f -
6. Validate installation
linkerd check
7. Install emojivoto
linkerd inject https://run.linkerd.io/emojivoto.yml | kubectl apply -f -
8. Check emojivoto status
kubectl -n emojivoto rollout status deploy/web
9. In a separate terminal port forward to emojivoto web app
kubectl port-forward deployment/web 8080 -n emojivoto
10. In a third terminal, follow the proxy logs on the emojivoto `web` deployment
kubectl logs -f -n emojivoto deployment/web -c linkerd-proxy
11. Run the k6 test and observe the results
k6 run k6s-test.js
Thanks for the reproduction @nikskiz! The results of performing the test on 2.13.4 vs 2.12.5 are staggering:
Note: the difference in requests sent can likely be explained by the faster failure in 2.13.4: less time on requests leads to more iterations, but they shouldn't fail in the first place!
2.13.4:
✗ response code was 200
↳ 32% — ✓ 24137 / ✗ 49549
checks.........................: 32.75% ✓ 24137 ✗ 49549
data_received..................: 13 MB 1.3 MB/s
data_sent......................: 15 MB 1.5 MB/s
http_req_blocked...............: avg=62.37µs min=710ns med=80.02µs max=36.37ms p(90)=115.5µs p(95)=127.52µs
http_req_connecting............: avg=41.34µs min=0s med=53.11µs max=36.32ms p(90)=78.68µs p(95)=87.03µs
✓ http_req_duration..............: avg=33.98ms min=322.75µs med=29.03ms max=126.73ms p(90)=70.71ms p(95)=96.58ms
{ expected_response:true }...: avg=51.81ms min=2.22ms med=47.98ms max=126.73ms p(90)=103.06ms p(95)=108.87ms
✗ http_req_failed................: 67.24% ✓ 49549 ✗ 24137
http_req_receiving.............: avg=37µs min=7.56µs med=26.99µs max=11.63ms p(90)=49.68µs p(95)=56.16µs
http_req_sending...............: avg=27.87µs min=5.24µs med=23.12µs max=20.87ms p(90)=37.21µs p(95)=42.27µs
http_req_tls_handshaking.......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_req_waiting...............: avg=33.91ms min=278.29µs med=28.94ms max=126.68ms p(90)=70.67ms p(95)=96.52ms
http_reqs......................: 73686 7315.819808/s
iteration_duration.............: avg=34.12ms min=400.5µs med=29.19ms max=126.87ms p(90)=70.78ms p(95)=96.76ms
iterations.....................: 73686 7315.819808/s
vus............................: 497 min=48 max=497
vus_max........................: 500 min=500 max=500
2.12.5:
✓ response code was 200
checks.........................: 100.00% ✓ 34525 ✗ 0
data_received..................: 2.6 MB 257 kB/s
data_sent......................: 7.0 MB 698 kB/s
http_req_blocked...............: avg=8.84µs min=770ns med=1.74µs max=41.05ms p(90)=2.25µs p(95)=2.63µs
http_req_connecting............: avg=5.83µs min=0s med=0s max=40.99ms p(90)=0s p(95)=0s
✓ http_req_duration..............: avg=73.06ms min=1.29ms med=73.14ms max=150.3ms p(90)=131.12ms p(95)=139.24ms
{ expected_response:true }...: avg=73.06ms min=1.29ms med=73.14ms max=150.3ms p(90)=131.12ms p(95)=139.24ms
✓ http_req_failed................: 0.00% ✓ 0 ✗ 34525
http_req_receiving.............: avg=23.62µs min=8.43µs med=21.61µs max=4.63ms p(90)=30.34µs p(95)=35.57µs
http_req_sending...............: avg=13.65µs min=5.69µs med=12.05µs max=730.65µs p(90)=19.75µs p(95)=22.18µs
http_req_tls_handshaking.......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_req_waiting...............: avg=73.02ms min=1.26ms med=73.09ms max=150.26ms p(90)=131.09ms p(95)=139.2ms
http_reqs......................: 34525 3421.759353/s
iteration_duration.............: avg=73.15ms min=1.37ms med=73.24ms max=150.39ms p(90)=131.21ms p(95)=139.32ms
iterations.....................: 34525 3421.759353/s
vus............................: 498 min=49 max=498
vus_max........................: 500 min=500 max=500
Clarifying a bit on the particular problem we are seeing in 2.13 (happy to file an additional issue if this is too far off topic): it is not that loadshed warnings appear (because it is likely better to shed some load rather than fail all requests), but that:
1. The loadshed warnings never resolve, because the endpoints they are sending requests to are endpoints that did exist in the cluster but don't any longer.
2. The loadshed is too aggressive and will fail requests even though the backing application is able to respond adequately to the incoming load.
As I understand it, loadshed is not necessarily an error; it just represents a state of the proxy where the proxy is unable to handle all requests adequately and must drop some requests in order to respond to the rest. This is similar to failfast (https://linkerd.buoyant.io/t/what-are-failfast-errors-in-linkerd-and-how-do-you-debug-them/85), except that, instead of failing all requests, the proxy will just fail a specific percentage of them so that the other requests can succeed. It is hard to know, though, because I could not find any documentation on loadshed warnings or the state whatsoever.
@nikskiz your test perfectly encapsulates some really eager request failing (our second problem) within Linkerd.
@jandersen-plaid Thanks for the details on your first issue. Maybe we can still try to replicate it by modifying the load test parameters. We can use the stages in k6 to ramp down the test and try to persist the error during "normal" load. The stages will look as follows:
stages: [
// Produce load on a steady environment to get the loadshed error
{ duration: "10s", target: 500 },
// Ramp down the load to what the proxy can handle and restart the downstream "voting" app to get new endpoints
{ duration: "5s", target: 20 },
// Hold for a 60s and observer
{ duration: "60s", target: 20 },
],
In the first 10 seconds, we should be able to get the failures. Then, over 5 seconds we ramp down to 20 VUs and hold at 20 VUs for another 60s. You should restart the downstream app in the first 10 seconds so that it gets new endpoints (see the restart command below). Assuming the proxy is still sending requests to the old endpoints, we should still see errors during the ramp-down (after 15 seconds).
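A sketch of that restart, using the voting deployment from the emojivoto repro above:

# While the k6 ramp-up is running, roll the voting backend so the web
# pod's outbound balancer sees a fresh set of endpoints.
kubectl -n emojivoto rollout restart deploy/voting
kubectl -n emojivoto rollout status deploy/voting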
I did run this locally and wasn't able to reproduce any errors. I could see the errors disappear when the ramp-down started, even though the downstream app received a new endpoint.
The only way I was able to see the "service unavailable" error again was when I forcefully deleted the voting app endpoint, so that it had no endpoints at all.
This is a huge problem. The destination controller is broken on 2.13.4. Occasionally under load (I guess?) it will stop responding to diagnostics commands and also won't serve up new IPs to proxies, leading to a cluster-wide failure until the controller is restarted. We're running 300+ meshed pods with 5 pods per control plane component, so maybe this only shows up at this scale?
I have a job that runs every minute and diffs the IPs for a handful of endpoints per destination controller pod and per EndpointSlice. During the outage this job was the canary in the coal mine, because it stopped logging - most likely because the diagnostics command hung. The destination controllers stopped logging anything at all (they usually complain about Failed to get node topology zone for node on a regular basis).
I'm going to downgrade to 2.12 as soon as possible.
Hey folks - thanks for the reports here! Linkerd maintainers are actively investigating multiple things raised in this thread. Thanks to @nikskiz for the repro steps. We're using these to investigate the loadshed behavior in the proxy and determine its root cause.
For the issues around destination controller stability - this is independent from the loadshed behavior. @alpeb will be following up to gather more information to determine what may be at play. More soon!
I'm happy to report that the aggressive load shedding behavior has just been fixed in edge-23.8.1 released yesterday, and will be back-ported into stable-2.13.6.
Now let's focus on the destination cache staleness issue. While investigating this I found it's possible that the outbound_http_balancer_endpoints metric has a bug, so let's disregard that one for now. @bjoernw's script above should help catch any discrepancies. Other things to track are the endpoints_cache_size and endpointslices_cache_size gauges on the Destination containers, which should match exactly the total number of endpoints in the cluster. I'm still trying to find a scenario that replicates this discrepancy. If you have debug logs for Destination under those circumstances, that'd be helpful as well.
Thank you so much! @alpeb
Hi folks, I just wanted to chime in about edge-23.8.1, which includes the fix for a regression that made proxies much more likely to shed load under moderate load (linkerd/linkerd2-proxy#2449). We believe that change should resolve the issues described in this thread — it certainly resolves the issue reproduced by @nikskiz in https://github.com/linkerd/linkerd2/issues/11055#issuecomment-1650957357. Therefore, if anyone has the opportunity to test the edge release in your staging/testing environments, we'd love confirmation that there are no additional issues going on here.
In any case, the fix will make it into stable-2.13.6 later this week.
@risingspiral We are still seeing the issue after upgrading to 2.13.6
[ 1120.099428s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.25.53:51902
[ 1120.099830s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.25.53:51900
[ 1120.100077s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.25.53:51898
[ 1120.100493s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.25.53:51892
[ 1120.100737s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.25.53:51882
[ 1120.100947s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.25.53:51794
[ 1120.101318s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.25.53:51796
[ 1120.101548s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.25.53:51782
[ 1120.101791s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.25.53:51778
[ 1120.102012s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.25.53:51804
[ 1240.260261s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.25.53:42012
I have upgraded to 2.13.6 to try to resolve the endpoint failures, and I'm still seeing tons of these Failed to get node topology zone for node errors on the destination controller, like @bjoernw was seeing. This is under load.
2023-08-23T16:15:28+10:00 time="2023-08-23T06:15:28Z" level=error msg="Failed to get node topology zone for node : nodes \"\" not found" addr=":8086" component=endpoint-translator remote="127.0.0.1:49300" service="service-xxx.ns.svc.cluster.local:8080"
2023-08-23T16:16:00+10:00 time="2023-08-23T06:16:00Z" level=error msg="Failed to get node topology zone for node : nodes \"\" not found" addr=":8086" component=endpoint-translator remote="127.0.0.1:60264" service="service-xxx.ns.svc.cluster.local:8080"
2023-08-23T16:16:33+10:00 time="2023-08-23T06:16:33Z" level=error msg="Failed to get node topology zone for node : nodes \"\" not found" addr=":8086" component=endpoint-translator remote="127.0.0.1:40158" service="service-xxx.ns.svc.cluster.local:8080"
2023-08-23T16:17:05+10:00 time="2023-08-23T06:17:05Z" level=error msg="Failed to get node topology zone for node : nodes \"\" not found" addr=":8086" component=endpoint-translator remote="127.0.0.1:53556" service="service-xxx.ns.svc.cluster.local:8080"
2023-08-23T16:17:38+10:00 time="2023-08-23T06:17:38Z" level=error msg="Failed to get node topology zone for node : nodes \"\" not found" addr=":8086" component=endpoint-translator remote="127.0.0.1:36490" service="service-xxx.ns.svc.cluster.local:8080"
2023-08-23T16:18:10+10:00 time="2023-08-23T06:18:10Z" level=error msg="Failed to get node topology zone for node : nodes \"\" not found" addr=":8086" component=endpoint-translator remote="127.0.0.1:60976" service="service-xxx.ns.svc.cluster.local:8080"
2023-08-23T16:18:43+10:00 time="2023-08-23T06:18:43Z" level=error msg="Failed to get node topology zone for node : nodes \"\" not found" addr=":8086" component=endpoint-translator remote="127.0.0.1:34732" service="service-xxx.ns.svc.cluster.local:8080"
2023-08-23T16:19:16+10:00 time="2023-08-23T06:19:16Z" level=error msg="Failed to get node topology zone for node : nodes \"\" not found" addr=":8086" component=endpoint-translator remote="127.0.0.1:38158" service="service-xxx.ns.svc.cluster.local:8080"
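Those errors show an empty node name, which suggests the endpoint translator is seeing endpoints without a resolvable node or zone. A quick sanity check, with a placeholder namespace:

# Do all nodes carry a zone label, and are any endpoints missing a nodeName
# (the errors above show an empty node name)?
kubectl get nodes -L topology.kubernetes.io/zone
kubectl get endpointslices -n my-namespace -o json \
  | jq -r '.items[].endpoints[] | select(.nodeName == null) | .addresses[]'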
Thanks for the update @tgolly @vkperumal. This is still on our radar.
Those "Failed to get node topology zone" errors calmed down after a couple of days. Not seeing them anymore.
So is this issue considered resolved in 2.13.6?
Hey folks - thanks for all the help on this issue. For now we're marking this as resolved, as a regression between 2.12 and 2.13 that led to premature loadshed due to a small queue size has been addressed. @vkperumal if you're still seeing errors it would be good to understand the total request volume the service is seeing when errors are encountered and what the outbound_http_error_total metric is reporting. We can re-open this issue or start with a fresh one.
@bjoernw for destination controller reliability and consistency under high endpoint churn we'll track investigation on https://github.com/linkerd/linkerd2/issues/11279
What is the issue?
Outbound HTTP calls to the application fail intermittently with a connection closed error. The source application from which the HTTP calls are failing is a scheduler that makes regular calls to other applications on a schedule. After a failure, the next scheduled run works fine. Please see the logs below from linkerd-proxy.
How can it be reproduced?
Linkerd Version: 2.13.4
Logs, error output, etc
[129664.685446s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39860
[129664.691331s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39872
[129664.691557s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39878
[129664.692554s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39894
[129664.695277s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39900
[129784.342535s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:42724
[129784.342604s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:42738
[129784.342642s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:42746
[129784.342682s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:42760
output of linkerd check -o short
Environment
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None