linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

linkerd-proxy drops memcached persistent connections. #4689

Closed: heimdull closed this issue 3 years ago

heimdull commented 4 years ago

Bug Report

What is the issue?

We have a few clusters running Linkerd 2.7.1. We tried to upgrade them, but every version after 2.7.1 drops our memcached connections. Our Tomcat containers connect to memcached servers outside the Kubernetes/Linkerd cluster, and after the upgrade these connections are dropped. Rolling back to 2.7.1 resolves the issue.

We tested upgrades to every available version after 2.7.1, and all of them show the same dropped connections.

How can it be reproduced?

A Tomcat container with persistent connections to a memcached host outside the cluster should show the issue.

Logs, error output, etc

shard-cdr-7c6fb9d89d-jlhcd linkerd-debug 2252 31.968026877 10.21.40.53 → 10.42.1.35 TCP 68 11211 → 53312 [FIN, ACK] Seq=1 Ack=1 Win=43776 Len=0 TSval=222176710 TSecr=1190770320
shard-cdr-7c6fb9d89d-jlhcd linkerd-debug 2253 31.968202261 10.21.40.54 → 10.42.1.35 TCP 68 11211 → 37636 [FIN, ACK] Seq=1 Ack=1 Win=43776 Len=0 TSval=222176710 TSecr=1310269368
shard-cdr-7c6fb9d89d-jlhcd linkerd-debug 2809 91.370066554 10.21.40.55 → 10.42.1.35 TCP 56 11211 → 56552 [RST] Seq=2 Win=0 Len=0
shard-cdr-7c6fb9d89d-jlhcd linkerd-debug 2956 93.378264743 10.21.40.55 → 10.42.1.35 TCP 76 11211 → 58108 [SYN, ACK] Seq=0 Ack=1 Win=43690 Len=0 MSS=65495 SACK_PERM=1 TSval=222238121 TSecr=2296207651 WS=128
shard-cdr-7c6fb9d89d-jlhcd linkerd-debug 2972 96.379448285 10.21.40.55 → 10.42.1.35 TCP 68 11211 → 58108 [FIN, ACK] Seq=1 Ack=1 Win=43776 Len=0 TSval=222241122 TSecr=2296207651
shard-cdr-7c6fb9d89d-jlhcd tomcat 2020-06-29 15:26:07.758 INFO net.spy.memcached.MemcachedConnection: Reconnecting due to exception on {QA sa=shard1mem1.dev.youmail.com/10.21.40.55:11211, #Rops=1, #Wops=0, #iq=0, topRop=Cmd: set Key: ym.memc.inspect.560222747 Flags: 0 Exp: 172800 Data Length: 16, topWop=null, toWrite=0, interested=1}
shard-cdr-7c6fb9d89d-jlhcd tomcat 2020-06-29 15:26:07.758 WARN net.spy.memcached.MemcachedConnection: Closing, and reopening {QA sa=shard1mem1.dev.youmail.com/10.21.40.55:11211, #Rops=1, #Wops=0, #iq=0, topRop=Cmd: set Key: ym.memc.inspect.560222747 Flags: 0 Exp: 172800 Data Length: 16, topWop=null, toWrite=0, interested=1}, attempt 0.
shard-cdr-7c6fb9d89d-jlhcd linkerd-debug 6414 392.175887068 10.21.40.55 → 10.42.1.35 TCP 56 11211 → 58108 [RST] Seq=2 Win=0 Len=0
shard-cdr-7c6fb9d89d-jlhcd tomcat 2020-06-29 15:26:09.762 INFO net.spy.memcached.MemcachedConnection: Reconnecting {QA sa=shard1mem1.dev.youmail.com/10.21.40.55:11211, #Rops=0, #Wops=0, #iq=0, topRop=null, topWop=null, toWrite=0, interested=0}
shard-cdr-7c6fb9d89d-jlhcd linkerd-debug 6462 394.180915041 10.21.40.55 → 10.42.1.35 TCP 76 11211 → 36332 [SYN, ACK] Seq=0 Ack=1 Win=43690 Len=0 MSS=65495 SACK_PERM=1 TSval=222538927 TSecr=2296508457 WS=128
shard-cdr-7c6fb9d89d-jlhcd linkerd-debug 6464 397.184152372 10.21.40.55 → 10.42.1.35 TCP 68 11211 → 36332 [FIN, ACK] Seq=1 Ack=1 Win=43776 Len=0 TSval=222541931 TSecr=2296508457

linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust roots are using supported crypto algorithm
√ trust roots are within their validity period
√ trust roots are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust root

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running

linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
    is running version 2.7.1 but the latest stable version is 2.8.1
    see https://linkerd.io/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.7.1 but the latest stable version is 2.8.1
    see https://linkerd.io/checks/#l5d-version-control for hints
√ control plane and cli versions match

Status check results are √

Environment

Possible solution

It looks like the problem started in edge-20.4.2, and the cause might be this change:

- Added a new protocol detection timeout to prevent clients from consuming resources indefinitely when not sending any data

That description fits a persistent memcached connection, which can stay open without sending any data.
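
To illustrate why a detection timeout can affect idle connections, here is a minimal Rust sketch that uses only the standard library. It is not the linkerd2-proxy implementation: the names `detect_with_timeout` and `Detection` are hypothetical, and the 200 ms timeout is an arbitrary value chosen for the demo. The sketch waits a bounded time for the first bytes of a new connection and reports a timeout when the client stays silent, which is the position a freshly opened but idle persistent connection would be in.

```rust
use std::io::{self, ErrorKind};
use std::net::{TcpListener, TcpStream};
use std::time::Duration;

/// Outcome of waiting for the first bytes of a new connection (hypothetical type).
#[derive(Debug, PartialEq)]
enum Detection {
    /// The peer sent data in time; a detector could now inspect these bytes.
    Data(usize),
    /// The peer closed the connection before sending anything.
    Closed,
    /// Nothing arrived before the deadline, so the connection would be dropped.
    TimedOut,
}

/// Waits up to `timeout` for the peer's first bytes without consuming them.
fn detect_with_timeout(stream: &TcpStream, timeout: Duration) -> io::Result<Detection> {
    stream.set_read_timeout(Some(timeout))?;
    let mut buf = [0u8; 64];
    match stream.peek(&mut buf) {
        Ok(0) => Ok(Detection::Closed),
        Ok(n) => Ok(Detection::Data(n)),
        // An expired read timeout surfaces as WouldBlock on Unix and TimedOut on Windows.
        Err(e) if matches!(e.kind(), ErrorKind::WouldBlock | ErrorKind::TimedOut) => {
            Ok(Detection::TimedOut)
        }
        Err(e) => Err(e),
    }
}

fn main() -> io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    // A "persistent" client: it connects and then stays idle.
    let _idle_client = TcpStream::connect(addr)?;
    let (server_side, _) = listener.accept()?;
    let outcome = detect_with_timeout(&server_side, Duration::from_millis(200))?;
    println!("{:?}", outcome);
    Ok(())
}
```

With an idle client, as in `main`, the outcome should be `TimedOut`. A client that writes a command right after connecting would yield `Data` instead.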

Additional context

After a little more digging, I think this was introduced in proxy v2.91.0:

linkerd/app/inbound/src/lib.rs:
                // Limits the amount of time that the TCP server spends waiting for TLS handshake &
                // protocol detection. Ensures that connections that never emit data are dropped
                // eventually.

hawkw commented 4 years ago

I think that configuring the proxy to skip protocol detection for the ports used by memcached should also solve this issue. We may want to add port 11211 (the memcached registered port) to the list of ports that skip protocol detection by default, in case persistent connections are being used?
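
As a rough illustration of what skipping detection by port means, the Rust sketch below decides how to handle a connection from its destination port alone, before any bytes are read. This is not Linkerd's code: `Handling`, `handling_for`, and the skip list are hypothetical names and values. The point is that a connection on a skipped port never waits for initial data, so a detection timeout cannot fire on it.

```rust
use std::collections::HashSet;

/// How a new connection is handled before any bytes are read (hypothetical type).
#[derive(Debug, PartialEq)]
enum Handling {
    /// Wait for initial bytes and sniff the protocol, subject to a timeout.
    Detect,
    /// Forward as opaque TCP immediately; no initial bytes are required.
    Opaque,
}

/// Chooses the handling from the destination port alone.
fn handling_for(port: u16, skip_detection: &HashSet<u16>) -> Handling {
    if skip_detection.contains(&port) {
        Handling::Opaque
    } else {
        Handling::Detect
    }
}

fn main() {
    // Hypothetical skip list; 11211 is the memcached port from this issue.
    let skip: HashSet<u16> = [11211u16].iter().copied().collect();
    println!("{:?}", handling_for(11211, &skip));
    println!("{:?}", handling_for(8080, &skip));
}
```

In a real deployment this would be a configuration setting, not code. Linkerd has port-skipping options such as the `--skip-outbound-ports` flag of `linkerd inject`, which bypasses the proxy for the listed ports; the exact behavior depends on the version, so check the documentation for the release in use.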

ihcsim commented 4 years ago

This might get fixed as part of the upcoming TCP mTLS and server-speak-first work. Let’s check back in then!

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.