linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

linkerd-proxy connection refused/closed/reset errors #5331

Closed: prasus closed this issue 3 years ago

prasus commented 3 years ago

Bug Report

What is the issue?

I am trying to find the root cause of increased connection drops/failures in our Linkerd-injected apps. The failures happen only rarely, but I noticed a lot of connection-related errors in the proxy logs on all our services (Connection closed error=Service in fail-fast, Connection reset by peer (os error 104), Connection refused (os error 111), Transport endpoint is not connected (os error 107), etc.), and I am wondering whether Linkerd has anything to do with the increased failures in our apps. I'm stuck finding the root cause, so I would really appreciate it if someone could explain more about these connection errors.

How can it be reproduced?

I am not able to reproduce the issue. The increased connection failures happen only rarely, but there are a lot of TCP connection closed/refused errors in the proxy logs, as seen below.

Logs, error output, etc

"Dec 2, 2020 @ 06:45:47.043","[ 80905.816738s]  INFO ThreadId(01) outbound:accept{peer.addr=10.244.90.101:39120 target.addr=10.244.57.132:9092}: linkerd2_app_core::serve: Connection closed error=Service in fail-fast"
"Dec 2, 2020 @ 06:45:47.042","[ 80905.241029s]  INFO ThreadId(01) outbound:accept{peer.addr=10.244.90.101:34980 target.addr=10.244.54.131:9092}: linkerd2_app_core::serve: Connection closed error=Service in fail-fast"
"Dec 2, 2020 @ 06:45:51.858","[    10.069981s]  WARN ThreadId(01) inbound:accept{peer.addr=10.244.33.105:37338 target.addr=10.244.33.169:80}: linkerd2_app_core::errors: Failed to proxy request: error trying to connect: Connection refused (os error 111)"
"Dec 2, 2020 @ 06:45:51.858","[    11.203328s]  WARN ThreadId(01) inbound:accept{peer.addr=10.244.33.1:46326 target.addr=10.244.33.169:80}: linkerd2_app_core::errors: Failed to proxy request: error trying to connect: Connection refused (os error 111)"
"Dec 2, 2020 @ 06:45:51.858","[    22.103038s]  INFO ThreadId(01) outbound:accept{peer.addr=10.244.33.169:49660 target.addr=10.244.59.53:9092}: linkerd2_app_core::serve: Connection closed error=Service in fail-fast"
[  4784.452349s]  INFO ThreadId(01) inbound:accept{peer.addr=10.244.44.107:44614 target.addr=10.244.44.106:80}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[  5089.706325s]  INFO ThreadId(01) inbound:accept{peer.addr=10.244.44.107:44636 target.addr=10.244.44.106:80}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[  5124.234815s]  INFO ThreadId(01) inbound:accept{peer.addr=10.244.44.107:46202 target.addr=10.244.44.106:80}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)

[  5318.467170s]  INFO ThreadId(01) outbound:accept{peer.addr=10.244.44.106:60420 target.addr=10.97.182.190:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
[  5329.525274s]  INFO ThreadId(01) outbound:accept{peer.addr=10.244.44.106:59506 target.addr=10.97.182.190:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
[  5331.837249s]  INFO ThreadId(01) outbound:accept{peer.addr=10.244.44.106:58566 target.addr=10.97.182.190:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)

linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

Status check results are √

Environment

Possible solution

Additional context

KIVagant commented 3 years ago

We also noticed thousands of errors after upgrading to 2.9.0. They appeared when we redeployed pods with the newer proxy sidecar. There are no visible application errors, but a lot of proxy errors like this:

[ 75233.174384s]  INFO ThreadId(01) outbound:accept{peer.addr=10.X.X.X:50826 target.addr=10.X.X.X:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)

5432 is the PostgreSQL port. We don't filter out Postgres ports in skip-outbound-ports, and that wasn't necessary with linkerd-proxy 2.7.1. Is it necessary now?
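
For reference, the per-workload way to bypass the proxy for a port is the skip-outbound-ports annotation on the pod template; a minimal sketch, assuming a placeholder client Deployment that talks to Postgres:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app              # placeholder name
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        # Outbound traffic to these ports bypasses the proxy entirely.
        config.linkerd.io/skip-outbound-ports: "5432"
    spec:
      containers:
      - name: my-app
        image: my-app:latest   # placeholder image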

Update: there are some errors related to other non-HTTP services, not only Postgres.

Update 2: there are also many errors from HTTP communications. It looks like the verbosity of the proxy was changed, and we probably need to switch the verbosity to warnings/errors only.
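
If the goal is only to quiet these INFO-level lines, the proxy log level can also be lowered per workload via an annotation; a minimal sketch, assuming a filter that keeps warnings and errors (the exact filter string may need tuning):

template:
  metadata:
    annotations:
      # Raises the proxy's minimum log level so INFO-level
      # "Connection closed" lines are no longer emitted.
      config.linkerd.io/proxy-log-level: "warn,linkerd=warn"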

adleong commented 3 years ago

Is your PostgreSQL service headless? If so, this sounds like https://github.com/linkerd/linkerd2/issues/5209. This issue caused problems when talking to headless services in stable-2.9.0. It has since been fixed and you can try out the latest edge release to see if it resolves the problem. The fix will be included in the stable-2.9.1 patch release.
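
As a quick way to answer that question: a headless Service is one whose clusterIP is None; a minimal check, with placeholder service and namespace names:

# Prints "None" for a headless Service, otherwise a cluster IP.
kubectl get svc my-postgres -n my-namespace -o jsonpath='{.spec.clusterIP}{"\n"}'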

prasus commented 3 years ago

Thank you @adleong 👍🏿 I went through the logs, and it seems like we are seeing these messages for non-headless service targets as well.

[  5351.451279s]  INFO ThreadId(01) outbound:accept{peer.addr=10.244.44.106:48372 target.addr=10.97.182.190:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)

The target.addr (PostgreSQL) is a non-headless service.

Name:              my-svc-name
Namespace:         my-svc-namespace
Labels:            application=spilo
                   spilo-role=master

Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP:                10.97.182.190
Port:              postgresql  5432/TCP
TargetPort:        5432/TCP
Endpoints:         10.244.24.108:5432
Session Affinity:  None
Events:            <none>

I will plan for an update once we have the stable-2.9.1 patch release available and will keep you posted on this. Thanks! 🙂

KIVagant commented 3 years ago

Is your PostgreSQL service headless?

It's RDS; it doesn't live in Kubernetes. In the meantime I added the RDS port and some other ports to the skip list. A Linkerd upgrade requires cluster downtime for us, so we cannot do this too often.

isabey-cogni commented 3 years ago

I am also seeing a good amount of these logs. I upgraded our staging environment to stable-2.9.1 in hopes of resolution, but to no avail.

Here is a sample of logs in one of our linkerd-proxy containers:

[515070.841395s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.141:43732 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[515185.917709s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.141:45990 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[516061.897349s]  INFO ThreadId(01) outbound:accept{peer.addr=10.0.130.174:36508 target.addr=10.0.128.58:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
[516582.706765s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.141:45688 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[517068.129177s]  INFO ThreadId(01) outbound:accept{peer.addr=10.0.130.174:56128 target.addr=10.0.128.58:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
[517373.616350s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.141:32954 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[517488.739828s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.141:35686 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Broken pipe (os error 32)
[517787.344259s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.119:60352 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[517974.381304s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.141:45394 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[518461.897379s]  INFO ThreadId(01) outbound:accept{peer.addr=10.0.130.174:33934 target.addr=10.0.128.58:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
[518703.636162s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.119:50810 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)
[519482.176746s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.141:46966 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Transport endpoint is not connected (os error 107)

Our PostgreSQL instance is through RDS. The error that prompted me to start looking into this was a broken pipe error in our own services, though I'm wondering if the broken pipe log from the proxy is related. I have not been able to reproduce the error in our own services so that I can see it in action.

Hoping that my info that this is still occurring on stable-2.9.1 helps diagnose this.

DaemonDude23 commented 3 years ago

I ran into these errors when deploying Linkerd for the first time recently. I'm not seeing it mentioned here, but two other recent open issues relating to 2.9.1 are very similar to this issue, and supposedly it'll be corrected once https://github.com/linkerd/linkerd2-proxy/pull/782 is released.

olix0r commented 3 years ago

Without a repro config we can apply, it's difficult to know for sure, but there are two recent changes which may have some bearing on this:

Both of these changes will be available in the next edge release -- probably next week.

But, as I said, it's hard for us to know for sure without a configuration that we can test against.

Also, it's not clear to me from the above comments whether the problem is that the proxy is logging these messages or whether these are manifesting as an application problem... I.e. Do we need to just stop logging these messages at the INFO level? Or are we seeing unexpected connection issues that impact the application? If the latter case is true, it would be especially useful to see pcap data (i.e. from ksniff) that illustrates the sequence of packets causing problems.
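
For anyone gathering the requested capture, a minimal ksniff invocation might look like the following (pod, namespace, and container names are placeholders, and it assumes the kubectl-sniff plugin is installed, e.g. via krew):

# Capture packets from the application container, which shares the pod's
# network namespace with linkerd-proxy, and write them to a local pcap.
kubectl sniff my-app-pod -n my-namespace -c my-app-container -o capture.pcap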

isabey-cogni commented 3 years ago

@olix0r With regard to my comment, the only times this seems to have manifested as an application error is the os error 32.

[517488.739828s]  INFO ThreadId(01) inbound:accept{peer.addr=10.0.130.141:35686 target.addr=10.0.130.174:8080}: linkerd2_app_core::serve: Connection closed error=connection error: Broken pipe (os error 32)

Whenever those pop up they align with the application errors (also broken pipe).

creativearmenia commented 3 years ago

I am also seeing a lot of these kinds of logs, like @prasus, on versions 2.9.1 and 2.9.0:

Connection closed error=connection error: Broken pipe (os error 32)
Connection closed error=connection error: Transport endpoint is not connected (os error 107)

Downgrading to 2.8.1 makes the logs go away.

olix0r commented 3 years ago

Thanks, that's helpful.

But, as mentioned above, we'll probably need more details to identify the cause of this error. Ideally, we'd have a minimal application that we can deploy to reproduce the problem. If that's not possible, it would be helpful to know more about:

The broken pipe error just tells us that we're writing to a socket on which the reader has already closed. It's hard for us to know why this is occurring without more details about the application.

ipaleka commented 3 years ago

Hi @olix0r !

I suppose I've hit the same error using Zalando's postgres-operator (the spilo label above suggests that). If you have kind installed, or have a test cluster somewhere with a free default namespace, here are the steps to reproduce the error:

echo """# three node (two workers) cluster config
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
""" > kind-example-config.yaml

kind create cluster --config kind-example-config.yaml
linkerd install | kubectl apply -f -
kubectl wait --namespace linkerd --for=condition=ready pod --selector=linkerd.io/control-plane-component=controller --timeout=480s
kubectl get namespace default -o yaml | linkerd inject - | kubectl apply -f -

git clone https://github.com/zalando/postgres-operator.git
cd postgres-operator

kubectl create -f manifests/configmap.yaml
kubectl create -f manifests/operator-service-account-rbac.yaml
kubectl create -f manifests/postgres-operator.yaml
kubectl wait --for=condition=ready pod -l name=postgres-operator --timeout=480s
kubectl create -f manifests/minimal-postgres-manifest.yaml
kubectl wait --for=condition=ready pod -l statefulset.kubernetes.io/pod-name=acid-minimal-cluster-0 --timeout=480s

kubectl logs acid-minimal-cluster-0 -c postgres

And the output contains plenty of lines like this:

2021-01-13 21:33:15,559 ERROR: Request to server https://172.18.0.4:6443 failed: MaxRetryError("HTTPSConnectionPool(host='172.18.0.4', port=6443): Max retries exceeded with url: /api/v1/namespaces/default/endpoints?labelSelector=application%3Dspilo%2Ccluster-name%3Dacid-minimal-cluster (Caused by ProtocolError('Connection aborted.', OSError(0, 'Error')))",)

The same happens with the last edge too.

It won't happen with stable-2.8.1, but I suppose that's because of the mTLS functionality: injecting the namespace in that version won't automatically affect pods created afterwards.

Edit: forgot to add the repository cloning command...

olix0r commented 3 years ago

Thanks, @ipaleka. We'll try to take a look at this before the next stable release.

ipaleka commented 3 years ago

You're welcome! Please don't hesitate to ask if I can help with some other task, like those logs you mentioned, or with something else.

dayglojesus commented 3 years ago

We have also been seeing similar behaviour since upgrading from 2.8.1 to 2.9.2. None of our Services are "headless", but both HTTP(S) connections and connections to RDS backends are being severed.

# This is from an ALB to its target on ingress controller running on port 8443
# There are no corresponding errors from the ingress controller.
January 21st 2021, 20:06:56.892 linkerd-proxy   [174737.815747s]  INFO ThreadId(01) inbound:accept{peer.addr=10.170.21.27:23638 target.addr=10.170.24.213:8443}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
# This is from an app to a backend RDS
# Again, no errors in the app log, just the data plane
January 21st 2021, 20:33:31.391 | hubs-sit1 | linkerd-proxy | [  5375.428334s]  INFO ThreadId(01) outbound:accept{peer.addr=10.170.20.247:58974 target.addr=10.170.25.203:5432}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)

I don't know if this is new behaviour or simply more verbose logging. Whatever the case, if you can lend any advice on debugging, or if there's any more information I can provide to assist, let me know.

AdallomRoy commented 3 years ago

Also experiencing this, would love to get advice - Thanks

uqix commented 3 years ago

Experienced this error after injecting one service while others remained uninjected; restarting the gateway solved it. Maybe it is because of connection pooling?

linkerd2_app_core::serve: Connection closed error=error shutting down connection: Transport endpoint is not connected

Neru007 commented 3 years ago

Facing a similar issue. The Linkerd proxy is closing outbound connections from one of my services to another service. A few of the initial attempts succeed, but the connection gets closed after a while. Attached is the ksniff capture for reference: sniff.pcap.zip

The application is using the HTTP/1.1 protocol.

adleong commented 3 years ago

I'll just add that we've just released stable-2.9.3, which adds port 5432 (used by Amazon RDS and Postgres) to the default list of skipped ports. This might be a suitable workaround for some of you.
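
For clusters that cannot move to stable-2.9.3 yet, a similar effect can be approximated cluster-wide at upgrade time; a sketch, assuming the --skip-outbound-ports flag on the CLI:

# Adds 5432 to the list of ports that proxy-init excludes from iptables redirection.
linkerd upgrade --skip-outbound-ports 5432 | kubectl apply -f -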

adleong commented 3 years ago

For everyone experiencing this issue, it would be super helpful to know if you still see this type of error on a recent edge release! A lot has changed since stable-2.9.x and this information would really help us narrow things down.
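
For anyone able to test an edge release, pulling the edge CLI is enough to at least compare proxy behaviour; a sketch, assuming the standard edge install script location:

# Install the latest edge CLI locally and verify the cluster and data plane.
curl -sL https://run.linkerd.io/install-edge | sh
export PATH=$PATH:$HOME/.linkerd2/bin
linkerd version
linkerd check --proxy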

dayglojesus commented 3 years ago

@adleong fwiw stable-2.9.3 has substantially reduced these errors and the signal to noise ratio is vastly improved but we are still seeing Connection closed error=connection error: Transport endpoint is not connected (os error 107). There doesn't appear to be any pattern (no specific ports or services). Unfortunately, upgrading to an edge release isn't something we can do.

dayglojesus commented 3 years ago

I should add we are also seeing a number of linkerd2_app_core::serve: Connection closed error=TLS detection timeout errors. It looks like Pods are having trouble communicating with linkerd-destination.

olix0r commented 3 years ago

@dayglojesus are you able to try using a more recent proxy (i.e. without upgrading the control plane)? You can set the pod spec annotation config.linkerd.io/proxy-version: edge-21.2.2 -- this should at the very least give us some better error messages when problems do arise.
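
For clarity, a sketch of where that annotation would go on a workload (only the injected proxy image is pinned; the control plane is left untouched):

template:
  metadata:
    annotations:
      # Injects the newer proxy image into this workload only.
      config.linkerd.io/proxy-version: edge-21.2.2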

I should also note that, on the control-plane side, #5722 may help if we're indeed encountering slowness caused by the destination controller.

Today's edge should basically be a Release Candidate for stable-2.10; so, even if you're unable to completely upgrade to an edge release, it would be helpful to know whether these issues persist with recent proxies.

ipaleka commented 3 years ago

Today's edge should basically be a Release Candidate for stable-2.10; so, even if you're unable to completely upgrade to an edge release, it would be helpful to know whether these issues persist with recent proxies.

I've reproduced steps from this comment and it still produces the same errors. :(

$ linkerd version
Client version: edge-21.2.3
Server version: edge-21.2.3

ericsuhong commented 3 years ago

We have upgraded from Linkerd 2.8.1 to 2.9.4, and we have noticed an increased number of such errors coming from linkerd-proxy and potentially an increased number of 502 errors coming from the ingress controller.

For example, for an old pod that is still using 2.8.1, the linkerd-proxy logs are very clean:

[     0.37790647s]  INFO linkerd2_proxy: Admin interface on 0.0.0.0:4191
[     0.37819348s]  INFO linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[     0.37824848s]  INFO linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[     0.37828648s]  INFO linkerd2_proxy: Tap interface on 0.0.0.0:4190
[     0.37832148s]  INFO linkerd2_proxy: Local identity is default.reporting.serviceaccount.identity.linkerd.cluster.local
[     0.37836348s]  INFO linkerd2_proxy: Identity verified via linkerd-identity.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.37840948s]  INFO linkerd2_proxy: Destinations resolved via linkerd-dst.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.41201997s]  INFO linkerd2_app_inbound: Serving listen.addr=0.0.0.0:4143
[     0.138174002s]  INFO daemon:identity: linkerd2_app: Certified identity: default.reporting.serviceaccount.identity.linkerd.cluster.local
[     4.658855040s]  WARN inbound:accept{peer.addr=10.248.5.230:48790}:source{target.addr=10.248.6.15:80}: linkerd2_app_core::errors: Failed to proxy request: error trying to connect: Connection refused (os error 111)
[     9.658622853s]  WARN inbound:accept{peer.addr=10.248.5.230:48800}:source{target.addr=10.248.6.15:80}: linkerd2_app_core::errors: Failed to proxy request: error trying to connect: Connection refused (os error 111)
[    14.658934421s]  WARN inbound:accept{peer.addr=10.248.5.230:48818}:source{target.addr=10.248.6.15:80}: linkerd2_app_core::errors: Failed to proxy request: error trying to connect: Connection refused (os error 111)
[ 23193.473047846s]  WARN dst: linkerd2_reconnect::service: Service failed error=channel closed
[ 60493.839248768s]  WARN daemon:identity: linkerd2_reconnect::service: Service failed error=channel closed

However, for a new pod that is using 2.9.4, the linkerd-proxy logs show MANY such OS errors:

[  6533.244080s]  INFO ThreadId(01) outbound:accept{peer.addr=10.248.39.35:53580 target.addr=40.112.241.43:14601}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
[  6591.986113s]  INFO ThreadId(01) outbound:accept{peer.addr=10.248.39.35:56414 target.addr=40.112.241.43:14722}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
[  6652.857400s]  INFO ThreadId(01) outbound:accept{peer.addr=10.248.39.35:24518 target.addr=40.78.226.25:14048}: linkerd2_app_core::serve: Connection closed error=Transport endpoint is not connected (os error 107)
[  6661.137809s]  INFO ThreadId(01) outbound:accept{peer.addr=10.248.39.35:24265 target.addr=40.78.226.25:14399}: linkerd2_app_core::serve: Connection closed error=Transport endpoint is not connected (os error 107)
[  6713.008619s]  INFO ThreadId(01) outbound:accept{peer.addr=10.248.39.35:56244 target.addr=40.112.241.43:14601}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
[  6755.217364s]  INFO ThreadId(01) outbound:accept{peer.addr=10.248.39.35:23595 target.addr=40.78.226.2:14003}: linkerd2_app_core::serve: Connection closed error=Transport endpoint is not connected (os error 107)
[  6769.170203s]  INFO ThreadId(01) outbound:accept{peer.addr=10.248.39.35:44140 target.addr=40.112.241.43:14964}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
[  6814.695001s]  INFO ThreadId(01) outbound:accept{peer.addr=10.248.39.35:13320 target.addr=40.78.226.25:14304}: linkerd2_app_core::serve: Connection closed error=Transport endpoint is not connected (os error 107)
[  6834.563522s]  INFO ThreadId(01) outbound:accept{peer.addr=10.248.39.35:34642 target.addr=40.112.241.43:14722}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
[  6862.306509s]  INFO ThreadId(01) outbound:accept{peer.addr=10.248.39.35:23398 target.addr=40.78.226.25:14451}: linkerd2_app_core::serve: Connection closed error=Transport endpoint is not connected (os error 107)
[  6903.966484s]  INFO ThreadId(01) outbound:accept{peer.addr=10.248.39.35:24298 target.addr=52.226.18.140:14459}: linkerd2_app_core::serve: Connection closed error=Transport endpoint is not connected (os error 107)
[  7004.557986s]  INFO ThreadId(01) outbound:accept{peer.addr=10.248.39.35:24228 target.addr=40.78.226.2:14476}: linkerd2_app_core::serve: Connection closed error=Transport endpoint is not connected (os error 107)
[  7108.638437s]  INFO ThreadId(01) outbound:accept{peer.addr=10.248.39.35:24951 target.addr=40.78.226.25:14016}: linkerd2_app_core::serve: Connection closed error=Transport endpoint is not connected (os error 107)
[  7139.680743s]  INFO ThreadId(01) outbound:accept{peer.addr=10.248.39.35:41831 target.addr=40.78.226.25:14044}: linkerd2_app_core::serve: Connection closed error=Connection reset by peer (os error 104)
[  7239.795991s]  INFO ThreadId(01) outbound:accept{peer.addr=10.248.39.35:15016 target.addr=40.78.226.25:14011}: linkerd2_app_core::serve: Connection closed error=Transport endpoint is not connected (os error 107)
[  7322.434427s]  INFO ThreadId(01) outbound:accept{peer.addr=10.248.39.35:25192 target.addr=40.78.226.25:14408}: linkerd2_app_core::serve: Connection closed error=Transport endpoint is not connected (os error 107)

sebastian-schuth commented 3 years ago

We see the same problem in our installation. Upgrading to 2.9.3 seemed to fix the issue, but it has reappeared since then, after roughly 5 days. This might hint at some problem that only develops after running for a while(?)

It would be great to know if there were patches for this in 2.9.4 in order to judge if an update is worth trying.

olix0r commented 3 years ago

@ericsuhong you mention seeing 502 errors but none of the logs you share indicate HTTP-level errors. Is this traffic HTTP traffic? Or is this another protocol? 2.9 includes changes that reduce the amount of time that an idle connection may remain cached in the proxy, and so we may expect to see more connection closed messages... but without more details it's hard to know what to expect.

@sebastian-schuth I don't anticipate that 2.9.4 would have changed anything with regard to this.

In all of these cases, 'connection closed' messages are just an indication that a connection was closed. In each case we need to understand more about the traffic being proxied and the application state. If we moved these messages from INFO to DEBUG, what would we look at to better understand the problem? Do peer proxies have correlated errors?

I'd also recommend trying the most recent proxy version with config.linkerd.io/proxy-version: edge-21.2.4 -- at the very least, newer proxies should indicate more about which connection was actually closed (i.e. app-to-proxy server connections vs proxy-to-peer client connections).

olix0r commented 3 years ago

@ipaleka

I'm actually unable to reproduce the issue with your provided steps...

I varied my setup slightly, using a single-node k3d cluster and linkerd-2.9.4, but I can't imagine how those differences would matter to this issue.

I notice that the error you observe indicates a problem connecting to the Kubernetes API server (not Postgres traffic):

2021-01-13 21:33:15,559 ERROR: Request to server https://172.18.0.4:6443 failed: MaxRetryError("HTTPSConnectionPool(host='172.18.0.4', port=6443): Max retries exceeded with url: /api/v1/namespaces/default/endpoints?labelSelector=application%3Dspilo%2Ccluster-name%3Dacid-minimal-cluster (Caused by ProtocolError('Connection aborted.', OSError(0, 'Error')))",)

Might this be a startup ordering issue, where the Postgres operator tries to issue requests before the proxy has become ready?
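
If startup ordering is the suspect, one way to rule it out is to have the operator container wait for the local proxy to report ready before issuing requests; a hypothetical wrapper sketch, assuming the proxy's default admin port 4191 and its /ready endpoint:

#!/bin/sh
# Hypothetical entrypoint wrapper: block until the local linkerd-proxy
# answers on its readiness endpoint, then start the real process.
until curl -sf http://localhost:4191/ready > /dev/null; do
  sleep 1
done
exec "$@"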

In any case, if you want help looking into this, you should probably create a fresh issue scoped to exactly your problem.

In general, this specific issue has become a bit overloaded to catch what seem to be several different types of problems, all of which cause logging of 'connection closed' messages. We're eager to help folks debug/diagnose these problems, but I think it would be more productive to scope this work into new use-case-focused issues.

I'd like to close this issue in favor of new, more-focused issues; but before I do, I want to circle back to @KIVagant's much earlier comment:

Update 2: there are also many errors from HTTP communications. It looks like the verbosity of the proxy was changed, and we probably need to switch the verbosity to warnings/errors only.

I think for many of the people who have pinged this issue, the actual problem is that these messages are being logged? Would it be helpful if we moved these messages down to DEBUG to prevent them from showing up in your logs?

ericsuhong commented 3 years ago

@olix0r We have noticed an occurrence of 502 errors from linkerd after we upgraded to 2.9.4.

The frequency of 502s was very low, but consistent enough (1~2 times every hour). Compared to that, we almost NEVER had 502s in 2.8.1.

It seems like the 502 errors did not correlate with the "connection closed" log errors. But even though occurrences were very low, with so many "connection closed" messages we decided to roll back to 2.8.1.

emcay commented 3 years ago

Seeing this issue in 2.9.0 on v1.18.9-eks-d1db3c.

linkerd2_app_core::serve: Connection closed error=Transport endpoint is not connected (os error 107)

olix0r commented 3 years ago

@dmbeck-e2x @emcay Please see my comment above. It would be helpful to open new issues with details about the problem you are experiencing.

olix0r commented 3 years ago

@christianhuening mentioned to me that this may be related to node-level conntrack limits... would be interesting to see if there are any dmesg logs indicating the conntrack table is full when this is observed.
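
A quick way to check the conntrack hypothesis on an affected node (run directly on the node or from a privileged debug pod):

# Look for "nf_conntrack: table full, dropping packet" messages.
dmesg | grep -i conntrack
# Compare the current entry count against the configured maximum.
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max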

tyzbit commented 3 years ago

I have this issue and my dmesg output doesn't have many messages, and none that I see pertaining to networking.

michalpacholczyk commented 3 years ago

Hi!

Together with an upgrade of Kubernetes from 1.10 to 1.18, we decided to upgrade Linkerd as well, from 1.x to the latest stable. However, we encountered the issues described in this thread. We also tried out v2.8.1, but that turned out to be unacceptably slower than both the current system and the system with the latest Linkerd version. In all the figures presented, the only difference is the Linkerd version.

image

The same is observable at gRPC level:

image

Everything would otherwise be fine; however, when pods are restarted and load is applied, a massive amount of 502s is returned. This appears to be related to the number of exceptions from the linkerd-proxy containers... (note the time: the spike corresponds to the start of the Linkerd 2.9.4 test)

image

We also have a simple counter that counts the number of errors and their codes. The absolute value is not relevant, but notice the increase in the counter:

image

Is there any update on this issue? Or at least a way to decrease/flatten the response time with Linkerd 2.8.1?

vukhanhtruong commented 3 years ago

I'm having the same issue as @ericsuhong.

adleong commented 3 years ago

@michalpacholczyk Thanks for the detailed report! We have recently released Linkerd stable-2.10.0 and I'd recommend trying that version. If you still encounter problems, I'd recommend opening a new issue where we can focus on the specifics of the problems you're encountering.

adleong commented 3 years ago

@vukhanhtruong I'd recommend opening a new issue with as many details and specifics as possible. This will make tracking down the issue much easier.

adleong commented 3 years ago

Since there doesn't seem to be anything actionable here, I'm going to close this issue. I recommend that anyone who sees this type of behavior in Linkerd stable-2.10.0 open a new issue with as many details and specifics as possible.