deusxanima opened this issue 1 year ago (status: Open)
Hi @deusxanima.
The Linkerd gateway is designed to only accept connections from meshed sources, so this is the behavior I would expect if unmeshed workloads in kube-system are trying to connect to it. It would be good to understand what these kube-system resources are doing and why they are attempting to initiate connections to the Linkerd gateway. I think understanding that is important before we decide if it's reasonable to special case our logging in any way.
Weighing in here. The environment is EKS and AKS, but where are we seeing these logs? On the EKS side, the AKS side, or both? One of our users reported seeing this in AKS: for LoadBalancer-type Service objects, by default some TCP probes are apparently configured that target port 4143 (inbound). If the probes correctly targeted the admin endpoint instead, I suspect this wouldn't be an issue. It seems to be a very similar issue. The full conversation is in https://github.com/linkerd/linkerd2/discussions/8148, but for convenience I've quoted the original statement:
After experiencing this on AKS too, we tracked it down to the load balancer's health probe.
In short, this is not a problem; however, the logs are noisy and can hide other issues.
The issue seems to be with the proxy's data plane port (4143 by default), which expects mTLS traffic when addressed directly.
From the load balancer's TCP probe perspective, a TCP connection that is accepted and then closed remotely is considered a success. From the linkerd-proxy's perspective, however, a non-mTLS connection is invalid, so it closes the connection and produces the log line.
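For example, the same log line can be triggered manually by opening a plain (non-mTLS) connection to the gateway's data plane port, much like the probe does; something like this (untested, and assuming the default linkerd-gateway service name in the linkerd-multicluster namespace):

```sh
# From an unmeshed pod, open a plain TCP connection to port 4143 and close it.
# The gateway proxy should then log "direct connections must be mutually
# authenticated" for this client address.
kubectl run plain-probe --rm -it --restart=Never --image=busybox -- \
  sh -c 'echo probe | nc -w 2 linkerd-gateway.linkerd-multicluster.svc.cluster.local 4143'
```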
We had success using some annotations on the linkerd-gateway service to indicate that the probe for the data plane port should use the admin port's /ready endpoint.
In the helm values for the multicluster chart this would look something like (untested):
gateway:
  serviceAnnotations:
    "service.beta.kubernetes.io/port_4191_health-probe_protocol": "Http"
    "service.beta.kubernetes.io/port_4191_health-probe_request-path": "/ready"
    "service.beta.kubernetes.io/port_4143_health-probe_protocol": "Http"
    "service.beta.kubernetes.io/port_4143_health-probe_port": "4191"
    "service.beta.kubernetes.io/port_4143_health-probe_request-path": "/ready"
See https://cloud-provider-azure.sigs.k8s.io/topics/loadbalancer/#custom-load-balancer-health-probe-for-port for docs on what these annotations do.
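If it helps anyone else, applying these through Helm could look roughly like this (untested; assumes the values above are saved to a local file and that the Linkerd Helm repo is already added under the alias linkerd):

```sh
# azure-probe-values.yaml is a hypothetical file name containing the
# gateway.serviceAnnotations values shown above.
helm upgrade --install linkerd-multicluster linkerd/linkerd-multicluster \
  --namespace linkerd-multicluster \
  -f azure-probe-values.yaml
```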
@mateiidavid We sunk a few hours into this, and I suppose other AKS folks have too. Would it make sense to make these annotations a default, or maybe add this to the docs somewhere? Maybe it's time to promote this from a discussion to an issue?
@mateiidavid, I got clarification that the user seeing this in EKS was hitting a separate issue, so I've edited the original description to cover only AKS. I've made note of the recommended annotations to quiet down the logs and will work with affected users to apply them in the interim. I do wonder whether it would make sense to make these annotations the default for multicluster gateway deployments in order to avoid the log noise, but I'll leave that decision to the maintainer team.
Encountered this issue again while troubleshooting a few multicluster gateway liveness problems. It made it very difficult to separate the noise from the actual errors, and it required back-and-forth with the user to gather info and match the logs against kube-system IPs. It would be great to have the discussed annotations become the default in the charts going forward, to help new users set up the multicluster feature without overwhelming them with potentially unnecessary logs.
These log lines are useful for debugging, but are generally not actionable. The preferred solution is to emit them at the DEBUG log level rather than INFO.
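As an interim workaround, rather than changing the global level, the filter can probably be narrowed to just the module that emits this message (linkerd_app_core::serve, per the reported log line) via the usual proxy log-level annotation on the gateway's pod template; a sketch, untested against the gateway specifically:

```yaml
# Keeps the default filter but raises the threshold for the module that emits
# the "Connection closed" message. Tradeoff: this also hides other messages
# from that module.
config.linkerd.io/proxy-log-level: "warn,linkerd=info,linkerd_app_core::serve=error"
```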
For reference, here are the annotations that were working on our EKS setup:
service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: "HTTP"
service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: "/ready"
service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "4191"
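If set through the multicluster chart, these would presumably sit under the same gateway.serviceAnnotations key as in the Azure example above (untested):

```yaml
gateway:
  serviceAnnotations:
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: "HTTP"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: "/ready"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "4191"
```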
Can I work on this issue? Thanks!
What is the issue?
After setting up a multicluster gateway on EKS/AKS/public cloud, the linkerd-gateway pod logs begin to fill with
[11289622.527652s] INFO ThreadId(01) inbound: linkerd_app_core::serve: Connection closed error=direct connections must be mutually authenticated error.sources=[direct connections must be mutually authenticated] client.addr=xx.xx.xx.xx:yyyy
messages, which seem to come in pairs and at regular intervals. Cross-referencing the IPs with those in the cluster shows them belonging to kube-system resources.
How can it be reproduced?
Logs, error output, etc
Linkerd Proxy Logs:
Search for 10.240.0.4:
Output of linkerd check -o short
n/a
Environment
AKS
Possible solution
Either exempt the kube-system probes from logs to avoid spam, or wrap up kube-system probe logs into a clean one-line summary
Additional context
No response
Would you like to work on fixing this bug?
None