Open fullykubed opened 1 month ago
It turns out the metrics endpoint on port 4191 is not supposed to be served behind TLS; only traffic intended for the main container is. You can verify this by looking at the logs of any linkerd-init container in an injected pod; in particular, you'll see the following iptables rule:
msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_REDIRECT -p tcp --match multiport --dports 4190,4191,4567,4568 -j RETURN -m comment --comment proxy-init/ignore-port-4190,4191,4567,4568"
which means inbound traffic to those ports (including 4191) is let through untouched and not forwarded to the proxy. Only traffic to the proxy is then wrapped in mTLS.
Port 4191 is the admin port for the sidecar proxy. I believe the rule you highlighted is intended to ensure that traffic to this port isn't forwarded via the proxy but rather handled directly by the proxy itself. That said, I don't think traffic to the admin port is supposed to be unencrypted.
For example, while I have highlighted some instances where it is not, the vast majority (~98%) of requests to the :4191/metrics endpoint are marked as tls=true.
query: sum(rate(request_total{direction="outbound", tls="true", target_addr=~".*4191"}[3h])) by (namespace, pod, target_addr, dst_namespace, tls, no_tls_reason, dst_service, dst_pod_template_hash) * 3 * 60 * 60 > 0
results
My bad, you're actually right: traffic to 4191 is supposed to be encrypted. As you can see, though, there are no rules enforcing that. no_tls_reason="not_provided_by_service_discovery" means that linkerd-destination wasn't able to provide an identity for that target, so the client falls back to a plain-text request. As you pointed out, this might happen during pod recycling, when there can be a transient inconsistency between the state observed by the destination controller and the prometheus client, but it should resolve eventually.
If it is important to encrypt this traffic for a specific application, could an AuthorizationPolicy (or similar) be created that would enforce that requirement and deny non-encrypted requests during these transient periods?
Thanks for the clarification. However, I do want to note that this doesn't appear to be a transient issue during startup. It continues to affect some pods for their entire lifetime.
Perhaps once the unauthenticated TCP connection is established, it is reused indefinitely?
While I am not sure what other endpoints the admin port exposes, it seems somewhat concerning that anything can access it without authentication or encryption. You would know better than I do about the implications here, but is there an easy way to completely disable all non-mTLS traffic to this port across the entire cluster?
You can change the default policy at the cluster level (via the option proxy.defaultInboundPolicy="all-authenticated") or at the namespace or workload level as explained in the docs. That will however deny all traffic to meshed pods from unmeshed pods.
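As a rough sketch of the two scopes mentioned above: the first document is a Helm values fragment using the proxy.defaultInboundPolicy key quoted in the comment, and the second sets the same policy for a single namespace through the config.linkerd.io/default-inbound-policy annotation. The emojivoto namespace is only an illustrative choice, and as far as I know the default policy is read when the proxy is injected, so existing pods would likely need a restart to pick it up.

```yaml
# Cluster-wide default, passed as Helm values when installing or upgrading
# the Linkerd control plane (key taken from the comment above).
proxy:
  defaultInboundPolicy: all-authenticated
---
# Namespace-level default. "emojivoto" is an illustrative namespace name.
apiVersion: v1
kind: Namespace
metadata:
  name: emojivoto
  annotations:
    config.linkerd.io/default-inbound-policy: all-authenticated
```

Either form carries the trade-off already noted: unmeshed clients lose access to every port of the affected meshed pods, not just 4191.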
To specifically deny traffic to the metrics endpoint you could set up a Server resource for the linkerd-admin port (4191) with an empty podSelector so that all pods in the namespace are selected. You'd have to deploy one of these per namespace:
apiVersion: policy.linkerd.io/v1beta2
kind: Server
metadata:
  namespace: emojivoto
  name: metrics
spec:
  podSelector: {}
  port: linkerd-admin
  proxyProtocol: HTTP/1
and then an AuthorizationPolicy (also one per namespace) that would grant access only to the prometheus ServiceAccount (adjust SA and namespace according to your case):
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  namespace: emojivoto
  name: web-metrics
  labels:
    linkerd.io/extension: viz
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: metrics
  requiredAuthenticationRefs:
    - kind: ServiceAccount
      name: prometheus
      namespace: linkerd-viz
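If the goal is the broader one raised earlier in the thread (allow any meshed client on the admin port, but reject plain-text requests), a possible variant is to authorize every mesh identity instead of a single ServiceAccount. The following is only a sketch under assumptions: it reuses the Server named metrics from above, the resource names any-meshed-client and metrics-meshed-only are made up for illustration, and it relies on the "*" wildcard in MeshTLSAuthentication matching all authenticated identities. It has not been tested against this cluster, and its effect on other users of the admin port should be checked before rolling it out.

```yaml
# Matches any client that presents a valid mesh identity (mTLS).
# The name "any-meshed-client" is hypothetical.
apiVersion: policy.linkerd.io/v1alpha1
kind: MeshTLSAuthentication
metadata:
  namespace: emojivoto
  name: any-meshed-client
spec:
  identities:
    - "*"
---
# Grants access to the "metrics" Server only to clients matched above.
# The name "metrics-meshed-only" is hypothetical.
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  namespace: emojivoto
  name: metrics-meshed-only
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: metrics
  requiredAuthenticationRefs:
    - group: policy.linkerd.io
      kind: MeshTLSAuthentication
      name: any-meshed-client
```

Like the ServiceAccount-based policy, this would need to be deployed once per namespace.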
What is the issue?
When checking my Linkerd metrics to ensure that all cluster traffic is encrypted as expected, I found that communication with the Linkerd2 proxies' metrics endpoint sometimes happens without encryption.
There does not appear to be a discernible pattern:
How can it be reproduced?
Install Linkerd via Helm chart with the following settings:
sum(rate(request_total{direction="outbound", tls!="true", target_addr=~".*4191"}[5m])) by (namespace, pod, target_addr, dst_namespace, no_tls_reason, dst_service, dst_pod_template_hash) * 5 * 60 > 0
Logs, error output, etc
Metrics from Grafana
Logs of prometheus (grepped for the string 4191):
Logs from the prometheus sidecar:
Note the following:
The target_addr shown in the metrics does not appear in the logs, even though everything was captured concurrently.
Output of linkerd check -o short:
Environment
1.29 on EKS
2024.5.1
Possible solution
No response
Additional context
The scraping itself completes successfully with no errors.
Would you like to work on fixing this bug?
None