linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

Prometheus metrics scrapes of `linkerd-proxy` are not TLS-protected (occasionally) #12634

Open fullykubed opened 1 month ago

fullykubed commented 1 month ago

What is the issue?

When checking my Linkerd metrics to confirm that all cluster traffic is encrypted as expected, it appears that communication with the Linkerd2 proxy's metrics endpoint sometimes happens without encryption.

There does not appear to be a discernible pattern.

How can it be reproduced?

  1. Install Prometheus via prometheus-operator (not via the Linkerd Helm charts)
  2. Install Linkerd via its Helm chart with the following settings (see the YAML sketch after this list):

      proxy = {
        nativeSidecar = true
      }
      podMonitor = {
        enabled = true
        scrapeInterval = "60s"
        proxy = {
          enabled = true
        }
        controller = {
          enabled = true
        }
      }
  3. Ensure all pods have the Linkerd sidecar running
  4. Run the following query:

      sum(rate(request_total{direction="outbound", tls!="true", target_addr=~".*4191"}[5m])) by (namespace, pod, target_addr, dst_namespace, no_tls_reason, dst_service, dst_pod_template_hash) * 5 * 60 > 0
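
For reference, the settings from step 2 expressed as plain Helm values YAML (a sketch only, mirroring the block above; it assumes the linkerd-control-plane chart and is not an authoritative values file):

proxy:
  nativeSidecar: true
podMonitor:
  enabled: true
  scrapeInterval: "60s"
  proxy:
    enabled: true
  controller:
    enabled: true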

Logs, error output, etc

Metrics from Grafana

[screenshot]

Logs from Prometheus (grepped for the string 4191):

{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 502 Bad Gateway","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.168.64:4191/metrics","ts":"2024-05-21T19:56:51.693Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.108.253:4191/metrics","ts":"2024-05-21T19:56:52.833Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 502 Bad Gateway","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.168.77:4191/metrics","ts":"2024-05-21T19:56:57.830Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 502 Bad Gateway","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.168.66:4191/metrics","ts":"2024-05-21T19:57:03.366Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.107.169:4191/metrics","ts":"2024-05-21T19:58:08.781Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.107.164:4191/metrics","ts":"2024-05-21T19:58:21.529Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.107.168:4191/metrics","ts":"2024-05-21T19:58:28.870Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 502 Bad Gateway","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.109.231:4191/metrics","ts":"2024-05-21T20:01:16.804Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 502 Bad Gateway","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.109.225:4191/metrics","ts":"2024-05-21T20:01:23.087Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.121.80:4191/metrics","ts":"2024-05-21T20:01:54.968Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.108.130:4191/metrics","ts":"2024-05-21T20:03:02.335Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 502 Bad Gateway","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.186.25:4191/metrics","ts":"2024-05-21T20:03:36.445Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.186.16:4191/metrics","ts":"2024-05-21T20:03:58.239Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.186.17:4191/metrics","ts":"2024-05-21T20:04:21.187Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 502 Bad Gateway","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.178.20:4191/metrics","ts":"2024-05-21T20:04:43.453Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.178.19:4191/metrics","ts":"2024-05-21T20:05:03.208Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.104.233:4191/metrics","ts":"2024-05-21T20:08:55.733Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.108.136:4191/metrics","ts":"2024-05-21T20:09:24.386Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.176.185:4191/metrics","ts":"2024-05-21T20:09:37.523Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.108.240:4191/metrics","ts":"2024-05-21T20:10:27.292Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.104.239:4191/metrics","ts":"2024-05-21T20:23:59.602Z"}

Logs from the Prometheus pod's linkerd-proxy sidecar:

{"timestamp":"[     0.007020s]","level":"INFO","fields":{"message":"Admin interface on [::]:4191"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}
{"timestamp":"[    85.976336s]","level":"INFO","fields":{"message":"HTTP/1.1 request failed","error":"endpoint 10.0.240.149:4191: error trying to connect: Connection refused (os error 111)"},"target":"linkerd_app_core::errors::respond","spans":[{"name":"outbound"},{"addr":"10.0.240.149:4191","name":"proxy"},{"addr":"10.0.240.149:4191","name":"forward"},{"client.addr":"10.0.165.230:35584","name":"rescue"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   107.893526s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   108.003185s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   108.223752s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   108.659528s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   109.161250s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   109.663315s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   110.165180s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   110.668094s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   110.893858s]","level":"WARN","fields":{"message":"Service entering failfast after 3s"},"target":"linkerd_stack::failfast","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   110.894000s]","level":"INFO","fields":{"message":"HTTP/1.1 request failed","error":"logical service 10.0.240.147:4191: route default.endpoint: backend default.unknown: service in fail-fast"},"target":"linkerd_app_core::errors::respond","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"client.addr":"10.0.165.230:51094","name":"rescue"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   111.170139s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   111.671865s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   112.174022s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   112.676552s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   113.178600s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   113.680637s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   114.182241s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   114.685151s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   115.186904s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   115.689632s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}

Note the following:

Output of `linkerd check -o short`:

linkerd-version
---------------
‼ cli is up-to-date
    is running version 24.5.1 but the latest edge version is 24.5.3
    see https://linkerd.io/2/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 24.5.1 but the latest edge version is 24.5.3
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
    * linkerd-destination-5d77bff859-625dv (edge-24.5.1)
    * linkerd-destination-5d77bff859-ppx9g (edge-24.5.1)
    * linkerd-identity-77445c7bf4-rrmk8 (edge-24.5.1)
    * linkerd-identity-77445c7bf4-zd6bz (edge-24.5.1)
    * linkerd-proxy-injector-844bcc688-wht62 (edge-24.5.1)
    * linkerd-proxy-injector-844bcc688-xdgjv (edge-24.5.1)
    * metrics-api-686fdb9cd5-wsh69 (edge-24.5.1)
    * tap-5f69747c7c-rd7lh (edge-24.5.1)
    * tap-5f69747c7c-v4rsg (edge-24.5.1)
    * tap-injector-6b4d546c8c-bc6kt (edge-24.5.1)
    * tap-injector-6b4d546c8c-j6lx7 (edge-24.5.1)
    * web-684f5c88cc-cfrt5 (edge-24.5.1)
    * web-684f5c88cc-xz7n7 (edge-24.5.1)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints

linkerd-viz
-----------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
    * linkerd-destination-5d77bff859-625dv (edge-24.5.1)
    * linkerd-destination-5d77bff859-ppx9g (edge-24.5.1)
    * linkerd-identity-77445c7bf4-rrmk8 (edge-24.5.1)
    * linkerd-identity-77445c7bf4-zd6bz (edge-24.5.1)
    * linkerd-proxy-injector-844bcc688-wht62 (edge-24.5.1)
    * linkerd-proxy-injector-844bcc688-xdgjv (edge-24.5.1)
    * metrics-api-686fdb9cd5-wsh69 (edge-24.5.1)
    * tap-5f69747c7c-rd7lh (edge-24.5.1)
    * tap-5f69747c7c-v4rsg (edge-24.5.1)
    * tap-injector-6b4d546c8c-bc6kt (edge-24.5.1)
    * tap-injector-6b4d546c8c-j6lx7 (edge-24.5.1)
    * web-684f5c88cc-cfrt5 (edge-24.5.1)
    * web-684f5c88cc-xz7n7 (edge-24.5.1)
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints
‼ prometheus is installed and configured correctly
    missing ClusterRoles: linkerd-linkerd-prometheus
    see https://linkerd.io/2/checks/#l5d-viz-prometheus for hints

Status check results are √

Environment

Possible solution

No response

Additional context

The scraping itself completes successfully with no errors.

Would you like to work on fixing this bug?

None

alpeb commented 1 month ago

Turns out the metrics endpoint on port 4191 is not supposed to be served behind TLS; only traffic intended for the main container is. You can verify this by looking at the logs of any linkerd-init container in an injected pod; in particular, you'll see the following iptables rule:

msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_REDIRECT -p tcp --match multiport --dports 4190,4191,4567,4568 -j RETURN -m comment --comment proxy-init/ignore-port-4190,4191,4567,4568"

which means inbound traffic to those ports (including 4191) is let through untouched and not forwarded to the proxy. Only traffic that is redirected to the proxy gets wrapped in mTLS.
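
For example, a quick way to find this rule (a sketch; the namespace and pod name are placeholders) is to grep the init container's log of any injected pod:

kubectl logs -n <namespace> <pod-name> -c linkerd-init | grep ignore-port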

fullykubed commented 1 month ago

Port 4191 is the admin port for the sidecar proxy. I believe the rule you highlighted is intended to ensure that traffic to this port isn't forwarded via the proxy but rather handled directly by the proxy itself. That said, I don't think traffic to the admin port is supposed to be unencrypted.

For example, while I have highlighted some instances where it is not, the vast majority (~98%) of requests to the :4191/metrics endpoint are marked as tls=true.

query: sum(rate(request_total{direction="outbound", tls="true", target_addr=~".*4191"}[3h])) by (namespace, pod, target_addr, dst_namespace, tls, no_tls_reason, dst_service, dst_pod_template_hash) * 3 * 60 * 60 > 0

Results: [screenshot]

alpeb commented 1 month ago

My bad, you're actually right: traffic to 4191 is supposed to be encrypted. There are no rules enforcing that, though, as you can see. no_tls_reason="not_provided_by_service_discovery" means that linkerd-destination wasn't able to provide an identity for that target, so the client falls back to a plain-text request. As you pointed out, this can happen during pod recycling, when there can be a transient inconsistency between the state observed by the destination controller and the Prometheus client, but it should resolve eventually.
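
One way to check whether this is the only reason in play (a sketch, adapted from the query in the issue description) is to group the unencrypted scrape traffic by no_tls_reason:

sum(rate(request_total{direction="outbound", tls!="true", target_addr=~".*4191"}[5m])) by (no_tls_reason) > 0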

wmorgan commented 1 month ago

If it is important to encrypt this traffic for a specific application, could an AuthorizationPolicy (etc.) be created that would enforce that requirement and deny non-encrypted requests during these transient periods?

fullykubed commented 1 month ago

Thanks for the clarification. However, I do want to note that this doesn't appear to be a transient issue during startup. It continues to affect some pods for their entire lifetime.

Perhaps once the unauthenticated TCP connection is established, it is reused indefinitely?

While I am not sure what other endpoints the admin port exposes, it seems somewhat concerning that anything can access it without authentication or encryption. You would know better than I do about the implications here, but is there an easy way to completely disable all non-mTLS traffic to this port across the entire cluster?

alpeb commented 1 month ago

You can change the default policy at the cluster level (via the option proxy.defaultInboundPolicy="all-authenticated") or at the namespace or workload level, as explained in the docs. That will, however, deny all traffic from unmeshed pods to meshed pods.
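
At the namespace level, for example, this can be done with the config.linkerd.io/default-inbound-policy annotation (a sketch; emojivoto is just a placeholder namespace):

kubectl annotate namespace emojivoto config.linkerd.io/default-inbound-policy=all-authenticated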

To specifically restrict traffic to the metrics endpoint, you could set up a Server resource for the linkerd-admin port (4191) with an empty podSelector so that all pods in the namespace are selected. You'd have to deploy one of these per namespace:

apiVersion: policy.linkerd.io/v1beta2
kind: Server
metadata:
  namespace: emojivoto
  name: metrics
spec:
  podSelector: {}
  port: linkerd-admin
  proxyProtocol: HTTP/1

and then an AuthorizationPolicy (also one per namespace) that would grant access only to the prometheus ServiceAccount (adjust SA and namespace according to your case):

apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  namespace: emojivoto
  name: web-metrics
  labels:
    linkerd.io/extension: viz
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: metrics
  requiredAuthenticationRefs:
    - kind: ServiceAccount
      name: prometheus
      namespace: linkerd-viz
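
Once both resources are applied, one way to sanity-check the result (a sketch; the resource names are placeholders, and this assumes the viz extension is installed) is to list the authorization policies that now cover a workload:

linkerd viz authz -n emojivoto deploy/web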