linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

Missing client_id label on inbound requests to destination controller port 8090 #7861

Open klingerf opened 2 years ago

klingerf commented 2 years ago

What is the issue?

I'm not seeing a client_id label on any of the response_total stats that are exported by the inbound proxy of the linkerd-destination pod when the target port is 8090 (policy), but I am seeing that label set when the target port is 8086 (destination).

It's a bit easier to illustrate with a comparison of these two promql queries:

[screenshot: side-by-side results of the two PromQL queries, grouped by target_port and client_id]
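The screenshot is not reproduced here, but based on the reproduction steps below, the comparison was presumably along these lines (the exact label selectors are a guess):

```promql
sum(response_total{direction="inbound", deployment="linkerd-destination", target_port="8086"}) by (client_id)
sum(response_total{direction="inbound", deployment="linkerd-destination", target_port="8090"}) by (client_id)
```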

Maybe this is intentional? Without a client_id label set for requests to 8090, however, we can't dedupe traffic to that port.

How can it be reproduced?

Install linkerd and linkerd-viz, then:

kubectl -n linkerd-viz port-forward svc/prometheus 9090

Visit http://localhost:9090 and run the following query:

sum(response_total{direction="inbound", deployment="linkerd-destination"}) by (target_port, client_id)

Logs, error output, etc

See above

output of linkerd check -o short

Linkerd core checks
===================

linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2022-03-10T08:39:24Z
    see https://linkerd.io/2/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints

Linkerd extensions checks
=========================

linkerd-multicluster
--------------------
× service mirror controller has required permissions
    missing ServiceAccounts: linkerd-service-mirror-aks-viz
    missing ClusterRoles: linkerd-service-mirror-access-local-resources-aks-viz
    missing ClusterRoleBindings: linkerd-service-mirror-access-local-resources-aks-viz
    missing Roles: linkerd-service-mirror-read-remote-creds-aks-viz
    missing RoleBindings: linkerd-service-mirror-read-remote-creds-aks-viz
    see https://linkerd.io/2/checks/#l5d-multicluster-source-rbac-correct for hints
× service mirror controllers are running
            * no service mirror controller deployment for Link aks-viz
    see https://linkerd.io/2/checks/#l5d-multicluster-service-mirror-running for hints

Status check results are ×

Environment

$ linkerd version --short        
edge-22.2.1
edge-22.2.1
$ kubectl version --short
Client Version: v1.22.5
Server Version: v1.21.2

Possible solution

This might be working as expected, in which case we can close it.

Additional context

No response

Would you like to work on fixing this bug?

no

olix0r commented 2 years ago

Spot-checking this locally:

:; linkerd diagnostics proxy-metrics -n linkerd po/linkerd-destination-774dbddb7f-q7wnz  | grep -e ^response_total
response_total{direction="inbound",authority="linkerd-dst-headless.linkerd.svc.cluster.local:8086",target_addr="10.42.3.12:8086",target_ip="10.42.3.12",target_port="8086",tls="true",client_id="prometheus.linkerd-viz.serviceaccount.identity.linkerd.cluster.local",srv_name="default:all-unauthenticated",saz_name="default:all-unauthenticated",status_code="200",classification="success"} 7
response_total{direction="inbound",target_addr="0.0.0.0:4191",target_ip="0.0.0.0",target_port="4191",tls="true",client_id="prometheus.linkerd-viz.serviceaccount.identity.linkerd.cluster.local",srv_name="default:all-unauthenticated",saz_name="default:all-unauthenticated",status_code="200",classification="success"} 1391
response_total{direction="inbound",target_addr="10.42.3.12:9990",target_ip="10.42.3.12",target_port="9990",tls="no_identity",no_tls_reason="no_tls_from_remote",srv_name="default:all-unauthenticated",saz_name="default:all-unauthenticated",status_code="200",classification="success"} 6861
response_total{direction="inbound",target_addr="10.42.3.12:9996",target_ip="10.42.3.12",target_port="9996",tls="no_identity",no_tls_reason="no_tls_from_remote",srv_name="default:all-unauthenticated",saz_name="default:all-unauthenticated",status_code="200",classification="success"} 6860
response_total{direction="inbound",target_addr="10.42.3.12:9997",target_ip="10.42.3.12",target_port="9997",tls="true",client_id="prometheus.linkerd-viz.serviceaccount.identity.linkerd.cluster.local",srv_name="default:all-unauthenticated",saz_name="default:all-unauthenticated",status_code="200",classification="success"} 1392
response_total{direction="inbound",target_addr="0.0.0.0:4191",target_ip="0.0.0.0",target_port="4191",tls="no_identity",no_tls_reason="no_tls_from_remote",srv_name="default:all-unauthenticated",saz_name="default:all-unauthenticated",status_code="200",classification="success"} 6871
response_total{direction="inbound",target_addr="0.0.0.0:4191",target_ip="0.0.0.0",target_port="4191",tls="no_identity",no_tls_reason="no_tls_from_remote",srv_name="default:all-unauthenticated",saz_name="default:all-unauthenticated",status_code="503",classification="failure"} 1
response_total{direction="inbound",target_addr="10.42.3.12:9997",target_ip="10.42.3.12",target_port="9997",tls="no_identity",no_tls_reason="no_tls_from_remote",srv_name="default:all-unauthenticated",saz_name="default:all-unauthenticated",status_code="200",classification="success"} 6861
response_total{direction="inbound",target_addr="10.42.3.12:9996",target_ip="10.42.3.12",target_port="9996",tls="true",client_id="prometheus.linkerd-viz.serviceaccount.identity.linkerd.cluster.local",srv_name="default:all-unauthenticated",saz_name="default:all-unauthenticated",status_code="200",classification="success"} 1392

We don't actually see any response_total metrics for any of the controller ports, presumably because all of these requests are long-lived streams, so the responses never complete. (Edit: we see one for prometheus, explained below)

Why are you seeing response_total metrics for 8086 and not 8090? My guess is that destination queries can actually complete when the proxy drops the stack for a given service (i.e., when the service is evicted from the proxy's cache), whereas policy streams are never dropped until the client proxy shuts down.
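Given that, if the goal is deduping traffic to 8090, request_total may be a better basis than response_total, since it is incremented when a request starts rather than when a (possibly never-ending) stream completes. Roughly (label selectors are illustrative):

```promql
sum(request_total{direction="inbound", deployment="linkerd-destination", target_port="8090"}) by (client_id)
```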

If we look at the request_total metrics, we see the connections we'd expect:

:; linkerd diagnostics proxy-metrics -n linkerd po/linkerd-destination-774dbddb7f-q7wnz  | grep -e ^request_total                   
request_total{direction="inbound",authority="linkerd-dst-headless.linkerd.svc.cluster.local:8086",target_addr="10.42.3.12:8086",target_ip="10.42.3.12",target_port="8086",tls="true",client_id="prometheus.linkerd-viz.serviceaccount.identity.linkerd.cluster.local",srv_name="default:all-unauthenticated",saz_name="default:all-unauthenticated"} 8
request_total{direction="inbound",target_addr="0.0.0.0:4191",target_ip="0.0.0.0",target_port="4191",tls="true",client_id="prometheus.linkerd-viz.serviceaccount.identity.linkerd.cluster.local",srv_name="default:all-unauthenticated",saz_name="default:all-unauthenticated"} 1391
request_total{direction="inbound",target_addr="10.42.3.12:9990",target_ip="10.42.3.12",target_port="9990",tls="no_identity",no_tls_reason="no_tls_from_remote",srv_name="default:all-unauthenticated",saz_name="default:all-unauthenticated"} 6861
request_total{direction="inbound",target_addr="10.42.3.12:9996",target_ip="10.42.3.12",target_port="9996",tls="no_identity",no_tls_reason="no_tls_from_remote",srv_name="default:all-unauthenticated",saz_name="default:all-unauthenticated"} 6860
request_total{direction="inbound",authority="linkerd-dst-headless.linkerd.svc.cluster.local:8086",target_addr="10.42.3.12:8086",target_ip="10.42.3.12",target_port="8086",tls="true",client_id="linkerd-proxy-injector.linkerd.serviceaccount.identity.linkerd.cluster.local",srv_name="default:all-unauthenticated",saz_name="default:all-unauthenticated"} 3
request_total{direction="inbound",target_addr="10.42.3.12:9997",target_ip="10.42.3.12",target_port="9997",tls="true",client_id="prometheus.linkerd-viz.serviceaccount.identity.linkerd.cluster.local",srv_name="default:all-unauthenticated",saz_name="default:all-unauthenticated"} 1391
request_total{direction="inbound",target_addr="0.0.0.0:4191",target_ip="0.0.0.0",target_port="4191",tls="no_identity",no_tls_reason="no_tls_from_remote",srv_name="default:all-unauthenticated",saz_name="default:all-unauthenticated"} 6870
request_total{direction="inbound",authority="linkerd-dst-headless.linkerd.svc.cluster.local:8086",target_addr="10.42.3.12:8086",target_ip="10.42.3.12",target_port="8086",tls="true",client_id="tap.linkerd-viz.serviceaccount.identity.linkerd.cluster.local",srv_name="default:all-unauthenticated",saz_name="default:all-unauthenticated"} 1
request_total{direction="inbound",authority="linkerd-policy.linkerd.svc.cluster.local:8090",target_addr="10.42.3.12:8090",target_ip="10.42.3.12",target_port="8090",tls="true",client_id="linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local",srv_name="default:all-unauthenticated",saz_name="default:all-unauthenticated"} 7
request_total{direction="inbound",authority="linkerd-policy.linkerd.svc.cluster.local:8090",target_addr="10.42.3.12:8090",target_ip="10.42.3.12",target_port="8090",tls="true",client_id="",srv_name="default:all-unauthenticated",saz_name="default:all-unauthenticated"} 16
request_total{direction="inbound",target_addr="10.42.3.12:9997",target_ip="10.42.3.12",target_port="9997",tls="no_identity",no_tls_reason="no_tls_from_remote",srv_name="default:all-unauthenticated",saz_name="default:all-unauthenticated"} 6860
request_total{direction="inbound",authority="linkerd-dst-headless.linkerd.svc.cluster.local:8086",target_addr="10.42.3.12:8086",target_ip="10.42.3.12",target_port="8086",tls="true",client_id="tap-injector.linkerd-viz.serviceaccount.identity.linkerd.cluster.local",srv_name="default:all-unauthenticated",saz_name="default:all-unauthenticated"} 1
request_total{direction="inbound",target_addr="10.42.3.12:9996",target_ip="10.42.3.12",target_port="9996",tls="true",client_id="prometheus.linkerd-viz.serviceaccount.identity.linkerd.cluster.local",srv_name="default:all-unauthenticated",saz_name="default:all-unauthenticated"} 1391

That mostly looks like I'd expect, with one exception:

request_total{direction="inbound",authority="linkerd-policy.linkerd.svc.cluster.local:8090",target_addr="10.42.3.12:8090",target_ip="10.42.3.12",target_port="8090",tls="true",client_id="",srv_name="default:all-unauthenticated",saz_name="default:all-unauthenticated"} 16

This claims there's a policy lookup from a pod that doesn't have a client identity. Perhaps this is the identity controller starting up? We'll probably want to look more closely at it.
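Entries like this are easy to miss in the raw dump. A rough sketch (not part of Linkerd; names are hypothetical) of aggregating the text-format output of `linkerd diagnostics proxy-metrics` by (target_port, client_id), which would surface the empty-client_id series on 8090:

```python
import re
from collections import defaultdict

# Match a request_total sample in Prometheus text format, e.g.
# request_total{direction="inbound",target_port="8090",client_id=""} 16
METRIC_RE = re.compile(r'^request_total\{(?P<labels>[^}]*)\}\s+(?P<value>\d+)')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def aggregate_request_total(lines):
    """Sum request_total samples by (target_port, client_id)."""
    totals = defaultdict(int)
    for line in lines:
        m = METRIC_RE.match(line)
        if not m:
            continue
        labels = dict(LABEL_RE.findall(m.group('labels')))
        key = (labels.get('target_port', ''), labels.get('client_id', ''))
        totals[key] += int(m.group('value'))
    return dict(totals)
```

Feeding it the dump above would show the `("8090", "")` series alongside the identified clients, making the unauthenticated lookups stand out.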

olix0r commented 2 years ago

Thinking about this a bit more, I expect there is some sort of race here during startup: in general, a proxy may start watching policy before it has provisioned its certificate. I'm not sure that's really a problem, exactly, from Linkerd's perspective, though...

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
